Couchbase Python Client Library / PYCBC-920

Twisted API seeing lots of timeouts


    Description

      For some performance tests, the Python SDK's Twisted API is used. After upgrading perfrunner to the Twisted API of the 3.0.0 Python SDK, I am seeing lots of timeouts from LCB. If you scroll down in this console log you can see the errors: http://perf.jenkins.couchbase.com/job/ares-tmp/42/console

      It seems that only the Twisted API tests see this issue, and I'm not sure whether it is an LCB or a Python SDK problem.

       

      08:12:15 [2020-05-06 08:12:15,257: WARNING/Process-3:95] Connection problem with worker-93 thread-7: [Failure instance: Traceback: <class 'couchbase.exceptions.ClientTemporaryFailException'>: <RC=0x3F8[LCB_ERR_NO_CONFIGURATION (1016)], There was a problem scheduling your request, or determining the appropriate server or vBucket for the key(s) requested. This may also be a bug in the SDK if there are no network issues, C Source=(src/store.c,250)>
       
      08:12:18 [2020-05-06 08:12:18,911: WARNING/Process-3:99] Request problem with worker-97 thread-2: <Key='5d53b9-000000041779', RC=0xC9[LCB_ERR_TIMEOUT (201)], Operational Error, Results=1, C Source=(src/multiresult.c,316), Context={'ref': '', 'scope': 'scope-1', 'bucket': 'bucket-1', 'endpoint': '172.23.133.13:11210\x00', 'cas': 0, 'key': '5d53b9-000000041779', 'collection': 'collection-560', 'context': '', 'type': 'KVErrorContext', 'opaque': 3083, 'status_code': 4}, Tracing Output={"5d53b9-000000041779": {"b": "bucket-1", "t": 2500000, "i": 8158122728661777930, "debug_info": {"FILE": "src/callbacks.c", "FUNC": "dur_chain2", "LINE": 747}, "l": "172.23.133.10:38416", "s": "kv:Unknown", "c": "1ec2595593a9ac0a/50a88b5dc5aac5b9", "r": "172.23.133.13:11210"}}>

      In several cases these errors happen immediately; in others they happen after about 5 minutes.
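
      For context, this is roughly the shape of the Twisted-API load loop perfrunner drives. This is only a sketch against the SDK 3.x txcouchbase API: class and module names such as TxCluster (and whether a connect Deferred must be awaited before issuing operations) varied across 3.x releases, and the endpoint, credentials and bucket name below are placeholders.

          from twisted.internet import defer, reactor

          from couchbase.auth import PasswordAuthenticator    # lived under a different module in early 3.x releases
          from couchbase.cluster import ClusterOptions
          from txcouchbase.cluster import TxCluster            # Twisted-flavoured cluster: KV ops return Deferreds

          @defer.inlineCallbacks
          def run_load(n_docs=1000):
              cluster = TxCluster('couchbase://172.23.133.13',
                                  ClusterOptions(PasswordAuthenticator('Administrator', 'password')))
              bucket = cluster.bucket('bucket-1')
              collection = bucket.default_collection()   # the perf tests target named scopes/collections instead

              # 100% writes, one outstanding op per iteration; perfrunner fans this out across workers/threads.
              for i in range(n_docs):
                  yield collection.upsert('key-%012d' % i, {'field': i})

              reactor.stop()

          reactor.callWhenRunning(run_load)
          reactor.run()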


        Activity

          Ellis Breen added a comment -

          Thanks Korrigan Clark - glad we got further, but sorry to hear about the memory issues. Please raise a new Jira case for this. Some memory profiling would be useful to track down the cause - https://github.com/zhuyifei1999/guppy3/ may be useful, alongside something like gprof.
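
          A rough sketch of the kind of heap sampling guppy3 allows inside a worker process, for reference (hpy(), setrelheap() and heap() are guppy3 calls; where exactly to hook this into the perfrunner worker is left open):

              from guppy import hpy  # pip install guppy3

              hp = hpy()
              hp.setrelheap()      # baseline: only count objects allocated after this point

              # ... run the workload for a while ...

              heap = hp.heap()     # live objects relative to the baseline
              print(heap)          # summary table grouped by type
              print(heap.byrcs)    # same objects grouped by referrers, to see what is holding on to them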


          Korrigan Clark (Inactive) added a comment -

          Actually, my mistake. I just got to the bottom of it (I had the wrong log level on), and it seems the timeouts are in fact still there; they are causing the system to exhaust its memory because the worker thread log file grows larger than memory.

          [2020-06-11 12:06:13,903: WARNING/Process-5:4] Request problem with worker-2 thread-3: <Key='f73d0d-000000099957', RC=0xC9[LCB_ERR_TIMEOUT (201)], Operational Error, Results=1, C Source=(src/multiresult.c,316), Context={'type': 'KVErrorContext', 'collection': 'collection-1', 'cas': 0, 'endpoint': '172.23.133.13:11210', 'ref': '', 'scope': 'scope-1', 'opaque': 292513, 'status_code': 4, 'context': '', 'bucket': 'bucket-1', 'key': 'f73d0d-000000099957'}, Tracing Output={"f73d0d-000000099957": {"debug_info": {"FILE": "src/callbacks.c", "FUNC": "dur_chain2", "LINE": 747}, "r": "172.23.133.13:11210", "l": "172.23.133.10:47062", "t": 2500000, "s": "kv:Unknown", "b": "bucket-1", "c": "00ba185d1ccc4108/83e305781c21da4d", "i": 11248613579459944884}}>


          Korrigan Clark (Inactive) added a comment -

          This is the UI graph of ops for a single async worker doing 100% writes. Correlating this with the worker logs:

          [2020-06-11 12:51:36,841: WARNING/Process-3:2] Request problem with worker-0 thread-0: <Key='47b157-000000007113', RC=0xC9[LCB_ERR_TIMEOUT (201)], Operational Error, Results=1, C Source=(src/multiresult.c,316), Context={'opaque': 707021, 'cas': 0, 'scope': 'scope-1', 'collection': 'collection-63', 'type': 'KVErrorContext', 'endpoint': '172.23.133.13:11210', 'key': '47b157-000000007113', 'ref': '', 'status_code': 4, 'context': '', 'bucket': 'bucket-1'}, Tracing Output={"47b157-000000007113": {"t": 2500000, "debug_info": {"LINE": 747, "FUNC": "dur_chain2", "FILE": "src/callbacks.c"}, "s": "kv:Unknown", "l": "172.23.133.10:52588", "c": "65b9a25e3e4266a9/ee3372933f1e5c85", "i": 1325853535787221673, "b": "bucket-1", "r": "172.23.133.13:11210"}}>

          This is the first timeout error the worker produces. The first error is at 12:51:36, and the task began running at:

          2020-06-11T12:48:44 [INFO] Running the task on localhost

          That gives a delta of 2 minutes 52 seconds before the worker sees the first timeout error.
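
          The delta can be checked directly from the two timestamps quoted above:

              from datetime import datetime

              task_start = datetime.fromisoformat('2020-06-11T12:48:44')       # "Running the task on localhost"
              first_timeout = datetime.fromisoformat('2020-06-11T12:51:36')    # first LCB_ERR_TIMEOUT in the worker log
              print(first_timeout - task_start)                                # 0:02:52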


          Korrigan Clark (Inactive) added a comment -

          Another run shows a similar issue, with the timeout occurring about 3 minutes 18 seconds after the worker starts:

          [2020-06-11 13:14:37,595: WARNING/MainProcess] celery@172-23-133-10 ready.

          pid 30068's current affinity list: 0-47

          pid 30068's new affinity list: 0

          [2020-06-11 13:17:55,454: WARNING/Process-4:2] Request problem with worker-0 thread-0: <Key='b07b7b-000000007113', RC=0xC9[LCB_ERR_TIMEOUT (201)], Operational Error, Results=1, C Source=(src/multiresult.c,316), Context={'endpoint': '172.23.133.13:11210', 'key': 'b07b7b-000000007113', 'context': '', 'collection': 'collection-89', 'status_code': 4, 'bucket': 'bucket-1', 'type': 'KVErrorContext', 'ref': '', 'scope': 'scope-1', 'cas': 0, 'opaque': 707018}, Tracing Output={"b07b7b-000000007113": {"s": "kv:Unknown", "b": "bucket-1", "r": "172.23.133.13:11210", "l": "172.23.133.10:54250", "c": "59125001cbadafd8/007919a78dd1f3bb", "debug_info": {"LINE": 747, "FILE": "src/callbacks.c", "FUNC": "dur_chain2"}, "t": 2500000, "i": 10273010956035881179}}>


          Korrigan Clark (Inactive) added a comment -

          Spent some time looking deeper, and it seems the issue is a memory leak, but I'm not sure where yet.

          What I found was that my code was setting the KV timeout on the Twisted cluster object incorrectly, and no error was thrown. After digging through the pycbc code I set the timeout properly. Now I see no errors logged anywhere, and things look fine from the timeout side of things.
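
          For reference, a minimal sketch of how the KV timeout is meant to be set on a 3.x cluster object, via ClusterTimeoutOptions with a timedelta. The endpoint and credentials below are placeholders, and exact import paths (for example, whether PasswordAuthenticator lives in couchbase.auth or elsewhere) shifted between 3.x releases; the same options object is accepted by the Twisted TxCluster.

              from datetime import timedelta

              from couchbase.auth import PasswordAuthenticator
              from couchbase.cluster import Cluster, ClusterOptions, ClusterTimeoutOptions

              # 2.5 s matches the default KV budget visible in the traces above ("t": 2500000 microseconds).
              timeouts = ClusterTimeoutOptions(kv_timeout=timedelta(seconds=2.5))

              cluster = Cluster('couchbase://172.23.133.13',
                                ClusterOptions(PasswordAuthenticator('Administrator', 'password'),
                                               timeout_options=timeouts))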

          Running with 1 worker, I was able to see that the memory leaks at a rate proportional to the throughput coming from the worker. Even with very low throughput, 100 ops/sec, the memory leak still persists. So I looked at one of the growing memory regions of the worker process to see what was being added, and it looks like errors similar to those in the comments above. The strange thing is that I don't see these errors logged anywhere in perfrunner like I could before.
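
          One way to confirm that the growth really tracks throughput is to diff allocation snapshots inside the worker at two points in time; a sketch using the standard-library tracemalloc (independent of any perfrunner hook) is below. Repeating it at different ops/sec rates should show whether the top growth sites scale with load.

              import tracemalloc

              tracemalloc.start(25)                   # keep 25 frames per allocation for readable tracebacks
              baseline = tracemalloc.take_snapshot()

              # ... drive the workload for a fixed number of operations ...

              current = tracemalloc.take_snapshot()
              for stat in current.compare_to(baseline, 'lineno')[:10]:
                  print(stat)                         # largest allocation growth sites since the baseline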


          People

            Assignee: Korrigan Clark (Inactive)
            Reporter: Korrigan Clark (Inactive)
