Couchbase Python Client Library · PYCBC-920

twisted api seeing lots of timeouts


Details


    Description

      For some performance tests, the Python SDK Twisted API is used. When upgrading the Twisted API in perfrunner to Python SDK 3.0.0, I am seeing lots of timeouts from LCB. If you scroll down you can see the errors: http://perf.jenkins.couchbase.com/job/ares-tmp/42/console

      It seems that only the Twisted API tests see this issue, and I'm not sure whether it is an LCB or a Python SDK problem.

       

      08:12:15 [2020-05-06 08:12:15,257: WARNING/Process-3:95] Connection problem with worker-93 thread-7: [Failure instance: Traceback: <class 'couchbase.exceptions.ClientTemporaryFailException'>: <RC=0x3F8[LCB_ERR_NO_CONFIGURATION (1016)], There was a problem scheduling your request, or determining the appropriate server or vBucket for the key(s) requested. This may also be a bug in the SDK if there are no network issues, C Source=(src/store.c,250)>

      08:12:18 [2020-05-06 08:12:18,911: WARNING/Process-3:99] Request problem with worker-97 thread-2: <Key='5d53b9-000000041779', RC=0xC9[LCB_ERR_TIMEOUT (201)], Operational Error, Results=1, C Source=(src/multiresult.c,316), Context={'ref': '', 'scope': 'scope-1', 'bucket': 'bucket-1', 'endpoint': '172.23.133.13:11210\x00', 'cas': 0, 'key': '5d53b9-000000041779', 'collection': 'collection-560', 'context': '', 'type': 'KVErrorContext', 'opaque': 3083, 'status_code': 4}, Tracing Output={"5d53b9-000000041779": {"b": "bucket-1", "t": 2500000, "i": 8158122728661777930, "debug_info": {"FILE": "src/callbacks.c", "FUNC": "dur_chain2", "LINE": 747}, "l": "172.23.133.10:38416", "s": "kv:Unknown", "c": "1ec2595593a9ac0a/50a88b5dc5aac5b9", "r": "172.23.133.13:11210"}}>
      

      In several cases these errors happen immediately, and in others they happen after about five minutes.

      Attachments


        Activity

          korrigan.clark Korrigan Clark (Inactive) created issue -
          korrigan.clark Korrigan Clark (Inactive) made changes -
          Description edited (log output wrapped in a code block)
          Ellis.Breen Ellis Breen made changes -
          Status New [ 10003 ] Open [ 1 ]
          Ellis.Breen Ellis Breen made changes -
          Fix Version/s 3.0.1 [ 16766 ]
          brett19 Brett Lawson made changes -
          Priority Major [ 3 ] Blocker [ 1 ]
          Ellis.Breen Ellis Breen added a comment -

          Just looked at this issue; in the first exception it appears operations are being attempted before bootstrap completes. Is this with a standard Twisted KV/service operation? They should theoretically all requeue themselves if the connection is not bootstrapped at the time of calling. What happens if you add your operations as a callback to the Deferred returned by collection.on_connect()?
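          For illustration, a minimal sketch of that callback pattern, assuming `collection` is an already-constructed txcouchbase collection and the reactor is running (key and value here are placeholders):

          {code:python}
          # Sketch only: defer the first operation until bootstrap has completed.
          def do_work(_):
              # Runs only once on_connect() has fired, i.e. after bootstrap.
              return collection.upsert('example-key', {'value': 42})

          def on_error(failure):
              # For illustration: just report the failure.
              print('operation failed:', failure)

          d = collection.on_connect()   # Deferred that fires when bootstrap completes
          d.addCallback(do_work)
          d.addErrback(on_error)
          {code}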

           

          The second looks like a timeout in dur_chain2 (is this the one you see after five minutes or so?). It could be due to server load or some other latency/bottleneck issue. Possibly a durability timeout issue? You could try raising the durability timeout, as durability latency tends to add up under load.
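          As an illustration only (not the perfrunner code), raising the per-operation timeout alongside a durability requirement in SDK 3.x might look like the sketch below; the option and import names follow the 3.0.x docs and may differ slightly between point releases:

          {code:python}
          # Sketch only: per-operation timeout and durability options in Python SDK 3.x.
          # Assumes `collection` already exists; key, value and the 10s timeout are placeholders.
          from datetime import timedelta

          from couchbase.collection import UpsertOptions
          from couchbase.durability import Durability, ServerDurability

          opts = UpsertOptions(
              timeout=timedelta(seconds=10),                           # raise the op timeout
              durability=ServerDurability(level=Durability.MAJORITY),  # only if durability is wanted
          )
          collection.upsert('example-key', {'value': 42}, opts)
          {code}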

           

          Ellis.Breen Ellis Breen made changes -
          Assignee Ellis Breen [ ellis.breen ] Korrigan Clark [ korrigan.clark ]

          korrigan.clark Korrigan Clark (Inactive) added a comment -

          This is the on_connect code: https://github.com/couchbase/perfrunner/blob/master/spring/wgen3.py#L603

           

          The test is not using any durability, i.e. durability=None, replicate_to=0 and persist_to=0, so there should be no durability errors. Furthermore, the timeout is set to 60 seconds. Here is a more recent run: http://perf.jenkins.couchbase.com/job/ares/14073/consoleFull

          In this test there are 100 workers with a batch size of 1000. All the workers time out and restart new batches. The server is able to handle 100k ops in a 60-second period, as seen in other tests. This is an identical test using the old async API in Python SDK 2: http://perf.jenkins.couchbase.com/job/ares/14009/. The only differences between the tests are the Twisted Python SDK upgrade and the addition of collections.

          If I disable the Twisted API for this test, I get results similar to the old Twisted API test.

          korrigan.clark Korrigan Clark (Inactive) made changes -
          Assignee Korrigan Clark [ korrigan.clark ] Ellis Breen [ ellis.breen ]
          Ellis.Breen Ellis Breen made changes -
          Status Open [ 1 ] In Progress [ 3 ]
          Ellis.Breen Ellis Breen added a comment -

          Hi, the timeout errors look quite similar to an issue I am currently working on that seems to affect only CentOS 7/8: https://issues.couchbase.com/browse/PYCBC-928

          There is a DNS SRV-related issue fixed in LCB that seems to have at least reduced the number of timeouts:

          https://issues.couchbase.com/browse/CCBC-1234

          I have a version of PYCBC that pulls in this fix, which I am evaluating (Sergey has already approved it for merge, but I wanted to see how it impacts this issue).

          You can try the patch as follows:

           

          pip install git+http://review.couchbase.org/couchbase-python-client@refs/changes/65/128665/1
           
          

          With respect to the NOT_CONFIGURED error, it appears to me you are using on_connect correctly, so perhaps something else is at play - not sure what yet. Could you try enabling LCB_LOGLEVEL=5 and doing a short run (the logging will be even more plentiful than what is normally produced)?
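          For reference, a minimal sketch of wiring that up from inside the worker process; couchbase.enable_logging() is the hook from the SDK logging howto, the rest is standard-library logging, and setting the environment variable in-process is just one way to do it:

          {code:python}
          # Sketch only: route verbose libcouchbase logging through Python's logging module.
          import logging
          import os

          # Must be set before the client creates its LCB instance.
          os.environ['LCB_LOGLEVEL'] = '5'

          import couchbase
          couchbase.enable_logging()

          logging.basicConfig(level=logging.DEBUG)  # send everything to stderr
          {code}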

           


          korrigan.clark Korrigan Clark (Inactive) added a comment -

          Ellis Breen I reran the test with varying numbers of async workers (they use the Twisted API). This test has 10 workers: http://perf.jenkins.couchbase.com/job/ares/14163/. There is a load phase which uses the regular Python SDK (non-Twisted), followed by an access phase which uses 10 Twisted workers. It looks like there are fewer timeouts during both the load and access phases, but there are still access-phase timeouts. Another interesting thing you can see is that there are perfrunner errors related to file descriptors at the end of the test.

          korrigan.clark Korrigan Clark (Inactive) added a comment -

          1 async access worker: http://perf.jenkins.couchbase.com/job/ares/14165/
          2 async access workers: http://perf.jenkins.couchbase.com/job/ares/14166/
          wayne Wayne Siu made changes -
          Labels performance affects-cc-testing performance
          Ellis.Breen Ellis Breen added a comment -

          Hi Korrigan - thanks for the logs. We're still seeing this on our test systems against the mock, so I am continuing to work on that first. Are you able to run default_collection-only tests with SDK3 and older servers? Do you still see the timeout errors? Trying to determine whether this is a server- or client-related issue.

          Ellis.Breen Ellis Breen added a comment -

          Note that, as mentioned in https://issues.couchbase.com/browse/PYCBC-928, we see similar timeout errors against the mock with SDK3 (but only on CentOS, which is the OS Korrigan is using), so it may be that this is a client-side issue rather than a server-side one.

          It *seems* that disabling IPv6 and using the DNS SRV fix in LCB (now merged) reduce (but don't eliminate) the incidence of these issues; interestingly, we don't see these timeout issues much against CentOS on our Jenkins integration tests. IPv6 issues would make sense with the mock, as IPv6 loopback behaviour can be rather different from that of IPv4, although why this is only seen on CentOS remains to be seen. Regardless, there seems to be something about CentOS that the client has problems with.


          korrigan.clark Korrigan Clark (Inactive) added a comment -

          Running some tests now against the default collection only and against older builds. Will update when complete.

           

          Can you tell me how you are disabling IPv6? I can try some runs with that too.

          Ellis.Breen Ellis Breen added a comment -

          Hi, to disable ipv6, add the option 'ipv6=disabled' to your connection string (see https://docs.couchbase.com/c-sdk/2.10/client-settings.html#settings-list for details) - thanks!
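          For illustration, appending the option to an SDK 3.x connection string might look like this; the host and credentials are placeholders, and the authenticator import path moved around in early 3.0.x releases, so adjust as needed:

          {code:python}
          # Sketch only: disable IPv6 via the connection string.
          from couchbase.cluster import Cluster, ClusterOptions
          from couchbase.auth import PasswordAuthenticator  # import path varies across 3.0.x

          cluster = Cluster(
              'couchbase://172.23.133.13?ipv6=disabled',
              ClusterOptions(PasswordAuthenticator('username', 'password')),
          )
          bucket = cluster.bucket('bucket-1')
          {code}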

          korrigan.clark Korrigan Clark (Inactive) made changes -
          Attachment logs1622.txt [ 95784 ]

          korrigan.clark Korrigan Clark (Inactive) added a comment -

          logs1622.txt

          I grabbed a snippet of the logs because the full job logs are far too large.

          korrigan.clark Korrigan Clark (Inactive) made changes -
          Comment [ [~Ellis.Breen] here is a run with 7.0.0-1622 and only targeting default collection.  The console output is huge due to lcb logging: [http://perf.jenkins.couchbase.com/job/ares/14310/]. I recommend downloading the text version of the console output ]
          Ellis.Breen Ellis Breen added a comment -

          Hi Korrigan, we managed to diagnose the cause of the timeout errors in PYCBC-928: CentOS running out of file handles (possibly sockets). Raising the limit with 'ulimit -n' may fix this, at least until we find any handle leaks, so I would recommend that as an attempted workaround.

          It would also be helpful in diagnosing this issue if you could run the code with and without the ulimit change, with LCB_LOGLEVEL=5 and couchbase.enable_logging() set - see https://docs.couchbase.com/python-sdk/3.0/howtos/collecting-information-and-logging.html for full details.

          N.B. We will need the stderr output of the Python client for this, so you may need to adjust the perf test framework to capture it.
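          As an aside, a minimal sketch of raising the open-file limit from inside the worker process itself (standard library only), as an alternative to running 'ulimit -n' in the shell that launches it:

          {code:python}
          # Sketch only: raise this process's RLIMIT_NOFILE soft limit to the hard limit.
          import resource

          soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
          print('RLIMIT_NOFILE before:', soft, hard)
          resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
          print('RLIMIT_NOFILE after:', resource.getrlimit(resource.RLIMIT_NOFILE))
          {code}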

          Ellis.Breen Ellis Breen made changes -
          Assignee Ellis Breen [ ellis.breen ] Korrigan Clark [ korrigan.clark ]

          korrigan.clark Korrigan Clark (Inactive) added a comment -

          Rerun with ulimit -n raised, the latest Python client commit, and no additional logging: http://perf.jenkins.couchbase.com/job/ares-tmp/55/

          I will need to make some code changes to allow Couchbase logging, so getting that to you might be delayed a bit.

          Ellis.Breen Ellis Breen made changes -
          Fix Version/s 3.0.2 [ 16826 ]
          Fix Version/s 3.0.1 [ 16766 ]

          mihir.kamdar Mihir Kamdar (Inactive) added a comment -

          Korrigan Clark is this still a blocker?

          Ellis.Breen Ellis Breen added a comment -

          Note that a number of LCB patches related to retry were just released and are now being used in PYCBC - hopefully these will have a positive impact on this issue (we stopped seeing timeouts here after conserving socket usage by reusing the mock).


          korrigan.clark Korrigan Clark (Inactive) added a comment -

          Mihir Kamdar I need to test with the new PYCBC commits; it may be resolved, but I just haven't checked yet. Will do so shortly. I think we can keep it as a blocker until I verify.


          korrigan.clark Korrigan Clark (Inactive) added a comment -

          After running tests with the latest commit of PYCBC, it seems the timeout errors are gone. However, new errors are now showing up.

          The test is just 100% writes by N workers, very simple. I have raised the open-file and process limits to their maximum and am now seeing memory being used up completely on the client machines. This happens regardless of the number of workers, and it happens right about when the total number of docs in Couchbase reaches 10-12 million. Docs are 1 KB, so the total doc size is around 10-12 GB. The client machine has 64 GB of RAM, and when it hits the memory error a quick free -m shows almost 0 free memory. Ellis Breen

          Ellis.Breen Ellis Breen added a comment -

          Thanks Korrigan Clark - glad we got further, but sorry to hear about the memory issues. Please raise a new Jira case for this. Some memory profiling would be useful to track down the cause - https://github.com/zhuyifei1999/guppy3/ may be useful, alongside something like gprof.
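          For what it's worth, a minimal guppy3 sketch that could be dropped into the worker loop to see which objects are accumulating (the function name is a placeholder, not perfrunner code):

          {code:python}
          # Sketch only: periodic heap snapshots with guppy3 to spot growth.
          from guppy import hpy

          heap_inspector = hpy()
          heap_inspector.setrelheap()  # measure growth relative to this point

          def dump_heap():
              # Call this every N batches/seconds from the worker loop.
              print(heap_inspector.heap())
          {code}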


          korrigan.clark Korrigan Clark (Inactive) added a comment -

          Actually, my mistake. I just got to the bottom of it (I had the wrong log level on), and it seems the timeouts are in fact still there; they are causing the system to exhaust its memory because the worker thread's log file grows larger than memory.

          [2020-06-11 12:06:13,903: WARNING/Process-5:4] Request problem with worker-2 thread-3: <Key='f73d0d-000000099957', RC=0xC9[LCB_ERR_TIMEOUT (201)], Operational Error, Results=1, C Source=(src/multiresult.c,316), Context={'type': 'KVErrorContext', 'collection': 'collection-1', 'cas': 0, 'endpoint': '172.23.133.13:11210', 'ref': '', 'scope': 'scope-1', 'opaque': 292513, 'status_code': 4, 'context': '', 'bucket': 'bucket-1', 'key': 'f73d0d-000000099957'}, Tracing Output={"f73d0d-000000099957": {"debug_info": {"FILE": "src/callbacks.c", "FUNC": "dur_chain2", "LINE": 747}, "r": "172.23.133.13:11210", "l": "172.23.133.10:47062", "t": 2500000, "s": "kv:Unknown", "b": "bucket-1", "c": "00ba185d1ccc4108/83e305781c21da4d", "i": 11248613579459944884}}>

          korrigan.clark Korrigan Clark (Inactive) made changes -

          korrigan.clark Korrigan Clark (Inactive) added a comment -

          This is the UI graph of ops for a single async worker doing 100% writes. Correlating this with the worker logs:

          [2020-06-11 12:51:36,841: WARNING/Process-3:2] Request problem with worker-0 thread-0: <Key='47b157-000000007113', RC=0xC9[LCB_ERR_TIMEOUT (201)], Operational Error, Results=1, C Source=(src/multiresult.c,316), Context={'opaque': 707021, 'cas': 0, 'scope': 'scope-1', 'collection': 'collection-63', 'type': 'KVErrorContext', 'endpoint': '172.23.133.13:11210', 'key': '47b157-000000007113', 'ref': '', 'status_code': 4, 'context': '', 'bucket': 'bucket-1'}, Tracing Output={"47b157-000000007113": {"t": 2500000, "debug_info": {"LINE": 747, "FUNC": "dur_chain2", "FILE": "src/callbacks.c"}, "s": "kv:Unknown", "l": "172.23.133.10:52588", "c": "65b9a25e3e4266a9/ee3372933f1e5c85", "i": 1325853535787221673, "b": "bucket-1", "r": "172.23.133.13:11210"}}>

          This is the first timeout error the worker produces; the error time is 12:51:36 and the task began running at:

          2020-06-11T12:48:44 [INFO] Running the task on localhost

          which gives a delta of 2 minutes 52 seconds before the worker sees the first timeout error.

          korrigan.clark Korrigan Clark (Inactive) made changes -

          korrigan.clark Korrigan Clark (Inactive) added a comment -

          Another run shows a similar issue, with the timeout occurring about 3 minutes 18 seconds after the worker starts:

          [2020-06-11 13:14:37,595: WARNING/MainProcess] celery@172-23-133-10 ready.
          pid 30068's current affinity list: 0-47
          pid 30068's new affinity list: 0
          [2020-06-11 13:17:55,454: WARNING/Process-4:2] Request problem with worker-0 thread-0: <Key='b07b7b-000000007113', RC=0xC9[LCB_ERR_TIMEOUT (201)], Operational Error, Results=1, C Source=(src/multiresult.c,316), Context={'endpoint': '172.23.133.13:11210', 'key': 'b07b7b-000000007113', 'context': '', 'collection': 'collection-89', 'status_code': 4, 'bucket': 'bucket-1', 'type': 'KVErrorContext', 'ref': '', 'scope': 'scope-1', 'cas': 0, 'opaque': 707018}, Tracing Output={"b07b7b-000000007113": {"s": "kv:Unknown", "b": "bucket-1", "r": "172.23.133.13:11210", "l": "172.23.133.10:54250", "c": "59125001cbadafd8/007919a78dd1f3bb", "debug_info": {"LINE": 747, "FILE": "src/callbacks.c", "FUNC": "dur_chain2"}, "t": 2500000, "i": 10273010956035881179}}>


          korrigan.clark Korrigan Clark (Inactive) added a comment -

          Spent some time looking deeper, and it seems the issue is a memory leak, though I am not sure where yet.

          What I found was that my code was setting the KV timeout on the Twisted cluster object incorrectly and no error was thrown. After digging through the PYCBC code I set the timeout properly. Now I see no errors logged anywhere and things look fine from the timeout side.

          Running with 1 worker, I was able to see that memory leaks at a rate proportional to the worker's throughput. Even at very low throughput, 100 ops/sec, the memory leak persists. So I looked at one of the growing memory regions of the worker process to see what was being added, and it looks like errors similar to those in the comments above. The strange thing is that I don't see these errors logged anywhere in perfrunner like I could before.
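          As an illustration of what setting the KV timeout at the cluster level looks like in SDK 3.x (not the actual perfrunner change; option and import names follow the 3.0.x docs and may differ slightly by point release):

          {code:python}
          # Sketch only: raise the default KV timeout via cluster-level options.
          from datetime import timedelta

          from couchbase.cluster import Cluster, ClusterOptions, ClusterTimeoutOptions
          from couchbase.auth import PasswordAuthenticator  # import path varies across 3.0.x

          opts = ClusterOptions(
              PasswordAuthenticator('username', 'password'),
              timeout_options=ClusterTimeoutOptions(kv_timeout=timedelta(seconds=60)),
          )
          cluster = Cluster('couchbase://host', opts)
          {code}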

          korrigan.clark Korrigan Clark (Inactive) made changes -
          Resolution Fixed [ 1 ]
          Status In Progress [ 3 ] Resolved [ 5 ]
          korrigan.clark Korrigan Clark (Inactive) made changes -
          Status Resolved [ 5 ] Closed [ 6 ]
          ingenthr Matt Ingenthron made changes -
          Workflow Couchbase SDK Workflow [ 167991 ] Couchbase SDK Workflow with Review [ 257372 ]

          People

            Assignee: korrigan.clark Korrigan Clark (Inactive)
            Reporter: korrigan.clark Korrigan Clark (Inactive)
            Votes: 0
            Watchers: 4
