Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-45911

Eventing: LCB Exception 13014 seen for N1QL handler in Collection + Rebalance Tests

    XMLWordPrintable

Details

    • Untriaged
    • Centos 64-bit
    • 1
    • Unknown

    Description

      Build - 7.0.0 - 5017
      Test 1

      ./testrunner -i /tmp/testexec.32269.ini -p get-cbcollect-info=True,GROUP=bucket_op_p1,get-cbcollect-info=True -t eventing.eventing_rebalance_collection.EventingRebalanceCollection.test_rebalance_out_with_different_topologies,doc-per-day=20,dataset=default,nodes_init=4,services_init=kv-eventing:n1ql-eventing:n1ql-eventing:index,groups=simple,server_out=2,reset_services=True,GROUP=bucket_op_p1,java_sdk_client=True,get-cbcollect-info=False
       
      Exception: duplicated data in destination bucket which is not expected. Current : 1 Expected : 0  dcp_backlog : 0  TIMERS_IN_PAST : 0 lcb_exceptions : {'172.23.122.113': {}, '172.23.122.114': {}}
       
      "n1ql_op_exception_count": 1
       
      "lcb_exception_stats": {
                  "13014": 1
              }
      

      Test 2

      ./testrunner -i /tmp/testexec.32269.ini -p get-cbcollect-info=True,GROUP=bucket_op_p1,get-cbcollect-info=True -t eventing.eventing_rebalance_collection.EventingRebalanceCollection.test_swap_rebalance_with_different_topologies,doc-per-day=20,dataset=default,nodes_init=4,services_init=kv-eventing:n1ql-eventing:n1ql-eventing:index,groups=simple,server_out=2,services_in=eventing,n1ql,reset_services=True,GROUP=bucket_op_p1,java_sdk_client=True,get-cbcollect-info=False
       
      duplicated data in destination bucket which is not expected. Current : 4 Expected : 0  dcp_backlog : 0  TIMERS_IN_PAST : 0 lcb_exceptions : {'172.23.122.113': {'13014': 2}, '172.23.122.114': {}, '172.23.122.90': {}}
       
      "n1ql_op_exception_count": 2
       
      "lcb_exception_stats": {
                  "13014": 2
              }
      

      Attachments

        Issue Links

          No reviews matched the request. Check your Options in the drop-down menu of this sections header.

          Activity

            From lcb_exception_stats there are 3 201s - lcb_timeout , 3 207s - LCB_ERR_TEMPORARY_FAILURE , 2 12009 - CAS mismatch (something else was writing in parallel to the query), 2 405s - N1ql DML error.

            I believe cbcollect was taken in this test:

            ./testrunner -i /tmp/testexec.32269.ini -p get-cbcollect-info=True,GROUP=bucket_op_p1,get-cbcollect-info=True -t

            Query, kv and eventing logs will help in debugging further.

            abhishek.jindal Abhishek Jindal added a comment - From lcb_exception_stats there are 3 201s - lcb_timeout , 3 207s - LCB_ERR_TEMPORARY_FAILURE , 2 12009 - CAS mismatch (something else was writing in parallel to the query), 2 405s - N1ql DML error. I believe cbcollect was taken in this test: ./testrunner -i /tmp/testexec.32269.ini -p get-cbcollect-info=True,GROUP=bucket_op_p1, get-cbcollect-info=True -t Query, kv and eventing logs will help in debugging further.

            Timeouts and DML errors were observed due to the cluster being undersized.

            Note that most of the timeouts and temporary failure exceptions happened around the 2021-05-06T05:04 to 2021-05-06T05:07 timeframe.

            Temporary failure exceptions:

            2021-05-06T05:04:56.721-07:00 [Info] eventing-consumer [worker_sbm_0:/tmp/127.0.0.1:8091_0_1961136987.sock:26433] [lcb,retryq L:164 I:115587435] Failing command (pkt=0x7f97a4034290, opaque=4839, retries=31, time=5000090us, status=0x82) requested error: LCB_ERR_TIMEOUT (201), from retry queue: LCB_ERR_TEMPORARY_FAILURE (207)
            

            N1QL DML failures:

            2021-05-06T05:07:01.223-07:00 [Info] eventing-consumer [worker_n1ql_0:/tmp/127.0.0.1:8091_0_3004449606.sock:26473] {  "message": "SDK error : LCB_ERR_DML_FAILURE (405) Query error : {\n\"requestID\": \"67ed5bcb-4114-4912-8494-64838a971a5c\",\n\"clientContextID\": \"4@n1ql.js(OnUpdate)\",\n\"signature\": null,\n\"results\": [\n],\n\"errors\": [{\"code\":12009,\"msg\":\"DML Error, possible causes include concurrent modificationFailed to perform INSERT - cause: Duplicate Key doc_39532\"}],\n\"status\": \"errors\",\n\"metrics\": {\"elapsedTime\": \"2.436922ms\",\"executionTime\": \"2.423477ms\",\"resultCount\": 0,\"resultSize\": 0,\"serviceLoad\": 25,\"errorCount\": 1}\n}\n",  "stack": "Error\n    at OnUpdate (n1ql.js:4:17)"} Line: 3 Code: var query = N1QL('INSERT INTO dst_bucket.scope_1.coll_4 ( KEY, VALUE ) VALUES ( $docId ,\'Hello World\');', {'$docId':docId}, { 'consistency': 'none' }); Stack: Error    at OnUpdate (n1ql.js:4:17)
            

            Around this time, memory usage on KV node .101 crossed high watermark of 178 MB:

            This resulted in tmp_oom errors which converts to LCB_ERR_TEMPORARY_FAILURE and hence timeouts for bucket ops and DML errors for n1ql queries:

            abhishek.jindal Abhishek Jindal added a comment - Timeouts and DML errors were observed due to the cluster being undersized. Note that most of the timeouts and temporary failure exceptions happened around the 2021-05-06T05:04 to 2021-05-06T05:07 timeframe. Temporary failure exceptions: 2021-05-06T05:04:56.721-07:00 [Info] eventing-consumer [worker_sbm_0:/tmp/127.0.0.1:8091_0_1961136987.sock:26433] [lcb,retryq L:164 I:115587435] Failing command (pkt=0x7f97a4034290, opaque=4839, retries=31, time=5000090us, status=0x82) requested error: LCB_ERR_TIMEOUT (201), from retry queue: LCB_ERR_TEMPORARY_FAILURE (207) N1QL DML failures: 2021-05-06T05:07:01.223-07:00 [Info] eventing-consumer [worker_n1ql_0:/tmp/127.0.0.1:8091_0_3004449606.sock:26473] { "message": "SDK error : LCB_ERR_DML_FAILURE (405) Query error : {\n\"requestID\": \"67ed5bcb-4114-4912-8494-64838a971a5c\",\n\"clientContextID\": \"4@n1ql.js(OnUpdate)\",\n\"signature\": null,\n\"results\": [\n],\n\"errors\": [{\"code\":12009,\"msg\":\"DML Error, possible causes include concurrent modificationFailed to perform INSERT - cause: Duplicate Key doc_39532\"}],\n\"status\": \"errors\",\n\"metrics\": {\"elapsedTime\": \"2.436922ms\",\"executionTime\": \"2.423477ms\",\"resultCount\": 0,\"resultSize\": 0,\"serviceLoad\": 25,\"errorCount\": 1}\n}\n", "stack": "Error\n at OnUpdate (n1ql.js:4:17)"} Line: 3 Code: var query = N1QL('INSERT INTO dst_bucket.scope_1.coll_4 ( KEY, VALUE ) VALUES ( $docId ,\'Hello World\');', {'$docId':docId}, { 'consistency': 'none' }); Stack: Error at OnUpdate (n1ql.js:4:17) Around this time, memory usage on KV node .101 crossed high watermark of 178 MB: This resulted in tmp_oom errors which converts to LCB_ERR_TEMPORARY_FAILURE and hence timeouts for bucket ops and DML errors for n1ql queries:
            sujay.gad Sujay Gad added a comment -

            Test passed on toy build with latest run after increasing memory Quota.
            http://qa.sc.couchbase.com/job/dev_testbed_blr3/674/

             

            sujay.gad Sujay Gad added a comment - Test passed on toy build with latest run after increasing memory Quota. http://qa.sc.couchbase.com/job/dev_testbed_blr3/674/  
            pablo.silberkasten Pablo Silberkasten (Inactive) added a comment - Ankit Prabhu  lcb 3.1.3 was released earlier today  https://issues.couchbase.com/projects/CCBC/versions/17503
            sujay.gad Sujay Gad added a comment -

            Issue not observed on 7.0.0 - 5161

            sujay.gad Sujay Gad added a comment - Issue not observed on 7.0.0 - 5161

            People

              sujay.gad Sujay Gad
              sujay.gad Sujay Gad
              Votes:
              0 Vote for this issue
              Watchers:
              13 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Gerrit Reviews

                  There are no open Gerrit changes

                  PagerDuty