  Couchbase Server
  MB-45967

Volume: Data missing for N1QL handlers


    Description

      Build: 7.0.0 build 5046 

      Test: Eventing volume https://docs.google.com/document/d/1dJ6VhHtLaRLTPVKQm0EomakzUNZDz2x7xJv3JqZVuqw/edit?usp=sharing 

      • Create a 9-node cluster (Eventing: 2, KV: 5, Index: 1, Query: 1)
      • Create 15 handlers (3 each of bucket op, timer, N1QL, SBM, curl)
      • Deploy the bucket op, timer and N1QL handlers
      • Rebalance in 1 Eventing node
      • Start CRUD on bucket_op collections where no handlers are deployed
      • Verify handlers

      Source Bucket(n1ql_op.scope_0.coll0) : 6226696, Destination Bucket(['n1ql_op.scope_1.coll0']) : 6226691 

      Source Bucket(n1ql_op.scope_0.coll1) : 6226416, Destination Bucket(['n1ql_op.scope_1.coll1']) : 6226410 

      function OnUpdate(doc, meta) {
          try {
              var query = UPSERT INTO n1ql_op.scope_1.`coll0` ( KEY, VALUE ) VALUES ( $meta.id, "n1ql insert");
          }
          catch (e) {
              log("Query failed: ", e)
          }
      }

      function OnDelete(meta, options) {
          try {
              var query = DELETE FROM n1ql_op.scope_1.`coll0` USE KEYS $meta.id;
          }
          catch (e) {
              log("Query failed: ", e)
          }
      }

      function OnUpdate(doc, meta) {
          try {
              var query = UPSERT INTO n1ql_op.scope_1.`coll1` ( KEY, VALUE ) VALUES ( $meta.id, "n1ql insert");
          }
          catch (e) {
              log("Query failed: ", e)
          }
      }

      function OnDelete(meta, options) {
          try {
              var query = DELETE FROM n1ql_op.scope_1.`coll1` USE KEYS $meta.id;
          }
          catch (e) {
              log("Query failed: ", e)
          }
      }

      Attachments

        1. invalid_requests (1).png (268 kB)
        2. n1ql_dmls.png (286 kB)
        3. n1ql_op_scope1_docs_difference.png (386 kB)
        4. n1ql_test0.log (26.99 MB)
        5. n1ql_test01.log (29.66 MB)
        6. n1ql_test02.log (37.63 MB)
        7. n1ql0.log (18.79 MB)
        8. n1ql01.log (19.17 MB)
        9. n1ql02.log (23.03 MB)
        10. queries_prior_to_deployment.png (269 kB)



          Activity

            vikas.chaudhary Vikas Chaudhary created issue -
            abhishek.jindal Abhishek Jindal made changes -
            Assignee Jeelan Poola [ jeelan.poola ] Abhishek Jindal [ abhishek.jindal ]

            abhishek.jindal Abhishek Jindal added a comment -

            Vikas Chaudhary Looks like this is a dupe of the recently raised MB-45713. NOT_MY_VBUCKET (NMVB) errors reported by query after 12 retries:

            _time=2021-04-27T13:42:44.085-07:00 _level=ERROR _msg=Failed to perform UPSERT on key <ud>CD075150-122_89131</ud> for Keyspace default:n1ql_op.scope_1.coll1. Error - <ud>unable to complete action after 12 attemps: NOT_MY_VBUCKET</ud>
            _time=2021-04-27T13:42:44.646-07:00 _level=ERROR _msg=Failed to perform UPSERT on key <ud>E7888C92-92_89275</ud> for Keyspace default:n1ql_op.scope_1.coll0. Error - <ud>unable to complete action after 12 attemps: NOT_MY_VBUCKET</ud>
            _time=2021-04-27T13:42:46.718-07:00 _level=ERROR _msg=Failed to perform UPSERT on key <ud>4F7D8E46-93_89392</ud> for Keyspace default:n1ql_op.scope_1.coll0. Error - <ud>unable to complete action after 12 attemps: NOT_MY_VBUCKET</ud>
            _time=2021-04-27T13:42:46.982-07:00 _level=ERROR _msg=Failed to perform UPSERT on key <ud>F11A45C5-101_89245</ud> for Keyspace default:n1ql_op.scope_1.coll1. Error - <ud>unable to complete action after 12 attemps: NOT_MY_VBUCKET</ud>
            _time=2021-04-27T13:43:27.081-07:00 _level=ERROR _msg=Failed to perform UPSERT on key <ud>47E3F194-103_93074</ud> for Keyspace default:n1ql_op.scope_1.coll0. Error - <ud>unable to complete action after 12 attemps: NOT_MY_VBUCKET</ud>
            _time=2021-04-27T13:43:36.093-07:00 _level=ERROR _msg=Failed to perform UPSERT on key <ud>47E3F194-103_94083</ud> for Keyspace default:n1ql_op.scope_1.coll0. Error - <ud>unable to complete action after 12 attemps: NOT_MY_VBUCKET</ud>
            _time=2021-04-27T13:43:49.055-07:00 _level=ERROR _msg=Failed to perform UPSERT on key <ud>CD075150-122_95553</ud> for Keyspace default:n1ql_op.scope_1.coll0. Error - <ud>unable to complete action after 12 attemps: NOT_MY_VBUCKET</ud>
            _time=2021-04-27T13:43:49.877-07:00 _level=ERROR _msg=Failed to perform UPSERT on key <ud>9F647ACB-100_95709</ud> for Keyspace default:n1ql_op.scope_1.coll1. Error - <ud>unable to complete action after 12 attemps: NOT_MY_VBUCKET</ud>
            

            At this point in time, node .186 was being rebalanced in while the two KV nodes .185 and .242 were being rebalanced out. Hence, it might be expected for the upserts of a few documents to fail even after 12 retries:

            021-04-27T13:29:46.878-07:00, ns_orchestrator:0:info:message(ns_1@172.23.106.253) - Starting rebalance, KeepNodes = ['ns_1@172.23.105.186','ns_1@172.23.105.190',
                                             'ns_1@172.23.106.154','ns_1@172.23.106.182',
                                             'ns_1@172.23.106.243','ns_1@172.23.106.253',
                                             'ns_1@172.23.106.255','ns_1@172.23.107.89',
                                             'ns_1@172.23.97.213','ns_1@172.23.97.214'], EjectNodes = ['ns_1@172.23.105.185',
                                                                                                       'ns_1@172.23.106.242'], Failed over and being ejected nodes = []; no delta recovery nodes; Operation Id = 986cf2c9e706df09fbae66d4c
            65c6acb
            

            abhishek.jindal Abhishek Jindal made changes -
            Assignee Abhishek Jindal [ abhishek.jindal ] Vikas Chaudhary [ vikas.chaudhary ]
            vikas.chaudhary Vikas Chaudhary added a comment - - edited

            Abhishek Jindal MB-44324 is closed; we have updated the lcb changes in 5043. I am not sure if the customer expects queries to fail when a rebalance is in progress.

            vikas.chaudhary Vikas Chaudhary made changes -
            Assignee Vikas Chaudhary [ vikas.chaudhary ] Abhishek Jindal [ abhishek.jindal ]

            abhishek.jindal Abhishek Jindal added a comment -

            I am not sure if the customer expects queries to fail when a rebalance is in progress.

            Sure, in that case assigning to the query folks to check on the query failures due to NOT_MY_VBUCKET exceptions and what options are left for the client / embedder once the query comes back with such an NMVB. CC: Marco Greco

            abhishek.jindal Abhishek Jindal made changes -
            Assignee Abhishek Jindal [ abhishek.jindal ] Marco Greco [ marco.greco ]
            abhishek.jindal Abhishek Jindal made changes -
            Component/s query [ 10718 ]
            marco.greco Marco Greco added a comment -

            It is wholly expected for NMVB to happen during rebalances.
            Query retries 2 x (number of nodes) times before giving up, and after the fix to MB-44328 it waits 100ms before each retry (ns_server should have refreshed the vbmap in about 100ms); here that works out to the 12 attempts seen in the logs above, roughly 1.2 seconds in total.
            The fact that after 1.2 seconds query still gets the outdated vbmap means that ns_server is struggling for whatever reason.
            There is a plan to use the vbmap from the memcached response to handle individual NMVB failures, but that has been deemed too risky for Cheshire-Cat and will be done in the next version.
            Your options are to ask ns_server why n1ql didn't get an updated vbmap in 1.2 seconds, and, as Eventing, to trap the NMVB and retry the operation.

            marco.greco Marco Greco made changes -
            Assignee Marco Greco [ marco.greco ] Abhishek Jindal [ abhishek.jindal ]
            kamini.jagtiani Kamini Jagtiani (Inactive) made changes -
            Component/s query [ 10718 ]

            abhishek.jindal Abhishek Jindal added a comment -

            Update: The patch http://review.couchbase.org/c/eventing/+/152666 has the fix where, for DML_FAILUREs coming from the query engine, Eventing will retry the DML query until the script timeout (which is 60 seconds by default). I believe this is enough time to address any kind of issue that can result in a DML error, such as etmpfail, not_my_vbucket, enomem, connection errors etc.

            This patch will be on master in a couple of days.

            abhishek.jindal Abhishek Jindal made changes -
            Due Date 05/May/21

            abhishek.jindal Abhishek Jindal added a comment -

            Vikas Chaudhary Patch is on master. Please feel free to run tests on the next build and let me know if we still observe any mismatch (especially while a KV rebalance is going on at the time DML queries are being fired).

            abhishek.jindal Abhishek Jindal made changes -
            Assignee Abhishek Jindal [ abhishek.jindal ] Vikas Chaudhary [ vikas.chaudhary ]
            abhishek.jindal Abhishek Jindal made changes -
            Resolution Fixed [ 1 ]
            Status Open [ 1 ] Resolved [ 5 ]
            chanabasappa.ghali Chanabasappa Ghali made changes -
            Assignee Vikas Chaudhary [ vikas.chaudhary ] Sujay Gad [ JIRAUSER25279 ]

            build-team Couchbase Build Team added a comment -

            Build couchbase-server-7.0.0-5092 contains eventing commit d2f1786 with commit message:
            MB-45967 : Retry on DML Failures

            sujay.gad Sujay Gad made changes -
            Resolution Fixed [ 1 ]
            Status Resolved [ 5 ] Reopened [ 4 ]
            sujay.gad Sujay Gad made changes -
            Assignee Sujay Gad [ JIRAUSER25279 ] Abhishek Jindal [ abhishek.jindal ]
            abhishek.jindal Abhishek Jindal made changes -
            Attachment n1ql_op_scope1_docs_difference.png [ 138699 ]
            abhishek.jindal Abhishek Jindal made changes -
            Attachment invalid_requests (1).png [ 138707 ]

            abhishek.jindal Abhishek Jindal added a comment -

            A difference of 12 documents between n1ql_op:scope_0:coll1 and n1ql_op:scope_1:coll1 is seen. This difference is coming only from KV node .253 and the n1ql handler n1ql1_0.

            Worth noting that the difference was observed very early on in the test, at 2021-05-04T00:42:10 UTC-07:00. At this point in time the handler n1ql1_0 had been running for only around 14 minutes (I observe the handler started at 2021-05-04T00:28:10).

            Now, where is this 12 document count difference coming from? With the recent patch http://review.couchbase.org/c/eventing/+/152666 it cannot be a DML_FAILURE (else the same operation would've been retried until script timeout) and query would've reported an error.

            No DML errors were reported by the query service either. Hence, we can confirm that the DML_FAILURE issues discussed in the previous comments have been resolved.
            This time, we see exactly 12 invalid_requests (I believe corresponding to upserts for 12 docs) reported by the query engine:

            As per their code, an invalid request is reported when a query HTTP request does not hit the appropriate "/query/service" endpoint.
            These errors are reported at "00:24:10"; hence the requests were made by libcouchbase at 00:24:10.
            From the process list command in cbcollect_info, we can see that the start time of one of the query engine threads is reported as "00:31:38":

             couchba+  41809  42684  41803   42  5.0      0  13167  19   0 5355892 2375284 ?     Sl   futex_wait_q 00:31:38   4:28 cbq-engine      /opt/couchbase/bin/cbq-engine --datastore=http://127.0.0.1:8091 --http=:8093 --         configstore=http://127.0.0.1:8091 --enterprise=true --ipv4=required --ipv6=optional --https=:18093 --certfile=/opt/couchbase/var/lib/couchbase/config/memcached-cert.pem --keyfile=/opt/couchbase/var/lib/couchbase/config/   memcached-key.pem
            

            Hence, we can say with confidence that at the time the eventing handler was deployed, the query engine was not completely up and running.
            Sujay Gad Given the observations above, I would suggest a sleep for some time before the handler deployment and the load on the src collection start, OR after the query node is rebalanced into the cluster. We can see here that it is probable that the eventing n1ql handler was deployed and the load started when one of the query engine's threads wasn't up.
            We need to eliminate such variables.

            abhishek.jindal Abhishek Jindal made changes -
            Assignee Abhishek Jindal [ abhishek.jindal ] Sujay Gad [ JIRAUSER25279 ]
            sujay.gad Sujay Gad added a comment -

            Abhishek Jindal The query node is added to the cluster around 00:22:18:

            [2021-05-04T00:22:08-07:00, sequoiatools/couchbase-cli:7.0:2439c0] server-add -c 172.23.106.253 -u Administrator -p password --server-add 172.23.106.255 --server-add-username Administrator --server-add-password password --services query
            [pull] sequoiatools/couchbase-cli:7.0
            [2021-05-04T00:22:18-07:00, sequoiatools/couchbase-cli:7.0:bda1dd] rebalance -c 172.23.106.253 -u Administrator -p password
            

            After that there is no change in the topology of query nodes.

            The n1ql handler is deployed at 00:26:48, which is after the 12 invalid requests are observed:

            [2021-05-04T00:26:48-07:00, sequoiatools/eventing:7.0:d9407f] eventing_helper.py -i 172.23.105.185 -u Administrator -p password -o deploy --name n1ql
            

            Before the n1ql handler is deployed, the following N1QL requests are also performed:

            [2021-05-04T00:22:58-07:00, sequoiatools/cbq:c5ec2f] -e=http://172.23.106.255:8093 -u=Administrator -p=password -script=create PRIMARY INDEX on bucket_op
            [pull] sequoiatools/cbq
            [2021-05-04T00:23:08-07:00, sequoiatools/cbq:9f7629] -e=http://172.23.106.255:8093 -u=Administrator -p=password -script=create PRIMARY INDEX on source_op
            [pull] sequoiatools/cbq
            [2021-05-04T00:23:17-07:00, sequoiatools/cbq:e1a9bc] -e=http://172.23.106.255:8093 -u=Administrator -p=password -script=create PRIMARY INDEX on timer_op
            [pull] sequoiatools/cbq
            [2021-05-04T00:23:27-07:00, sequoiatools/cbq:d511e4] -e=http://172.23.106.255:8093 -u=Administrator -p=password -script=create PRIMARY INDEX on curl_op
            [pull] sequoiatools/cbq
            [2021-05-04T00:23:36-07:00, sequoiatools/cbq:bbfd42] -e=http://172.23.106.255:8093 -u=Administrator -p=password -script=create PRIMARY INDEX on n1ql_op
            [pull] sequoiatools/cbq
            [2021-05-04T00:23:45-07:00, sequoiatools/cbq:55b569] -e=http://172.23.106.255:8093 -u=Administrator -p=password -script=create PRIMARY INDEX on metadata
            

            sujay.gad Sujay Gad made changes -
            Assignee Sujay Gad [ JIRAUSER25279 ] Abhishek Jindal [ abhishek.jindal ]
            mihir.kamdar Mihir Kamdar (Inactive) made changes -
            Labels volume-test affects-cc-testing volume-test

            ritam.sharma Ritam Sharma added a comment -

            Sujay and Abhi to work on more debugging to check completed and failed requests from n1ql.

            abhishek.jindal Abhishek Jindal made changes -
            Due Date 05/May/21 07/May/21
            abhishek.jindal Abhishek Jindal made changes -
            Attachment queries_prior_to_deployment.png [ 139237 ]
            abhishek.jindal Abhishek Jindal made changes -
            Attachment n1ql_dmls.png [ 139238 ]
            abhishek.jindal Abhishek Jindal added a comment - - edited

            Continuing with the earlier investigation, we know that the doc count difference was spotted on node .253 somewhere within the roughly 15-minute window between 2021-05-04T00:27:10 UTC-07:00 (when the handlers were first deployed) and 2021-05-04T00:42:10 UTC-07:00.

            By 2021-05-04T00:42:10, handler n1ql0 across all nodes fired 890,680 times and handler n1ql1 fired 891,104 times.
            In total the n1ql0 + n1ql1 handlers fired 890,680 + 891,104 = 1,781,784 times. This means there should be 1,781,784 query upsert requests reported by the query engine. Let's check that.
            From query's perspective, 6 queries were fired before any handler was even deployed.

            So we check the stat n1ql_requests - n1ql_selects - 6 to get all DML requests hitting query node .255, which is confirmed to be 1,781,784:

            Given that we got the same numbers from two independent sources, we can confirm that all on_update_success calls are making it to the query service.
            Hence, the only possibilities for a mismatch are:

            1. DCP not streaming all mutations for that one KV node : items_remaining KV stat states 0 items remaining to send. So can eliminate for now.
            2. Eventing can be dropping mutations.
            3. Issue with query service where it incorrectly states 0 n1ql_errors.
            4. Issue with the kind of doc ids the test automation is inserting. It might be that doc ids have a non-printable char and hence considered different on source collection but same on destination collection.

            We can test for points 2, 3 and 4 via the following:

            • I see the onupdate handler uses the upsert N1QL query: UPSERT INTO n1ql_op.scope_1.`coll0` ( KEY, VALUE ) VALUES ( $meta.id ,"n1ql insert"); Ideally, if the volume test is only doing insert ops on the source collection, the handler should be running INSERT INTO DML queries as well. With this, we might catch a case of duplicate keys (mutations on a single key). A combined sketch of these suggestions appears at the end of this comment.
            • In the onupdate handler, please add the app log statement: log("Doc created/updated", meta.id);
            • Just after the INSERT INTO query, the log statement log(meta.id, query) would also be needed to check for successful invocation.
            • Additionally, I observe that the volume test creates 8 collections: n1ql_op:scope0:coll0, coll1, coll2, coll3 as source and n1ql_op:scope1:coll0, coll1, coll2, coll3 as destination. To eliminate any transient issues such as query timeouts etc., I would request changing the test to create a new n1ql handler - say n1ql2 - which listens on the same src collection as the problematic handler (in this case n1ql1) but writes to a different dst collection, say n1ql_op:scope1:coll2.

            So the src collection will be the same as that for n1ql1 (n1ql_op:scope0:coll1) but the destination can be n1ql_op:scope1:coll2.
            The reason being: if we do not observe a mismatch for the second handler, this will eliminate point 4 (bad doc ID) as an issue, so that one can focus only on potential UPSERT timeouts, DCP mutations getting dropped etc.

            • While deploying a handler, I would also request setting "app_log_max_size" in the handler settings to a large value such as 1 GB (1073741824 bytes) so that app logs do not rotate over the course of the volume test. (Worth noting that there should be at least 1 GB of disk space to hold this log file.)

            With this, once we reproduce a mismatch the following will be needed:

            1. A set of cbcollects
            2. Application logs
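
            To make the suggestions above concrete, here is a minimal sketch of what the proposed n1ql2 handler could look like (the handler name, destination collection and log texts are illustrative assumptions, not the shipped test code):

            function OnUpdate(doc, meta) {
                log("Doc created/updated", meta.id);            // per-mutation app log entry
                try {
                    // INSERT instead of UPSERT so duplicate keys surface as errors;
                    // destination is the new collection suggested for the n1ql2 handler.
                    var query = INSERT INTO n1ql_op.scope_1.`coll2` ( KEY, VALUE ) VALUES ( $meta.id, "n1ql insert");
                    log(meta.id, query);                         // confirms the statement was invoked
                }
                catch (e) {
                    log("Query failed: ", meta.id, e);           // duplicate keys / DML errors land here
                }
            }
            // Deploy with "app_log_max_size" set to 1073741824 (1 GB) in the handler
            // settings so the application log does not rotate during the volume test.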

            abhishek.jindal Abhishek Jindal made changes -
            Assignee Abhishek Jindal [ abhishek.jindal ] Sujay Gad [ JIRAUSER25279 ]
            sujay.gad Sujay Gad made changes -
            Attachment n1ql_test0.log [ 139612 ]
            Attachment n1ql0.log [ 139613 ]
            ritam.sharma Ritam Sharma made changes -
            Assignee Sujay Gad [ JIRAUSER25279 ] Abhishek Jindal [ abhishek.jindal ]
            sujay.gad Sujay Gad made changes -
            Attachment n1ql_test01.log [ 139616 ]
            Attachment n1ql01.log [ 139617 ]
            sujay.gad Sujay Gad made changes -
            Attachment n1ql_test02.log [ 139618 ]
            Attachment n1ql02.log [ 139619 ]

            abhishek.jindal Abhishek Jindal added a comment -

            Sujay Gad I see 3 logs: n1ql01.log, n1ql0.log and n1ql02.log. Can you please specify which log was extracted from which Eventing node?

            sujay.gad Sujay Gad made changes -
            Attachment n1ql_test02.log [ 139618 ]
            sujay.gad Sujay Gad made changes -
            Attachment n1ql_test02.log [ 139620 ]
            abhishek.jindal Abhishek Jindal made changes -
            Due Date 07/May/21 14/May/21

            srinivasan.raman Srinivasan Raman added a comment -

            Abhishek Jindal, Sujay Gad has provided a log-file-name to node mapping in his comment. Is that info sufficient to do an initial triage of this issue with the latest run?


            abhishek.jindal Abhishek Jindal added a comment -

            Srinivasan Raman - Currently checking the application logs to see whether these help in narrowing the line of thought further. Will update if any further info is needed.

            Thanks,


            abhishek.jindal Abhishek Jindal added a comment -

            Thanks for sharing the app logs Sujay Gad. We have the RCA for this and are working on a fix.
            Worth noting that MB-44784 also has the exact same root cause as this case. It is just more easily reproducible for the N1QL handler case because N1QL queries, unlike timer creation, are slower, which makes the observed condition more probable.

            abhishek.jindal Abhishek Jindal made changes -
            Link This issue relates to MB-44784 [ MB-44784 ]
            vinayaka.kamath Vinayaka Kamath (Inactive) made changes -
            Link This issue is duplicated by MB-46264 [ MB-46264 ]
            abhishek.jindal Abhishek Jindal added a comment - - edited

            Sujay Gad While patch http://review.couchbase.org/c/eventing/+/153424 is on CI, I have created this toy build:

            http://server.jenkins.couchbase.com/view/Toys/job/toy-unix-simple/2945/
            http://latestbuilds.service.couchbase.com/builds/latestbuilds/couchbase-server/toybuilds/12945/

            While I have tested this patch locally, I would request another run of the same N1QL volume test with this build. This will confirm whether the issue has been fixed.

            I would also request another run of the timers volume test. This patch addresses both.

            abhishek.jindal Abhishek Jindal made changes -
            Assignee Abhishek Jindal [ abhishek.jindal ] Sujay Gad [ JIRAUSER25279 ]
            keshav Keshav Murthy made changes -
            Due Date 14/May/21 17/May/21

            ritam.sharma Ritam Sharma added a comment -

            Abhishek Jindal - Patch looks good. The following step would earlier fail with n1ql and sometimes for timers; with the patch the step looks good:

            [2021-05-16T22:31:45-07:00, sequoiatools/eventing:7.0:7be962] eventing_helper.py -i 172.23.106.255 -u Administrator -p password -o verify -s bucket_op.scope_0.coll0 -d bucket_op.scope_1.coll0
            [2021-05-16T22:39:51-07:00, sequoiatools/eventing:7.0:92322b] eventing_helper.py -i 172.23.106.255 -u Administrator -p password -o verify -s bucket_op.scope_0.coll1 -d bucket_op.scope_1.coll1
            [2021-05-16T22:39:57-07:00, sequoiatools/eventing:7.0:c214fc] eventing_helper.py -i 172.23.106.255 -u Administrator -p password -o verify -s bucket_op.scope_0.coll0 -d source_op.scope_0.coll0
            [2021-05-16T22:40:02-07:00, sequoiatools/eventing:7.0:d06d5c] eventing_helper.py -i 172.23.106.255 -u Administrator -p password -o verify -s bucket_op.scope_0.coll1 -d source_op.scope_0.coll1
            [2021-05-16T22:40:08-07:00, sequoiatools/eventing:7.0:ee415c] eventing_helper.py -i 172.23.106.255 -u Administrator -p password -o verify -s timer_op.scope_0.coll0 -d timer_op.scope_1.coll0
            [2021-05-16T22:40:13-07:00, sequoiatools/eventing:7.0:01a45e] eventing_helper.py -i 172.23.106.255 -u Administrator -p password -o verify -s timer_op.scope_0.coll1 -d timer_op.scope_1.coll1
            [2021-05-16T22:40:19-07:00, sequoiatools/eventing:7.0:1870d1] eventing_helper.py -i 172.23.106.255 -u Administrator -p password -o verify -s n1ql_op.scope_0.coll0 -d n1ql_op.scope_1.coll0
            [2021-05-16T22:41:24-07:00, sequoiatools/eventing:7.0:85c2f8] eventing_helper.py -i 172.23.106.255 -u Administrator -p password -o verify -s n1ql_op.scope_0.coll1 -d n1ql_op.scope_1.coll1
            [pull] sequoiatools/gideon2
            

            abhishek.jindal Abhishek Jindal made changes -
            Link This issue is cloned by MB-46326 [ MB-46326 ]
            ritam.sharma Ritam Sharma made changes -
            Assignee Sujay Gad [ JIRAUSER25279 ] Abhishek Jindal [ abhishek.jindal ]
            ritam.sharma Ritam Sharma added a comment -

            Abhishek Jindal - Toy build looks good; test passed end to end.


            abhishek.jindal Abhishek Jindal added a comment -

            Thanks for the update Ritam Sharma! Will merge once CI gives a +1.

            ankit.prabhu Ankit Prabhu made changes -
            Due Date 17/May/21 18/May/21
            wayne Wayne Siu made changes -
            Link This issue backports to MB-46326 [ MB-46326 ]
            wayne Wayne Siu made changes -
            Link This issue is cloned by MB-46326 [ MB-46326 ]

            abhishek.jindal Abhishek Jindal added a comment -

            Resolving as patch has been merged to unstable.

            abhishek.jindal Abhishek Jindal made changes -
            Resolution Fixed [ 1 ]
            Status Reopened [ 4 ] Resolved [ 5 ]

            build-team Couchbase Build Team added a comment -

            Build couchbase-server-7.0.0-5180 contains eventing commit 9f60834 with commit message:
            MB-45967 : Address mutations getting dropped due to racing worker queues

            mihir.kamdar Mihir Kamdar (Inactive) made changes -
            Assignee Abhishek Jindal [ abhishek.jindal ] Ritam Sharma [ ritam.sharma ]
            ritam.sharma Ritam Sharma made changes -
            Assignee Ritam Sharma [ ritam.sharma ] Sujay Gad [ JIRAUSER25279 ]
            sujay.gad Sujay Gad made changes -
            Resolution Fixed [ 1 ]
            Status Resolved [ 5 ] Reopened [ 4 ]
            sujay.gad Sujay Gad made changes -
            Assignee Sujay Gad [ JIRAUSER25279 ] Abhishek Jindal [ abhishek.jindal ]

            abhishek.jindal Abhishek Jindal added a comment -

             2021-05-19T23:59:04.159-07:00 [Info] eventing-consumer [worker_n1ql0_0_0:/tmp/127.0.0.1:8091_0_520596646.sock:99228] [lcb,connection L:153 I:3023917820] <172.23.106.242:11210> (SOCK=b071ce4ea08de6e7) Connected established
             2021-05-19T23:59:04.166-07:00 [Info] eventing-consumer [worker_n1ql0_0_0:/tmp/127.0.0.1:8091_0_520596646.sock:99228] [lcb,server L:965 I:2478103646] <NOHOST:NOPORT> (CTX=(nil),,SRV=0x7fb860086980,IX=0) Connection attempt  failed. Received LCB_ERR_AUTHENTICATION_FAILURE (206) from libcouchbase, received 0 (Success) from operating system
             2021-05-19T23:59:04.167-07:00 [Info] eventing-consumer [worker_n1ql0_0_0:/tmp/127.0.0.1:8091_0_520596646.sock:99228] [lcb,server L:965 I:3023917820] <NOHOST:NOPORT> (CTX=(nil),,SRV=0x7fb854006940,IX=0) Connection attempt  failed. Received LCB_ERR_AUTHENTICATION_FAILURE (206) from libcouchbase, received 0 (Success) from operating system
            

            It seems that libcouchbase was receiving authentication failures at the time node .242 was rebalanced in at 23:45.
            From kv engine's point of view:

            2021-05-20T00:45:29.100759-07:00 WARNING 247: Invalid password specified for [<ud>unknown</ud>]. Mechanism:[SCRAM-SHA512], UUID:[9f97af30-1bb8-40e6-6883-42f39005dedc]
            

            Memcached is not getting proper credentials from lcb. Recently we made a change to shift the responsibility of caching the username/password into the eventing consumer process itself. This is probably related. Checking further.

            abhishek.jindal Abhishek Jindal made changes -
            Due Date 18/May/21 21/May/21

            abhishek.jindal Abhishek Jindal added a comment -

            http://review.couchbase.org/c/eventing/+/154046 has been merged to unstable, which works around the issue. Worth running a new volume test once it is in a build.

            abhishek.jindal Abhishek Jindal made changes -
            Resolution Fixed [ 1 ]
            Status Reopened [ 4 ] Resolved [ 5 ]
            abhishek.jindal Abhishek Jindal made changes -
            Assignee Abhishek Jindal [ abhishek.jindal ] Sujay Gad [ JIRAUSER25279 ]

            build-team Couchbase Build Team added a comment -

            Build couchbase-server-7.0.0-5211 contains eventing commit 8f10a7f with commit message:
            MB-45967 : Ensure creds cache entries are no more than 10 seconds old


            build-team Couchbase Build Team added a comment -

            Build couchbase-server-7.0.1-5801 contains eventing commit 8f10a7f with commit message:
            MB-45967 : Ensure creds cache entries are no more than 10 seconds old

            sujay.gad Sujay Gad added a comment -

            Issue not observed on 7.0.0-5219.

            sujay.gad Sujay Gad made changes -
            Status Resolved [ 5 ] Closed [ 6 ]
            abhishek.jindal Abhishek Jindal made changes -
            Assignee Sujay Gad [ JIRAUSER25279 ] Abhishek Jindal [ abhishek.jindal ]
            Resolution Fixed [ 1 ]
            Status Closed [ 6 ] Reopened [ 4 ]
            jeelan.poola Jeelan Poola added a comment - - edited

            The fix to retry DML failures internally in Eventing is not right. It can have adverse side effects, especially in cases where there is a multi-document insert/update/delete N1QL statement and a DML error is reported by the query node after 'N' documents have already been inserted/updated/deleted. Eventing retrying the entire multi-document N1QL operation will lead to incorrect behaviour.

            Unfortunately, the best thing that can be done in this case is to bubble up the DML failure exception into JavaScript and let the user retry the operation in JavaScript code, as more context (whether the N1QL operation is single- or multi-document etc.) is available there. Which essentially means the volume test JavaScript code needs to be modified to retry on DML failures for N1QL operations. It will ensure a test pass, hopefully.
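
            As a rough sketch of that suggestion (the retry budget and error handling below are assumptions, not the actual volume-test code), the handler could retry the single-document UPSERT itself when the DML failure surfaces as a JavaScript exception:

            function OnUpdate(doc, meta) {
                var MAX_ATTEMPTS = 5;                            // assumed retry budget
                for (var attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
                    try {
                        var query = UPSERT INTO n1ql_op.scope_1.`coll0` ( KEY, VALUE ) VALUES ( $meta.id, "n1ql insert");
                        return;                                  // success, stop retrying
                    }
                    catch (e) {
                        // DML failures (e.g. NMVB during a KV rebalance) surface here;
                        // retrying is safe because this statement touches a single document.
                        log("Query failed, attempt " + attempt + ": ", meta.id, e);
                    }
                }
                log("Giving up after " + MAX_ATTEMPTS + " attempts for ", meta.id);
            }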


            abhishek.jindal Abhishek Jindal added a comment -

            Forgot to put an update on the reason this was reopened: in addition to Jeelan Poola's comment above, I'm currently testing how the user can catch the DML failures and retry the query.
            Additionally, for N1QL queries where there can be multiple nested UPSERT operations, we need to check what error message n1ql returns if, say, 1 of 100 INSERTs (that are part of the same DML query) fails.
            Will post further updates soon.


            abhishek.jindal Abhishek Jindal added a comment -

            Marco Greco I would like to reinitiate the conversation around DML_failures reported by the query service during a KV rebalance. There are 2 scenarios to discuss:

            Scenario 1: Single document insert / upsert scenario:

            As a background, QE has the following onUpdate handler running at the time of a KV rebalance:

            function OnUpdate(doc, meta) {
                try{
                    var query = UPSERT INTO n1ql_op.scope_1.`coll0` ( KEY, VALUE ) VALUES ( $meta.id ,"n1ql insert");
                }
                catch(e){
                    log("Query failed: ",e)
                }
            }
            

            Here we see that only 1 document is being upserted per query. This upsert can fail due to NMVB, etmpfail etc., after which the
            error propagation is query -> LCB -> Eventing.

            LCB doesn't recognise DML_Failure as a retriable error, hence it just hands over the decision to the application (here, Eventing).
            Eventing has 2 choices:

            • Pass this exception back to the JS code and leave it up to the user on what to do with DML_failure
            • Keep retrying until handler timeout. This doesn't give the user a chance to catch the exception.

            This is equivalent to the following setup:

            User's application -- interacts with --> SDK --> KV

            If the SDK receives an NMVB, tmpfail etc. from KV, it retries internally instead of passing it back to the user's application.
            However, the latter is what is happening here - Eventing (as the application) is receiving a DML_Failure (as a result of NMVB) which is being passed on without any retry.

            The same holds true for the other reasons that DML_failure encapsulates. Hence, the argument here is that the SDK should take the responsibility of retrying a DML_failure at least once before passing it on to the application (here, Eventing).

            Scenario 2: Multiple document insert / upsert scenario:

            Now, let me modify my handler to the following:

            function OnUpdate(doc, meta) {
                try{
                    var query = UPSERT INTO n1ql_op.scope_1.`coll0` ( KEY foo, VALUE bar) select meta().id as foo, "some value" as bar from n1ql_op.scope_0.`coll0`;
                }
                catch(e){
                    log("Query failed: ",e)
                }
            }
            

            where n1ql_op.scope_0.`coll0` has, say, 50K documents.
            Here, on every mutation from the source bucket, the eventing handler runs a batch of 50K upserts, all while the KV rebalance is going on.
            Within the query engine, down the line we hit the following function:

            https://github.com/couchbase/query/blob/master/datastore/couchbase/couchbase.go#L1832-L1949
            Worth noting line 1949, where query returns a DML error only when the mPairs array is empty.
            As can be seen from lines 1923-1940, the only way mPairs can be empty is if the upserts for none of the key-value pairs succeed.
            The takeaway is that if, out of 50K upserts, 49,999 fail due to nmvb, tmpfail, tmpoom, connection errors etc. and only 1 succeeds, the overall query will be reported as successful. No DML_failure will be returned.

            So LCB, Eventing and the JavaScript world will be unaware of this failure, and a success is returned.
            Is this intentional? If yes, how is the eventing user supposed to know which of the rows failed?

            The bottom line of Scenarios 1 and 2, from Eventing's point of view, is that all we need from the SDK and query is an error code that can be passed on to the user to catch in their JS code. Leaving errors like DML_failures to be retried by eventing (just like any other application) without any retry from the SDK doesn't seem consistent. Moreover, what should the error handling mechanism (if any) be for scenarios with partial upsert failures?

            CC: SDK Team for more inputs: Sergey Avseyev, Matt Ingenthron, Michael Nitschinger
            CC: For query's inputs: Kamini Jagtiani

            abhishek.jindal Abhishek Jindal made changes -
            Assignee Abhishek Jindal [ abhishek.jindal ] Marco Greco [ marco.greco ]
            marco.greco Marco Greco added a comment -

            The decision taken by Jeelan Poola and Kamini Jagtiani remains the same: query has been amended to retry NMVB with a delay equal to 2 x nodes x 100ms. In your test, as I understand it, that equates to 1.2 seconds, which should be plenty of time for ns_server to supply the new vbMap.
            For 7.1, we will implement a new mechanism which will first get the new node from the KV response, build a connection against that node if one is missing, and submit the request to that node; if that fails, it will enter a retry mechanism as above.
            The point still remains that we cannot retry forever - at some stage a statement has to fail, because if it doesn't, and all servicers enter a 'retry with the KV' state, n1ql will not be able to service any more requests, which would be an even worse outcome.
            Look at it another way: a KV get or put should last no more than a few milliseconds, and query is already retrying for a couple of orders of magnitude longer, only passing the error back when there is no end in sight.

            marco.greco Marco Greco made changes -
            Assignee Marco Greco [ marco.greco ] Abhishek Jindal [ abhishek.jindal ]

            abhishek.jindal Abhishek Jindal added a comment -

            Thanks for the insights! The following is more of a question for the SDK:

            Because eventing simply passes the DML_failure from lcb through to the JS code, the user is left with very limited options.
            At best they can catch the exception and do a backoff retry, or stash the mutation.
            Hence, it would make sense for LCB to add DML_Failure to the list of retriable errors and retry it at least once more before passing it on to eventing. Would like to hear your thoughts on this. CC: Sergey Avseyev

            Marco Greco - Is the behaviour discussed under Scenario 2 for multi-document upserts correct? If yes, what should the error handling mechanism (if any) be, from the client's perspective, for scenarios with partial upsert failures?

            abhishek.jindal Abhishek Jindal made changes -
            Assignee Abhishek Jindal [ abhishek.jindal ] Marco Greco [ marco.greco ]
            marco.greco Marco Greco added a comment - - edited

            Welcome to the world of DML failure retries!
            This, for context, has been one of the pain points of relational engines for the past 40 years.
            In essence, the discussion goes something along these lines:

            1. if it is a single-row (or, in our case, single-document) mutation, it can be retried automatically, either by the application or the SDK
            2. if it is a multi-row mutation that has made precisely 0 changes, same as above
            3. if it is a multi-row mutation that has already made some changes, the options are either to roll back and retry, assuming that transactions are available and have been used (though this has major drawbacks - rolling back gets more and more expensive with the size of the transaction), or to track the last successful DML id and restart from the next one (see the sketch after this list).
              Both of these can only be done at the application level.
              This discussion also assumes that the failure is transient (e.g. a lock, or lack of resources) and not terminal, such as a constraint violation - in which case the failing row should be skipped entirely.
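
            As a sketch of the "restart from the last successful DML id" option in point 3, an application-level resume could look roughly like this. upsertOne and the retry bound are hypothetical; this is not an Eventing or SDK API, just an illustration of the idea.

            function upsertBatchWithResume(keys, upsertOne, maxAttempts) {
                var next = 0;                          // index of the first key not yet applied
                for (var attempt = 1; attempt <= maxAttempts; attempt++) {
                    try {
                        for (; next < keys.length; next++) {
                            upsertOne(keys[next]);     // assumed to throw on a transient failure
                        }
                        return true;                   // whole batch applied
                    } catch (e) {
                        log("Batch stopped at key " + keys[next] + ", resuming: ", e);
                    }
                }
                return false;                          // gave up; keys[next..] are still pending
            }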

            When we had the whole "let's retry NMVB" discussion with Jeelan, it was clearly stated that the problem we were trying to solve was the retry of a single-document DML for <name of important client redacted>.

            marco.greco Marco Greco made changes -
            Assignee Marco Greco [ marco.greco ] Abhishek Jindal [ abhishek.jindal ]
            jeelan.poola Jeelan Poola added a comment -

            Marco Greco Need a quick confirmation from the Query side on the questions below.

            Will Query guarantee that, in future too, for scenario [3], it will never return DML_FAILURE, as is the case today (looking at the query code at https://github.com/couchbase/query/blob/master/datastore/couchbase/couchbase.go#L1832-L1949, where query actually returns success for a partially successful multi-row mutation)?

            Conversely, does Query guarantee that, in future too, it will return DML_FAILURE only in cases [1] and [2] you listed?

            If the answer is yes, we can request the SDK team to set the 'isRetriable' flag to 'true' when they return DML_FAILURE to eventing.

            The above is, in summary, what we are looking for to close out this topic. Eventing should retry only on errors for which the underlying subsystems (SDK, Query) provide a contractual guarantee that they are indeed retriable from the application side.

            jeelan.poola Jeelan Poola made changes -
            Assignee Abhishek Jindal [ abhishek.jindal ] Marco Greco [ marco.greco ]
            marco.greco Marco Greco added a comment -

            Jeelan Poola we can't guarantee that.
            The behaviour is always going to be: N1QL retries against the node in the vbMap returned by the KV; if that fails, N1QL retries after obtaining an amended vbMap from ns_server (with backoff). But we have to limit the wait to a finite amount of time, as we do now; otherwise a request could never terminate, and if enough requests stall, the N1QL node would not be able to service anything.
            Again, to put it in perspective, you are seeing NMVBs right now after N1QL has waited nearly a thousand times longer than a KV request should reasonably take.
            We are also waiting at each retry the amount of time that ns_server should reasonably take to supply a new vbMap (100ms), and we are doing that, in your environment, 12 times.
            If ns_server doesn't amend its own state after a very generous delay, we have to give up at some stage.
            We could guarantee that we will not return a DML_FAILURE only if ns_server could guarantee that we will not get more than n <choose your value here> NMVBs on retry with backoff.
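
            For reference, the retry budget described above works out roughly as follows (a back-of-the-envelope sketch; the node count of 6 is inferred from the "12 times" figure above):

            var kvNodes = 6;                           // inferred: 2 x 6 retries = 12 in this environment
            var retryDelayMs = 100;                    // time ns_server should reasonably take to supply a new vbMap
            var retries = 2 * kvNodes;                 // 12 retries
            var totalWaitMs = retries * retryDelayMs;  // 1200 ms, i.e. ~1.2 seconds before giving up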

            marco.greco Marco Greco made changes -
            Assignee Marco Greco [ marco.greco ] Abhishek Jindal [ abhishek.jindal ]
            marco.greco Marco Greco added a comment -

            So, to clarify - error 12009, with a cause, will only be thrown if no mutations were made.
            As long as you are aware that 12009 can be thrown for all sorts of reasons, including "duplicate key" and "key not found", and not just transient errors, you could use that for the retry flag.

            jeelan.poola Jeelan Poola added a comment -

            Thank you Marco Greco! That should help in deciding the 'retriability' of the DML_FAILURE error.

            Michael Nitschinger Sergey Avseyev Does the SDK already disambiguate 12009-for-transient-errors [a] from 12009-for-duplicate-key-or-key-not-found-and-other-specific-errors [b] and map 12009 to DML_FAILURE only in case [a]? If yes, could you please set the 'retriable' flag to 'true' when returning the error to the application/eventing?

            jeelan.poola Jeelan Poola made changes -
            Assignee Abhishek Jindal [ abhishek.jindal ] Michael Nitschinger [ daschl ]

            daschl Michael Nitschinger added a comment -

            Jeelan Poola with recent SDKs, as we've discussed on a different ticket with eventing and query:

            • 12009 with the string "CAS mismatch" => CasMismatchException
            • all other 12009 errors => DmlFailureException

            The SDKs do not do any further disambiguation of the cause string at the moment; this would have to be spec'ed out and rolled out across all the SDKs consistently. We need a proposal from query on how we should handle certain combinations in this case.
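
            In other words, the current mapping amounts to something like the model below. This is illustrative only, not SDK source; the function name and the fall-through return value are placeholders.

            function mapQueryDmlError(err) {
                if (err.code !== 12009) {
                    return null;                       // not a DML failure; handled elsewhere
                }
                if ((err.msg || "").indexOf("CAS mismatch") !== -1) {
                    return "CasMismatchException";
                }
                return "DmlFailureException";          // every other 12009 today
            }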

            daschl Michael Nitschinger made changes -
            Assignee Michael Nitschinger [ daschl ] Jeelan Poola [ jeelan.poola ]
            lynn.straus Lynn Straus made changes -
            Due Date 21/May/21 28/May/21
            lynn.straus Lynn Straus made changes -
            Link This issue blocks MB-44894 [ MB-44894 ]
            lynn.straus Lynn Straus made changes -
            Labels affects-cc-testing volume-test affects-cc-testing approved-for-7.0.0 volume-test
            jeelan.poola Jeelan Poola added a comment - - edited

            Thank you Michael Nitschinger!

            Marco Greco Could you please help clarify how the SDK can further disambiguate between

            1. 12009-due-to-transient-errors-that-are-retriable-by-application
            2. 12009-for-duplicate-key-or-key-not-found-and-other-specific-errors-that-are-not-retriable-by-application

            using the cause string? Are there changes required on the Query side to be able to do this?

            Based on your inputs, we can log appropriate tickets for tracking. Thanks a lot in advance!

            jeelan.poola Jeelan Poola made changes -
            Assignee Jeelan Poola [ jeelan.poola ] Marco Greco [ marco.greco ]
            marco.greco Marco Greco added a comment -

            Much like for CAS mismatch, the issue is reported in the Cause: part of the text.
            For vNext we plan to remove causes and such from the actual text, and enrich the error with fields like "cause" and "action".
            Those fields are already part of the error structure, but haven't been externalized (mainly because the developer who started the work has left).

            marco.greco Marco Greco made changes -
            Assignee Marco Greco [ marco.greco ] Abhishek Jindal [ abhishek.jindal ]
            jeelan.poola Jeelan Poola made changes -
            Link This issue depends on CCBC-1411 [ CCBC-1411 ]

            abhishek.jindal Abhishek Jindal added a comment -

            Just to clarify further: for any DML failure, query can return an error object like the following:

            {
                "code": 12009,
                "msg": "DML Error, possible causes include concurrent modificationFailed to perform INSERT - cause: Duplicate Key doc1"
            }
            

            Currently the SDKs check the "code" and check whether "msg" contains the "CAS mismatch" substring. However, the "cause: Duplicate Key doc1" portion of msg is not being parsed. Ref: https://github.com/couchbase/libcouchbase/blob/c4a5430819aedf2985bed353428f134658127f47/src/n1ql/n1ql.cc#L582-L587

            LCB can use "cause" to differentiate document-not-found / document-already-exists from tmpfail, nmvb, etc. and easily decide whether the failure is retriable or not. Eventing can then simply retry those retriable DML failures.
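
            A rough sketch of that classification follows (a hypothetical helper, not an actual LCB change; the list of non-retriable causes is an assumption for illustration):

            function isRetriableDmlFailure(err) {
                if (err.code !== 12009) {
                    return false;
                }
                var msg = err.msg || "";
                // Assumed non-transient causes: retrying will not help for these.
                var nonRetriable = ["Duplicate Key", "Key Not Found", "CAS mismatch"];
                for (var i = 0; i < nonRetriable.length; i++) {
                    if (msg.indexOf(nonRetriable[i]) !== -1) {
                        return false;
                    }
                }
                // Everything else (nmvb, tmpfail, tmpoom, connection errors, ...) is
                // treated as transient and worth retrying.
                return true;
            }

            For example, isRetriableDmlFailure({ code: 12009, msg: "... cause: Duplicate Key doc1" }) would return false, while the same code with an NMVB-related cause would return true.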

            jeelan.poola Jeelan Poola made changes -
            Resolution Fixed [ 1 ]
            Status Reopened [ 4 ] Resolved [ 5 ]

            build-team Couchbase Build Team added a comment - Build couchbase-server-7.0.0-5249 contains eventing commit 7372718 with commit message: MB-45967 :

            build-team Couchbase Build Team added a comment - Build couchbase-server-7.0.0-5249 contains eventing commit dbaeeda with commit message: MB-45967 : Disambiguate between retriable and non-retriable DML errors

            build-team Couchbase Build Team added a comment - Build couchbase-server-7.0.1-5811 contains eventing commit 7372718 with commit message: MB-45967 :

            build-team Couchbase Build Team added a comment - Build couchbase-server-7.0.1-5811 contains eventing commit dbaeeda with commit message: MB-45967 : Disambiguate between retriable and non-retriable DML errors
            sujay.gad Sujay Gad added a comment - Verified using 7.0.0 - 5249 Test run - http://qa.sc.couchbase.com/job/component_systest_launcher_2/340/console
            sujay.gad Sujay Gad made changes -
            Status Resolved [ 5 ] Closed [ 6 ]
            lynn.straus Lynn Straus made changes -
            Fix Version/s 7.0.0 [ 17233 ]
            lynn.straus Lynn Straus made changes -
            Fix Version/s Cheshire-Cat [ 15915 ]
            ingenthr Matt Ingenthron made changes -
            Link This issue relates to MB-47366 [ MB-47366 ]
            ray.cardillo Ray Cardillo made changes -
            Link This issue relates to CBD-4565 [ CBD-4565 ]

            People

              abhishek.jindal Abhishek Jindal
              vikas.chaudhary Vikas Chaudhary
              Votes:
              0
              Watchers:
              15
