Couchbase Server / MB-44784

[Volume Test]: Timers not getting fired


Details


    Description

      Build: 7.0.0 build 4619 

      Test: Eventing Volume

      • Create a 9-node cluster (eventing: 3, KV: 4, index: 1, query: 1)
      • Create 15 handlers (3 each of bucket op, timer, N1QL, SBM, cURL)
      • Deploy the bucket op, timer and N1QL handlers
      • Rebalance in 1 eventing node
      • Start CRUD on bucket_op where no handlers are deployed
      • Verify handlers
      • Load more data
      • Add 2 KV nodes
      • Deploy the cURL and SBM handlers
      • Rebalance out 2 KV nodes
      • Verify all handlers
      • Pause the bucket op and timer handlers
      • Swap 2 KV nodes one after another
      • Load more data
      • Rebalance in 2 eventing nodes
      • Resume the bucket op and timer handlers
      • Verify all handlers
      • Load more data
      • Rebalance out 2 eventing nodes
      • Verify all handlers
      • Load more data
      • Swap 2 eventing nodes one after another
      • Verify all handlers
      • Load all data
      • Rebalance in 1 KV and 1 eventing node
      • Verify all handlers
      • Load all data
      • Rebalance out 1 KV and 1 eventing node
      • Verify all handlers
      • Load all data
      • Swap 1 KV and 1 eventing node

      Observed that all verifications for the timer handlers failed due to a data mismatch, and timer span documents (meta().id like 'eventing::%:sp') with sta != stp are still present in the metadata collections, i.e. timers are still pending:

      select count(*) from metadata.scope_1.coll0 where meta().id like 'eventing::%:sp' and sta != stp; --> 128

      select count(*) from metadata.scope_1.coll1 where meta().id like 'eventing::%:sp' and sta != stp; --> 128

      select count(*) from metadata.scope_1.coll2 where meta().id like 'eventing::%:sp' and sta != stp; --> 128
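
      For reference, the same check can also list the lingering timer span documents instead of only counting them; a minimal variant of the queries above (the LIMIT is added here purely to keep the output readable):

      select meta().id, sta, stp from metadata.scope_1.coll0
      where meta().id like 'eventing::%:sp' and sta != stp limit 10;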
      
      

       

      Handler     Source collection (doc count)        Destination collection (doc count)
      timers0_0   timer_op.scope_0.coll0 (4095000)     timer_op.scope_1.coll0 (4094498)
      timers1_0   timer_op.scope_0.coll1 (4091100)     timer_op.scope_1.coll1 (4089924)
      timers2_0   timer_op.scope_0.coll2 (4089600)     timer_op.scope_1.coll2 (4089127)
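
      The per-handler counts in the table can be re-checked with plain count queries against the source and destination keyspaces, for example for timers0_0 (keyspace names taken from the table above):

      select count(*) from timer_op.scope_0.coll0; -- source for timers0_0
      select count(*) from timer_op.scope_1.coll0; -- destination for timers0_0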

       

      N1QL queries are failing with a CAS mismatch:

      2021-03-05T13:20:27.900-08:00 [INFO] "Query failed: " {"message":"SDK error : LCB_ERR_CAS_MISMATCH (209) Query error : {\n\"requestID\": \"1bc36320-7bb6-4e5f-a47d-d33f7c235c62\",\n\"clientContextID\": \"4@n1ql0_0.js(OnUpdate)\",\n\"signature\": null,\n\"results\": [\n],\n\"errors\": [{\"code\":12009,\"msg\":\"DML Error, possible causes include CAS mismatch or concurrent modificationFailed to perform UPSERT - cause: dial tcp 172.23.105.25:11210: connect: cannot assign requested address\"}],\n\"status\": \"errors\",\n\"metrics\": {\"elapsedTime\": \"23.135053ms\",\"executionTime\": \"23.080655ms\",\"resultCount\": 0,\"resultSize\": 0,\"serviceLoad\": 15,\"errorCount\": 1}\n}\n","stack":"Error\n    at N1QL (<anonymous>)\n    at OnUpdate (n1ql0_0.js:4:21)"}

      2021-03-05T13:20:27.875-08:00 [INFO] "Query failed: " {"message":"SDK error : LCB_ERR_CAS_MISMATCH (209) Query error : {\n\"requestID\": \"1570ce91-f811-469c-b4b3-0ac8f1b57cca\",\n\"clientContextID\": \"4@n1ql0_0.js(OnUpdate)\",\n\"signature\": null,\n\"results\": [\n],\n\"errors\": [{\"code\":12009,\"msg\":\"DML Error, possible causes include CAS mismatch or concurrent modificationFailed to perform UPSERT - cause: dial tcp 172.23.105.25:11210: connect: cannot assign requested address\"}],\n\"status\": \"errors\",\n\"metrics\": {\"elapsedTime\": \"18.964561ms\",\"executionTime\": \"18.922147ms\",\"resultCount\": 0,\"resultSize\": 0,\"serviceLoad\": 9,\"errorCount\": 1}\n}\n","stack":"Error\n    at N1QL (<anonymous>)\n    at OnUpdate (n1ql0_0.js:4:21)"}
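
      The stack trace points at an inline N1QL UPSERT inside OnUpdate of the n1ql0_0 handler (n1ql0_0.js:4). The handler source is not attached to this ticket; purely as an illustration, an Eventing handler of that shape could look like the sketch below, where the destination keyspace dst_bucket.scope_1.coll0 and the way the document is passed are assumptions, not taken from the actual test handler:

      // Hypothetical sketch only: an Eventing handler that writes each mutation
      // into a destination collection through an inline N1QL UPSERT. A CAS mismatch
      // or KV connection failure inside that UPSERT surfaces as the "Query failed:"
      // application log lines quoted above.
      function OnUpdate(doc, meta) {
          var id = meta.id;
          var content = doc;
          UPSERT INTO dst_bucket.scope_1.coll0 (KEY, VALUE) VALUES ($id, $content);
      }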

      Attachments

        1. 693280967_timer2.log (273 kB)
        2. MB-44784.rtf (34 kB)
        3. metadata_items.png (452 kB)
        4. metadata_items-1.png (452 kB)
        5. metadata_scope1_coll1_1.png (352 kB)
        6. metadata_scope1_coll1_2.png (318 kB)
        7. metadata_scope1_coll1_3.png (300 kB)
        8. metadata_scope1_coll1_4.png (301 kB)
        9. RunLogs.txt (239 kB)
        10. Screenshot 2021-03-12 at 1.41.58 PM.png (70 kB)
        11. Screenshot 2021-03-12 at 3.23.20 PM.png (153 kB)
        12. Screenshot 2021-04-08 at 11.32.59 AM.png (606 kB)
        13. timer_2 metadata.zip (3.59 MB)
        14. timer_mismatch.png (356 kB)


          Activity

            mihir.kamdar Mihir Kamdar (Inactive) added a comment -

            Closing all non-fixed issues.

            abhishek.jindal Abhishek Jindal added a comment -

            Resolving this as a duplicate of MB-45967, as the two share the same cause. Once the fix for MB-45967 is in, we would request a rerun of the volume test for the timer handlers to verify.

            ritam.sharma Ritam Sharma added a comment (edited) -

            Update - Build 7.0.0-5127 - Cannot reproduce the issue - http://qa.sc.couchbase.com/job/component_systest_launcher_2/317/consoleFull - Run log here: https://issues.couchbase.com/secure/attachment/139949/RunLogs.txt

            ritam.sharma Ritam Sharma added a comment -

            QE Update - The test is running with the data requested by Dev - http://qa.sc.couchbase.com/job/component_systest_launcher_2/317/ - Running for 5 hours, not reproduced on 7.0.0-5127.

            abhishek.jindal Abhishek Jindal added a comment -

            In this case I do not see the collection getting dropped, so I am fairly sure that CCBC-1400 is not the cause here.
            From the logs, the mismatch of 2 documents started somewhere between 2021-05-04T02:45:20 and 2021-05-04T03:05:20 on node .214, vbucket 289:

            Documents on the source collection = 5424, while on the destination collection = 5422.
            At this time vb 289 was owned by eventing node .186. Additionally, at 2021-05-04T02:45:20 two eventing nodes, .182 and .185, were being rebalanced out. This rebalance lasted a minute at most and ended by 2021-05-04T02:46:54.

            By 2021-05-04T02:45:58 we got STREAMEND for this vb. At this time, seq no 7382 was the last sequence number sent from the Go producer side to the C++ consumer side:

            2021-05-04T02:45:58.400-07:00 [Info] Consumer::processDCPEvents [worker_timers0_0_0:/tmp/127.0.0.1:8091_0_3656683659.sock:1383] vb: 289 got STREAMEND
            2021-05-04T02:45:58.400-07:00 [Info] Consumer::sendVbFilterData [worker_timers0_0_0:/tmp/127.0.0.1:8091_0_3656683659.sock:1383] vb: 289 seqNo: 7382 sending filter data to C++
            

            The C++ side acknowledged that, out of 7382, seq no 7354 is the highest that has been processed:

            2021-05-04T02:45:58.498-07:00 [Info] eventing-consumer [worker_timers0_0_0:/tmp/127.0.0.1:8091_0_3656683659.sock:1383] vb: 289 seqNo: 7354 skip_ack: 0 sending filter ack to Go
            

            A filter is now set at 7354. The consumer notices that vb 289 should still be owned by the same consumer, so the DCP stream should be reclaimed. Hence it opens a new stream:

            2021-05-04T02:46:00.818-07:00 [Info] Consumer::dcpRequestStreamHandle [worker_timers0_0_0:/tmp/127.0.0.1:8091_0_3656683659.sock:1383] vb: 289 DCP stream start vbKvAddr: 172.23.97.214:11210 vbuuid: 60405539196205 startSeq: 7354 snapshotStart: 7354 snapshotEnd: 7354 Manifest id: c
            

            This confirms that no mutation was dropped as part of filtering.
            The metadata collection for this handler suggests no timers remain to be fired; note the counts before and after the markers in the attached metadata screenshots.

            The only remaining possibility is that the test may be adding some documents that are duplicates.
            It would help if:

            • Before deploying the handler, "app_log_max_size" is set to a large value such as 500 MB (524288000 bytes) so that the application logs do not rotate.
            • A dummy handler is deployed that listens on the same source collection and just logs log("Doc created/updated", meta.id); in the handler code (a sketch follows this list).
            • If we observe a mismatch by the end of the test, cbcollect plus the application logs from all eventing nodes will help narrow this down further.
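
            A minimal sketch of such a dummy handler, assuming it is bound to the same source collection as the timer handlers (handler name and bindings are placeholders):

            // Dummy audit handler: performs no writes, only logs every mutation seen on
            // the source collection so the application log can later be compared against
            // the destination collection counts.
            function OnUpdate(doc, meta) {
                log("Doc created/updated", meta.id);
            }

            function OnDelete(meta, options) {
                log("Doc deleted/expired", meta.id);
            }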

            People

              abhishek.jindal Abhishek Jindal
              vikas.chaudhary Vikas Chaudhary