Couchbase Server / MB-44784

[Volume Test]: Timers not getting fired



    Description

      Build: 7.0.0 build 4619 

      Test: Eventing Volume

      • Create a 9-node cluster (eventing: 3, KV: 4, index: 1, query: 1)
      • Create 15 handlers (3 each of bucket op, timer, N1QL, SBM, curl)
      • Deploy the bucket op, timer and N1QL handlers
      • Rebalance in 1 eventing node
      • Start CRUD on bucket_op collections where no handlers are deployed
      • Verify handlers
      • Load more data
      • Add 2 KV nodes
      • Deploy the curl and SBM handlers
      • Rebalance out 2 KV nodes
      • Verify all handlers
      • Pause the bucket op and timer handlers
      • Swap 2 KV nodes one after another
      • Load more data
      • Rebalance in 2 eventing nodes
      • Resume the bucket op and timer handlers
      • Verify all handlers
      • Load more data
      • Rebalance out 2 eventing nodes
      • Verify all handlers
      • Load more data
      • Swap 2 eventing nodes one after another
      • Verify all handlers
      • Load all data
      • Rebalance in 1 KV and 1 eventing node
      • Verify all handlers
      • Load all data
      • Rebalance out 1 KV and 1 eventing node
      • Verify all handlers
      • Load all data
      • Swap 1 KV and 1 eventing node
      • Verify all handlers

      Observed that all verifications for the timer handlers failed due to a data mismatch, and timer documents are still visible in the metadata collections:

      select count(*) from metadata.scope_1.coll0 where meta().id like 'eventing::%:sp' and sta != stp; --> 128
       
      select count(*) from metadata.scope_1.coll1 where meta().id like 'eventing::%:sp' and sta != stp; --> 128
       
      select count(*) from metadata.scope_1.coll2 where meta().id like 'eventing::%:sp' and sta != stp; --> 128
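
      For reference, a minimal sketch of running the same span check against each metadata collection from Python. The connection string, the credentials and the use of the Couchbase Python SDK query API are assumptions for illustration only, not part of the original test harness:

      # Hypothetical helper: count timer-span docs whose scan start (sta) has not
      # caught up with the stop marker (stp) in each metadata collection.
      from couchbase.auth import PasswordAuthenticator
      from couchbase.cluster import Cluster
      from couchbase.options import ClusterOptions

      cluster = Cluster("couchbase://172.23.105.109",   # any query node; address assumed
                        ClusterOptions(PasswordAuthenticator("Administrator", "password")))

      for coll in ("coll0", "coll1", "coll2"):
          stmt = (f"SELECT COUNT(*) AS cnt FROM metadata.scope_1.{coll} "
                  "WHERE META().id LIKE 'eventing::%:sp' AND sta != stp")
          for row in cluster.query(stmt):
              print(coll, row["cnt"])   # 128 per collection in the failing run above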
      
      

       

      Handler      Source collection (doc count)       Destination collection (doc count)
      timers0_0    timer_op.scope_0.coll0 (4095000)    timer_op.scope_1.coll0 (4094498)
      timers1_0    timer_op.scope_0.coll1 (4091100)    timer_op.scope_1.coll1 (4089924)
      timers2_0    timer_op.scope_0.coll2 (4089600)    timer_op.scope_1.coll2 (4089127)

       

      N1QL queries are failing with CAS mismatch:

      2021-03-05T13:20:27.900-08:00 [INFO] "Query failed: " {"message":"SDK error : LCB_ERR_CAS_MISMATCH (209) Query error : {\n\"requestID\": \"1bc36320-7bb6-4e5f-a47d-d33f7c235c62\",\n\"clientContextID\": \"4@n1ql0_0.js(OnUpdate)\",\n\"signature\": null,\n\"results\": [\n],\n\"errors\": [{\"code\":12009,\"msg\":\"DML Error, possible causes include CAS mismatch or concurrent modificationFailed to perform UPSERT - cause: dial tcp 172.23.105.25:11210: connect: cannot assign requested address\"}],\n\"status\": \"errors\",\n\"metrics\": {\"elapsedTime\": \"23.135053ms\",\"executionTime\": \"23.080655ms\",\"resultCount\": 0,\"resultSize\": 0,\"serviceLoad\": 15,\"errorCount\": 1}\n}\n","stack":"Error\n    at N1QL (<anonymous>)\n    at OnUpdate (n1ql0_0.js:4:21)"} 2021-03-05T13:20:27.875-08:00 [INFO] "Query failed: " {"message":"SDK error : LCB_ERR_CAS_MISMATCH (209) Query error : {\n\"requestID\": \"1570ce91-f811-469c-b4b3-0ac8f1b57cca\",\n\"clientContextID\": \"4@n1ql0_0.js(OnUpdate)\",\n\"signature\": null,\n\"results\": [\n],\n\"errors\": [{\"code\":12009,\"msg\":\"DML Error, possible causes include CAS mismatch or concurrent modificationFailed to perform UPSERT - cause: dial tcp 172.23.105.25:11210: connect: cannot assign requested address\"}],\n\"status\": \"errors\",\n\"metrics\": {\"elapsedTime\": \"18.964561ms\",\"executionTime\": \"18.922147ms\",\"resultCount\": 0,\"resultSize\": 0,\"serviceLoad\": 9,\"errorCount\": 1}\n}\n","stack":"Error\n    at N1QL (<anonymous>)\n    at OnUpdate (n1ql0_0.js:4:21)"}  

      Attachments

        1. 693280967_timer2.log (273 kB)
        2. MB-44784.rtf (34 kB)
        3. metadata_items.png (452 kB)
        4. metadata_items-1.png (452 kB)
        5. metadata_scope1_coll1_1.png (352 kB)
        6. metadata_scope1_coll1_2.png (318 kB)
        7. metadata_scope1_coll1_3.png (300 kB)
        8. metadata_scope1_coll1_4.png (301 kB)
        9. RunLogs.txt (239 kB)
        10. Screenshot 2021-03-12 at 1.41.58 PM.png (70 kB)
        11. Screenshot 2021-03-12 at 3.23.20 PM.png (153 kB)
        12. Screenshot 2021-04-08 at 11.32.59 AM.png (606 kB)
        13. timer_2 metadata.zip (3.59 MB)
        14. timer_mismatch.png (356 kB)

        Issue Links


          Activity

            vikas.chaudhary Vikas Chaudhary created issue -
            ritam.sharma Ritam Sharma made changes -
            Field Original Value New Value
            Summary Volume: Timers not getting fired [Volume Test]: Timers not getting fired
            jeelan.poola Jeelan Poola made changes -
            Assignee Jeelan Poola [ jeelan.poola ] Abhishek Jindal [ abhishek.jindal ]
            vikas.chaudhary Vikas Chaudhary made changes -

            vikas.chaudhary Vikas Chaudhary added a comment -

            Rerun on 7.0.0 build 4669

            Scope_0 is the source, scope_1 is the destination.

             

            volume_4669 → http://supportal.couchbase.com/snapshot/a5c987add8e2ba7de815826265fca1e4::0

            vikas.chaudhary Vikas Chaudhary made changes -
            Attachment backup_metadaa.zip [ 129842 ]
            abhishek.jindal Abhishek Jindal made changes -
            Attachment timer_2 metadata.zip [ 130648 ]
            abhishek.jindal Abhishek Jindal made changes -

            abhishek.jindal Abhishek Jindal added a comment -

            Captured a new set of logs around 30 minutes after the last rebalance:

            Volume_4669_take2 - https://supportal.couchbase.com/snapshot/a5c987add8e2ba7de815826265fca1e4::1

            Observed that while timers for timer_0 and _1 have caught up, there's still a huge difference for timer_2. Relevant metadata: timer_2 metadata.zip

            Document count for metadata.scope_1.coll2 (metadata collection for timer_2):

            122.6K docs = 1024 mapping docs + 128 timer spans + 52888 context keys + 66010 alarms + 2570 root keys
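
            A quick arithmetic check of that breakdown (the per-type counts are quoted from the line above; the only other assumption is that just the mapping docs and timer-span docs are expected to remain once every timer has fired, as noted later in this ticket):

            # Sanity-check the 122.6K figure by summing its components.
            components = {
                "mapping docs": 1024,
                "timer spans": 128,
                "context keys": 52888,
                "alarms": 66010,
                "root keys": 2570,
            }
            total = sum(components.values())
            print(total)                      # 122620, i.e. ~122.6K docs
            print(total - (1024 + 128))       # 121468 docs still tied to unfired timers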

            abhishek.jindal Abhishek Jindal made changes -
            Status Open [ 1 ] In Progress [ 3 ]
            abhishek.jindal Abhishek Jindal made changes -
            Attachment 693280967_timer2.log [ 131137 ]
            abhishek.jindal Abhishek Jindal added a comment - - edited

            A longer explanation is needed here from eventing's perspective, as this wasn't straightforward to spot. For LCB folks, TL;DR: only the last section of this comment is relevant.

            Taking some examples from timer_2 metadata.zip, we can see 10 timer partitions - from partition 688 to 760 - for which span.start (sta) hasn't moved forward at all. Example:

              {
                "coll2": {
                  "sta": 1615536930,
                  "stp": 1615533563
                },
                "id": "eventing::2093318803:tm:688:sp"
              },
            

            Epoch 1615536930 = 12 March 2021, 00:15:30 (UTC-08:00).

            By contrast, note the timer span for a "good" partition, for example 682:

              {
                "coll2": {
                  "sta": 1615542460,
                  "stp": 1615533563
                },
                "id": "eventing::2093318803:tm:680:sp"
              },
            

            Epoch 1615542460 = 12 March 2021, 01:47:40 (UTC-08:00)

            So, the bottom line is that for these 10 partitions (688-760) the last successful timer scan was at 12 March 2021, 00:15:30 (UTC-08:00). Put differently, the first failure happened while scanning root keys starting at 12 March 2021, 00:15:37 (UTC-08:00), i.e. after the 7-second scan boundary.

            If one checks the zip file above, we see that all context keys, alarms and root keys belong to those 10 partitions.
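
            The epoch conversions above are easy to reproduce; a small sketch (the timestamps and the 7-second scan interval are the values quoted in this comment, the rest is standard library):

            from datetime import datetime, timedelta, timezone

            TZ = timezone(timedelta(hours=-8))            # log timestamps are UTC-08:00

            stuck_sta = 1615536930                        # span start of a stuck partition (688)
            good_sta = 1615542460                         # span start of a healthy partition

            print(datetime.fromtimestamp(stuck_sta, TZ))      # 2021-03-12 00:15:30-08:00
            print(datetime.fromtimestamp(good_sta, TZ))       # 2021-03-12 01:47:40-08:00
            # First failing root-key scan starts one 7-second boundary after the last success:
            print(datetime.fromtimestamp(stuck_sta + 7, TZ))  # 2021-03-12 00:15:37-08:00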

            On to the next section: why is the timer scan not processing these root keys?

            Partitions 688-760 are assigned to eventing node 172.23.121.165. Given that the prolonged effect of the issue is seen for the timers2_0 function, let's check the eventing-consumer logs for this function only: 693280967_timer2.log

            The attached file is an extract of the logs for the timers2_0 worker thread using the LCB instance with ID 693280967.

            Here we can see that from 2021-03-12T00:15:55 onwards (reminder: around the last time the timer scan for partition 688 and the others was successful), the libcouchbase instance used by the worker thread belonging to the timers2_0 function on eventing node 121.165 started reporting timeouts for get_cluster_config calls destined only for KV node 172.23.104.16 (no problems with calls to other KV nodes):

             2021-03-12T00:15:55.132-08:00 [Info] eventing-consumer [worker_timers2_0_0:/tmp/127.0.0.1:8091_0_2093318803.sock:16472] [lcb,cccp L:187 I:693280967] <NOHOST:NOPORT> (CTX=(nil),) Could not get configuration: LCB_ERR_TIMEOUT (201)
             
             2021-03-12T00:16:51.131-08:00 [Info] eventing-consumer [worker_timers2_0_0:/tmp/127.0.0.1:8091_0_2093318803.sock:16472] [lcb,server L:814 I:693280967] Failing command with error LCB_ERR_TIMEOUT (201): {"b":"metadata","i": "544d25282952a0c7/2ec3390c27d099ff/b023c","l":"172.23.121.165:44534","r":"172.23.104.16:11210","s":"kv:get_cluster_config","t":58000000}
            ...
            ...
             2021-03-12T01:05:12.265-08:00 [Info] eventing-consumer [worker_timers2_0_0:/tmp/127.0.0.1:8091_0_2093318803.sock:16472] [lcb,server L:814 I:693280967] Failing command with error LCB_ERR_TIMEOUT (201): {"b":"metadata","i": "544d25282952a0c7/2ec3390c27d099ff/b093b","l":"172.23.121.165:44534","r":"172.23.104.16:11210","s":"kv:get_cluster_config","t":58000000}
            ...
            ...
             2021-03-12T01:05:13.279-08:00 [Info] eventing-consumer [worker_timers2_0_0:/tmp/127.0.0.1:8091_0_2093318803.sock:16472] [lcb,server L:1108 I:693280967] <172.23.104.18:11210> (CTX=0x7f261c011d50,memcached, SRV=0x7f262c1afe60,IX=2) Got socket error LCB_ERR_SOCKET_SHUTDOWN (1025)
            ...
            ...
             2021-03-12T01:05:13.280-08:00 [Info] eventing-consumer [worker_timers2_0_0:/tmp/127.0.0.1:8091_0_2093318803.sock:16472] [lcb,connection L:153 I:693280967] <172.23.104.18:11210> (SOCK=9607dcb76e53922a) Connected established
            ...
            ...
            ...
            2021-03-12T01:05:14.766-08:00 [Info] eventing-consumer [worker_timers2_0_0:/tmp/127.0.0.1:8091_0_2093318803.sock:16472] [lcb,server L:814 I:693280967] Failing command with error LCB_ERR_TIMEOUT (201): {"b":"metadata","i": "544d25282952a0c7/2ec3390c27d099ff/b093d","l":"172.23.121.165:44534","r":"172.23.104.16:11210","s":"kv:get_cluster_config","t":58000000}
             2021-03-12T01:05:15.280-08:00 [Info] eventing-consumer [worker_timers2_0_0:/tmp/127.0.0.1:8091_0_2093318803.sock:16472] [lcb,cccp L:187 I:693280967] <NOHOST:NOPORT> (CTX=(nil),) Could not get configuration: LCB_ERR_TIMEOUT (201)
            ...
            ...
            ...
             2021-03-12T01:53:40.799-08:00 [Info] eventing-consumer [worker_timers2_0_0:/tmp/127.0.0.1:8091_0_2093318803.sock:16472] [lcb,server L:814 I:693280967] Failing command with error LCB_ERR_TIMEOUT (201): {"b":"metadata","i": "544d25282952a0c7/2ec3390c27d099ff/b11a1","l":"172.23.121.165:44534","r":"172.23.104.16:11210","s":"kv:get_cluster_config","t":58000000}
            

            Note the log snippet above: the lcb errors started at 2021-03-12T00:15:55 and continued until 2021-03-12T01:05:12. At that point there was a socket shutdown at 2021-03-12T01:05:13.
            A new connection was then established within the very next second, at 2021-03-12T01:05:13.
            However, even with the new connection, similar timeouts for get_cluster_config continued until the logs were collected.

            From the perspective of memcached on node 172.23.104.16, there are no slow GET_CLUSTER_CONFIG calls. Slowest is 102 ms:

             [ 10.75 - 102.40]ms (99.9995%)       1|
            

            Additionally, eventing is now passing a 58-second timeout to libcouchbase for all KV operations. Worth noting that the libcouchbase instances seem to be retrying every 10 seconds.

            Effect of these timeouts on eventing:

            With such continuous timeouts, no timer documents in the metadata bucket owned by KV node 104.16 can be accessed. This is why the timer scan ultimately gives up on those partitions.

            Worth noting that this exact issue has also been seen in MB-44913, where such get_cluster_config timeouts were preventing a handler from being paused.

            Michael Nitschinger, given that the root cause in this case matches MB-44913, assigning this to you.

            Please let me know if further clarification is needed.

            abhishek.jindal Abhishek Jindal made changes -
            Component/s clients [ 10042 ]
            Component/s eventing [ 14026 ]
            abhishek.jindal Abhishek Jindal made changes -
            Assignee Abhishek Jindal [ abhishek.jindal ] Matt Ingenthron [ ingenthr ]
            abhishek.jindal Abhishek Jindal made changes -
            Assignee Matt Ingenthron [ ingenthr ] Michael Nitschinger [ daschl ]
            jeelan.poola Jeelan Poola made changes -
            Labels volume-test affects-cc-testing volume-test
            jeelan.poola Jeelan Poola added a comment -

            Sergey Avseyev Since Michael Nitschinger is out this week, would be great if you can help with this one. This affects 7.0 eventing volume test. Thank you!


            ingenthr Matt Ingenthron added a comment -

            Next action is with Sergey Avseyev. Pablo Silberkasten, if you could please track this one as well.

            ingenthr Matt Ingenthron made changes -
            Assignee Michael Nitschinger [ daschl ] Sergey Avseyev [ avsej ]

            avsej Sergey Avseyev added a comment -

            Jeelan Poola, as I wrote in MB-44913, it might be that the library does not wait the full 58 seconds for the operation to fail. Could you re-run the test with this patch?
            http://review.couchbase.org/c/libcouchbase/+/149400

            avsej Sergey Avseyev made changes -
            Assignee Sergey Avseyev [ avsej ] Jeelan Poola [ jeelan.poola ]

            jeelan.poola Jeelan Poola added a comment -

            Ankit Prabhu Vikas Chaudhary Could you please help with the toy build and required testing?

            jeelan.poola Jeelan Poola made changes -
            Assignee Jeelan Poola [ jeelan.poola ] Ankit Prabhu [ ankit.prabhu ]
            abhishek.jindal Abhishek Jindal added a comment -

            Vikas Chaudhary We would need to run the same test again with the same toy build mentioned in MB-44913: http://latestbuilds.service.couchbase.com/builds/latestbuilds/couchbase-server/toybuilds/12444/

            vikas.chaudhary Vikas Chaudhary added a comment -

            Abhishek Jindal we are blocked on the deployment-stuck issue. We hit the same with the toy build as well.

            ankit.prabhu Ankit Prabhu made changes -
            Assignee Ankit Prabhu [ ankit.prabhu ] Vikas Chaudhary [ vikas.chaudhary ]

            vikas.chaudhary Vikas Chaudhary added a comment -

            Abhishek Jindal can we have a toy build with the fix for the deployment hang as well, so we can re-run the test?

            vikas.chaudhary Vikas Chaudhary made changes -
            Assignee Vikas Chaudhary [ vikas.chaudhary ] Abhishek Jindal [ abhishek.jindal ]
            jeelan.poola Jeelan Poola added a comment - - edited

            Vikas Chaudhary It would be good not to wait for this, as the Query team is waiting for confirmation of whether the query fixes for the CAS mismatch issues are working.


            abhishek.jindal Abhishek Jindal added a comment -

            Vikas Chaudhary Please use the build at http://latestbuilds.service.couchbase.com/builds/latestbuilds/couchbase-server/toybuilds/12538/ . This has the fix for the deployment-hang issue plus an additional LCB patch for debugging the get_cluster_config timeout.

            jeelan.poola Jeelan Poola made changes -
            Assignee Abhishek Jindal [ abhishek.jindal ] Vikas Chaudhary [ vikas.chaudhary ]
            ritam.sharma Ritam Sharma made changes -
            Assignee Vikas Chaudhary [ vikas.chaudhary ] Jeelan Poola [ jeelan.poola ]
            ritam.sharma Ritam Sharma added a comment -

            Jeelan Poola - We don't have cycles to run volume testing at this time.

            jeelan.poola Jeelan Poola made changes -
            Assignee Jeelan Poola [ jeelan.poola ] Sergey Avseyev [ avsej ]
            pablo.silberkasten Pablo Silberkasten (Inactive) made changes -
            Assignee Sergey Avseyev [ avsej ] Jeelan Poola [ jeelan.poola ]

            pablo.silberkasten Pablo Silberkasten (Inactive) added a comment -

            Changing assignment until we get an update.

            jeelan.poola Jeelan Poola added a comment - - edited

            Pablo Silberkasten Would be great if you can work with QE directly. We are just an intermediary providing the necessary toy builds etc. Thank you!

            Cc Ritam Sharma Chanabasappa Ghali from QE for help. As noted by Ritam earlier, they do not have cycles to repeat the volume test at this point.

            jeelan.poola Jeelan Poola made changes -
            Assignee Jeelan Poola [ jeelan.poola ] Pablo Silberkasten [ JIRAUSER25235 ]
            pablo.silberkasten Pablo Silberkasten (Inactive) made changes -
            Assignee Pablo Silberkasten [ JIRAUSER25235 ] Vikas Chaudhary [ vikas.chaudhary ]
            ritam.sharma Ritam Sharma added a comment -

            Sorry Pablo Silberkasten - please reproduce the issue on the dev side and give us the final fix to be tested. As I mentioned earlier, QE has no cycles at this point in time.

            ritam.sharma Ritam Sharma made changes -
            Assignee Vikas Chaudhary [ vikas.chaudhary ] Pablo Silberkasten [ JIRAUSER25235 ]
            arunkumar Arunkumar Senthilnathan added a comment - - edited

            Test running at http://172.23.109.231/job/centos-systest-launcher-3/91/console with toy build http://latestbuilds.service.couchbase.com/builds/latestbuilds/couchbase-server/toybuilds/12538/ - will keep monitoring.

            CC Pablo Silberkasten Raju Suravarjjala Matt Ingenthron
            arunkumar Arunkumar Senthilnathan made changes -
            Assignee Pablo Silberkasten [ JIRAUSER25235 ] Abhishek Jindal [ abhishek.jindal ]

            arunkumar Arunkumar Senthilnathan added a comment -

            Abhishek Jindal really appreciate the time and help - let me know if we need to run the test again in debug mode.


            Separating libcouchbase & eventing issues with timers in volume test. We shall use MB-44913 to track libcouchbase issue. And use this ticket to track eventing side of the problem reported in the latest logs.

            jeelan.poola Jeelan Poola made changes -
            Component/s eventing [ 14026 ]
            Component/s clients [ 10042 ]
            abhishek.jindal Abhishek Jindal made changes -
            Attachment metadata_scope1_coll1_1.png [ 134635 ]
            abhishek.jindal Abhishek Jindal made changes -
            Attachment metadata_scope1_coll1_2.png [ 134636 ]
            abhishek.jindal Abhishek Jindal made changes -
            Attachment metadata_scope1_coll1_3.png [ 134637 ]
            abhishek.jindal Abhishek Jindal made changes -
            Attachment metadata_scope1_coll1_4.png [ 134638 ]
            abhishek.jindal Abhishek Jindal made changes -
            abhishek.jindal Abhishek Jindal added a comment - - edited

            Adding some notes from the analysis done over the past few days:

            To start with: the problem seen in the latest set of logs attached above doesn't have any symptoms of kv:get_cluster_config ops getting timed out. Hence the root cause (while we aren't sure of it yet) is completely different from the first occurrence.

            As per the above set of logs, we do observe that all timer documents that were created were also fired. Note that towards the end of log collection there are 1152 docs in the metadata bucket (1024 checkpointing docs + 128 timer partitions). No pending timers to fire (see the per-node counts in the attached metadata_scope1_coll1_1.png through metadata_scope1_coll1_4.png screenshots):
            • That's 289 + 290 + 287 + 286 = 1152 docs towards the end of the 19:55 log collection (see the quick check after this list).
            • Also, no timer create failures and no dcp_parse_failures.
              This leaves us with only one speculation - were any sequences dropped / not pushed to the C++ worker queue?
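
            A quick check of those numbers (the four per-node counts are read from the screenshots above, so treat them as illustrative; the 1024 + 128 split is quoted from the note above):

            # Sum of metadata.scope_1.coll1 doc counts across the four KV nodes.
            per_node_docs = [289, 290, 287, 286]
            total = sum(per_node_docs)                # 1152

            expected_steady_state = 1024 + 128        # checkpointing docs + timer-span docs
            assert total == expected_steady_state     # nothing left besides bookkeeping docs,
                                                      # i.e. no alarm/context/root keys pending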

            The difference in document count was observed from 2021-04-01T19:53 PST onwards, when:
            1. No KV node rebalance in or out took place. At this time the KV nodes in the cluster were .232, .244, .245 and .25.
            2. By this time (at 19:47:01) one eventing node, .102, had been rebalanced into the cluster. Hence the eventing nodes in the cluster post rebalance were .90, .29, .75 and .102.

            The document count difference was observed only for KV nodes .244 and .245.
            For this analysis, I'll take up node .245, where there was a difference of 243 documents.
            Out of those 243 documents, vbuckets 590, 585 and 581 on node .245 were the ones that had fewer documents in the destination collection compared to the source collection.
            Eventing consumer DCP streams for these vbuckets were always owned by node .90.
            So, we will focus only on whether all sequences for one of these vbuckets - say 585 - made it to the queue:

            Create operations from the test on the source collection completed between 10:43:40 and 19:44:40. There were no update / create operations on any vbucket after that. At this time, we can see that the CPP worker queue on node .90 for the timers1 function was already full:

            From the captured stats we see that the CPP worker was busy draining its queue of size 120 MB at 19:44:22, which came down to 0 by 19:53:22 (9-10 minutes).
            Also worth noting that for 2 minutes straight after the operations were pushed to the source collection by 19:44:20, the CPP worker was hitting its quota, until 19:46:24:

            2021-04-01T19:46:24.269-07:00 [Info] DCPT[eventing:1YJqn9qz-61:{eventing:1YJqn9qz-60:worker_timers1_0_0_172.23.104.244:11210_172.23.105.90:8096}/0] DCP-socket -> eventing blocked 2m44.980087329s (0.983113%)
            2021-04-01T19:46:24.269-07:00 [Info] DCPT[eventing:1YJqn9qz-57:{eventing:1YJqn9qz-56:worker_timers1_0_0_172.23.104.245:11210_172.23.105.90:8096}/0] DCP-socket -> eventing blocked 2m44.980162398s (0.981598%)
            2021-04-01T19:46:29.269-07:00 [Info] DCPT[eventing:1YJqn9qz-57:{eventing:1YJqn9qz-56:worker_timers1_0_0_172.23.104.245:11210_172.23.105.90:8096}/0] DCP-socket -> eventing blocked 2.991422135s (0.981982%)
            2021-04-01T19:46:29.269-07:00 [Info] DCPT[eventing:1YJqn9qz-57:{eventing:1YJqn9qz-56:worker_timers1_0_0_172.23.104.245:11210_172.23.105.90:8096}/0] DCP-socket -> eventing blocked 2.991422135s (0.981982%)
            2021-04-01T19:46:29.270-07:00 [Info] DCPT[eventing:1YJqn9qz-61:{eventing:1YJqn9qz-60:worker_timers1_0_0_172.23.104.244:11210_172.23.105.90:8096}/0] DCP-socket -> eventing blocked 2.992423304s (0.983452%)
            2021-04-01T19:46:38.268-07:00 [Info] DCPT[eventing:1YJqn9qz-61:{eventing:1YJqn9qz-60:worker_timers1_0_0_172.23.104.244:11210_172.23.105.90:8096}/0] DCP-socket -> eventing blocked 4.981072891s (0.983806%)
            

            Note that the queue was blocked for almost 2.5 minutes.
            It is possible that a StreamEnd arriving at that time (as part of the rebalance) might have caused some operations to be dropped.

            Arunkumar Senthilnathan Vikas Chaudhary The overarching points of the entire analysis are:

            1. This looks like a negative test scenario (one where the user's cluster / quota is severely undersized). It is worth testing again, but with a large enough quota that the consumer doesn't get blocked for as much as 3 minutes.
            2. The main purpose here is to recreate the get_cluster_config timeouts from libcouchbase that were preventing the timers from being fired in a previous build.
            3. At the time this test was run, eventing was using gocb version 2.2.2, which has proved to be unstable. Now that we have reverted to 2.1.x, it is worth running this test again.

            abhishek.jindal Abhishek Jindal made changes -
            Assignee Abhishek Jindal [ abhishek.jindal ] Vikas Chaudhary [ vikas.chaudhary ]
            arunkumar Arunkumar Senthilnathan added a comment - - edited

            Abhishek Jindal I kicked off the test against 7.0.0-4897 and the issue is still observed:

            [2021-04-08T12:15:08-07:00, sequoiatools/eventing:7.0:9826f3] eventing_helper.py -i 172.23.105.109 -u Administrator -p password -o verify -s bucket_op.scope_0.coll0 -d bucket_op.scope_1.coll0
             
            Error occurred on container - sequoiatools/eventing:7.0:[eventing_helper.py -i 172.23.105.109 -u Administrator -p password -o verify -s bucket_op.scope_0.coll0 -d bucket_op.scope_1.coll0]
             
            docker logs 9826f3
            docker start 9826f3
             
            {'username': 'Administrator', 'log_level': 'INFO', 'name': None, 'state': None, 'number': 1, 'metadata': None, 'source': 'bucket_op.scope_0.coll0', 'host': '172.23.105.109', 'sleep': 60, 'timeout': 1200, 'bindings': ['bucket_op.scope_1.coll0'], 'password': 'password', 'type': None, 'operation': 'verify', 'sbm': False, 'wait': False}
            No of docs in source and destination : Source Bucket(bucket_op.scope_0.coll0) : 1220387, Destination Bucket(['bucket_op.scope_1.coll0']) : 131051
            No of docs in source and destination : Source Bucket(bucket_op.scope_0.coll0) : 1296176, Destination Bucket(['bucket_op.scope_1.coll0']) : 168286
            No of docs in source and destination : Source Bucket(bucket_op.scope_0.coll0) : 1483290, Destination Bucket(['bucket_op.scope_1.coll0']) : 285311
            No of docs in source and destination : Source Bucket(bucket_op.scope_0.coll0) : 1625547, Destination Bucket(['bucket_op.scope_1.coll0']) : 361436
            Traceback (most recent call last):
              File "eventing_helper.py", line 317, in <module>
                EventingHelper().run()
              File "eventing_helper.py", line 67, in run
                self.verify_doc(options)
              File "eventing_helper.py", line 314, in verify_doc
                format(source_collection, source_count["results"][0], destination_collection, binding_count["results"][0]))
            Exception: No of docs in source and destination don't match: Source Bucket(bucket_op.scope_0.coll0) : 2415062, Destination Bucket(['bucket_op.scope_1.coll0']): 2140821
            No of docs in source and destination : Source Bucket(bucket_op.scope_0.coll0) : 1716089, Destination Bucket(['bucket_op.scope_1.coll0']) : 402566
            No of docs in source and destination : Source Bucket(bucket_op.scope_0.coll0) : 1820008, Destination Bucket(['bucket_op.scope_1.coll0']) : 457524
            No of docs in source and destination : Source Bucket(bucket_op.scope_0.coll0) : 1951824, Destination Bucket(['bucket_op.scope_1.coll0']) : 508936
            No of docs in source and destination : Source Bucket(bucket_op.scope_0.coll0) : 2046427, Destination Bucket(['bucket_op.scope_1.coll0']) : 531105
            No of docs in source and destination : Source Bucket(bucket_op.scope_0.coll0) : 2183567, Destination Bucket(['bucket_op.scope_1.coll0']) : 569113
            No of docs in source and destination : Source Bucket(bucket_op.scope_0.coll0) : 2301100, Destination Bucket(['bucket_op.scope_1.coll0']) : 597927
            No of docs in source and destination : Source Bucket(bucket_op.scope_0.coll0) : 2380236, Destination Bucket(['bucket_op.scope_1.coll0']) : 635029
            No of docs in source and destination : Source Bucket(bucket_op.scope_0.coll0) : 2415062, Destination Bucket(['bucket_op.scope_1.coll0']) : 676556
            No of docs in source and destination : Source Bucket(bucket_op.scope_0.coll0) : 2415062, Destination Bucket(['bucket_op.scope_1.coll0']) : 783718
            No of docs in source and destination : Source Bucket(bucket_op.scope_0.coll0) : 2415062, Destination Bucket(['bucket_op.scope_1.coll0']) : 924686
            No of docs in source and destination : Source Bucket(bucket_op.scope_0.coll0) : 2415062, Destination Bucket(['bucket_op.scope_1.coll0']) : 1102666
            No of docs in source and destination : Source Bucket(bucket_op.scope_0.coll0) : 2415062, Destination Bucket(['bucket_op.scope_1.coll0']) : 1261516
            No of docs in source and destination : Source Bucket(bucket_op.scope_0.coll0) : 2415062, Destination Bucket(['bucket_op.scope_1.coll0']) : 1400633
            No of docs in source and destination : Source Bucket(bucket_op.scope_0.coll0) : 2415062, Destination Bucket(['bucket_op.scope_1.coll0']) : 1580987
            No of docs in source and destination : Source Bucket(bucket_op.scope_0.coll0) : 2415062, Destination Bucket(['bucket_op.scope_1.coll0']) : 1775112
            No of docs in source and destination : Source Bucket(bucket_op.scope_0.coll0) : 2415062, Destination Bucket(['bucket_op.scope_1.coll0']) : 1959983
            

            Job is running here: http://172.23.109.231/job/centos-systest-launcher-3/108/console

            Vikas Chaudhary I have triggered eagle-eye here: http://172.23.109.231/job/system_test_log_parser/1199/console - please work with Abhishek Jindal and give him the logs needed

            abhishek.jindal Abhishek Jindal added a comment - - edited

            Arunkumar Senthilnathan Vikas Chaudhary - The snippet attached above shows a difference in document count reported for the bucket_op handler. I would suggest using this MB solely for timers debugging; it's worth opening a new MB for bucket_ops.

            Secondly, the first time I noticed a difference in timer_op document count was at 2021-04-08T14:18:33 PST.
            Looking at the logs at this time, the numbers on KV nodes .232, .25 and .86 match up perfectly. Hence, it's evident that this difference was coming from KV nodes .244 and .245 (as seen before).

            Edit - I believe it is worth adding to the test framework the capability to run cbcollect on failed-over / rebalanced-out nodes, as those nodes have useful logging too.

            These 2 KV nodes were rebalanced out at 2021-04-08T17:05:15 PST, hence the current log collection does not include them.

            [2021-04-08T17:05:15-07:00, sequoiatools/couchbase-cli:7.0:95f8be] rebalance -c 172.23.104.232:8091 --server-remove 172.23.104.244,172.23.104.245 -u Administrator -p password
            

            In order to debug further, we need logs from KV nodes .244 and .245 covering the time around 2021-04-08T14:18:33 PST, so that I can see whether the difference is coming from the same vbuckets again.

            ritam.sharma Ritam Sharma made changes -
            Assignee Vikas Chaudhary [ vikas.chaudhary ] Sujay Gad [ JIRAUSER25279 ]
            sujay.gad Sujay Gad added a comment - - edited

            Abhishek Jindal still seeing a data mismatch for timers after running the test at a lower scale. PFA the latest logs: https://supportal.couchbase.com/snapshot/b08351820ce5140bc0e12e93d5294671::0

            sujay.gad Sujay Gad made changes -
            Assignee Sujay Gad [ JIRAUSER25279 ] Abhishek Jindal [ abhishek.jindal ]

            abhishek.jindal Abhishek Jindal added a comment -

            Discussed with Sujay Gad via a call. Given that the issue is not readily reproducible in my local environment, in order to eliminate further variables it would be great if we can scale down the volume test by:

            1. Removing the eventing node rebalance-in step.
            2. Pushing the new document load only against the source collection for one of the timer handlers.

            A live cluster plus a cbcollect at the end of the test, if the doc counts still do not match, will be helpful.

            Thanks!

            abhishek.jindal Abhishek Jindal made changes -
            Assignee Abhishek Jindal [ abhishek.jindal ] Sujay Gad [ JIRAUSER25279 ]
            ritam.sharma Ritam Sharma added a comment - - edited

            Abhishek Jindal - We have already scaled down the test to just 3k ops/sec for the cluster. Logs are available to review. The test is already in its simplest form and this is the first iteration.
            https://issues.couchbase.com/secure/attachment/135057/MB-44784.rtf

            Create 9 node cluster ( Eventing:3 , Kv:4,index:1,query:1)
            Create 15 handlers (3 of bucket op, timer,n1ql, SBM, curl)
            deploy bucket op, timers and N1ql
            Rebalance in 1 eventing node
            Start CRUD on bucket_op where no handler are deployed
            Verify handlers

            ritam.sharma Ritam Sharma made changes -
            Assignee Sujay Gad [ JIRAUSER25279 ] Abhishek Jindal [ abhishek.jindal ]
            ritam.sharma Ritam Sharma made changes -
            Attachment MB-44784.rtf [ 135057 ]
            abhishek.jindal Abhishek Jindal made changes -
            Link This issue depends on MB-45599 [ MB-45599 ]

            abhishek.jindal Abhishek Jindal added a comment -

            While the investigation for the later volume test runs is still underway, I have raised MB-45599 with ns_server, which triggered a continuous stream of get_cluster_config timeouts.


            abhishek.jindal Abhishek Jindal added a comment -

            Ritam Sharma As discussed yesterday, given that we have KV fixes for MB-45670 and MB-45599 in the latest builds, can we run this volume test again? This run will eliminate those 2 variables (hopefully making debugging easier). If possible, a similar scaled-down test with the following will be helpful:

            1. Create a 9-node cluster (eventing: 3, KV: 4, index: 1, query: 1)
            2. Create 15 handlers (3 each of bucket op, timer, N1QL, SBM, curl)
            3. Deploy the bucket op, timer and N1QL handlers
            4. Rebalance in 1 eventing node
            5. Collect logs if there's a mismatch

            Meanwhile, in case we still see a mismatch, I will work on a toy build with more logging / stats.

            Thanks!

            sujay.gad Sujay Gad added a comment - - edited

            Issue not observed in the latest run on 7.0.0-4955. Raised MB-45713 after that run.

            sujay.gad Sujay Gad made changes -
            Resolution Fixed [ 1 ]
            Status In Progress [ 3 ] Closed [ 6 ]
            sujay.gad Sujay Gad made changes -
            Link This issue relates to MB-45713 [ MB-45713 ]
            abhishek.jindal Abhishek Jindal made changes -
            Link This issue relates to MB-45713 [ MB-45713 ]
            sujay.gad Sujay Gad added a comment -

            Issue was observed again in the latest run with 7.0.0-4983. Logs attached below.

            https://cb-jira.s3.us-east-2.amazonaws.com/logs/volume-test-MB-44784/collectinfo-2021-04-20T143009-ns_1%40172.23.105.186.zip
            https://cb-jira.s3.us-east-2.amazonaws.com/logs/volume-test-MB-44784/collectinfo-2021-04-20T143009-ns_1%40172.23.105.190.zip
            https://cb-jira.s3.us-east-2.amazonaws.com/logs/volume-test-MB-44784/collectinfo-2021-04-20T143009-ns_1%40172.23.106.154.zip
            https://cb-jira.s3.us-east-2.amazonaws.com/logs/volume-test-MB-44784/collectinfo-2021-04-20T143009-ns_1%40172.23.106.242.zip
            https://cb-jira.s3.us-east-2.amazonaws.com/logs/volume-test-MB-44784/collectinfo-2021-04-20T143009-ns_1%40172.23.106.243.zip
            https://cb-jira.s3.us-east-2.amazonaws.com/logs/volume-test-MB-44784/collectinfo-2021-04-20T143009-ns_1%40172.23.106.253.zip
            https://cb-jira.s3.us-east-2.amazonaws.com/logs/volume-test-MB-44784/collectinfo-2021-04-20T143009-ns_1%40172.23.106.255.zip
            https://cb-jira.s3.us-east-2.amazonaws.com/logs/volume-test-MB-44784/collectinfo-2021-04-20T143009-ns_1%40172.23.107.89.zip
            https://cb-jira.s3.us-east-2.amazonaws.com/logs/volume-test-MB-44784/collectinfo-2021-04-20T143009-ns_1%40172.23.97.213.zip
            https://cb-jira.s3.us-east-2.amazonaws.com/logs/volume-test-MB-44784/collectinfo-2021-04-20T143009-ns_1%40172.23.97.214.zip
            sujay.gad Sujay Gad made changes -
            Resolution Fixed [ 1 ]
            Status Closed [ 6 ] Reopened [ 4 ]
            abhishek.jindal Abhishek Jindal added a comment - - edited

            Sujay Gad for which handler / bucket-scope-collection do you see a mismatch? Looking at the logs, the document count for the timer_op bucket matches across all nodes.

            sujay.gad Sujay Gad added a comment - - edited

            Mismatch is seen for the timer handler.

            Exception: No of docs in source and destination don't match: Source Bucket(timer_op.scope_0.coll1) : 5322632, Destination Bucket(['timer_op.scope_1.coll1']): 5322630
            

             

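            For context on these source/destination counts: the actual timers handler code is not attached to this ticket, but such a handler generally creates a timer for each mutation on the source collection and copies the document into the destination collection when the timer fires. A minimal sketch of that shape, assuming a destination bucket binding aliased "dst_bucket" and an illustrative 30-second delay (both assumptions, not taken from the test):

            // Hypothetical sketch only - not the actual timers0_0 handler used in this test.
            // Assumes a binding to timer_op.scope_1.coll1 aliased as "dst_bucket" with read/write access.
            function OnUpdate(doc, meta) {
                // Schedule a timer shortly after the mutation arrives.
                var fireAt = new Date();
                fireAt.setSeconds(fireAt.getSeconds() + 30); // assumed delay for illustration
                createTimer(CopyDoc, fireAt, meta.id, {"docId": meta.id, "doc": doc});
            }

            function CopyDoc(context) {
                // A timer that never fires, or a write that fails here, surfaces as a
                // source/destination document count mismatch like the one reported above.
                dst_bucket[context.docId] = context.doc;
            }

            With this shape, every document written to the source collection should eventually produce exactly one document in the destination collection, which is why the verification compares the two counts.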
            jeelan.poola Jeelan Poola made changes -
            Assignee Abhishek Jindal [ abhishek.jindal ] Vinayaka Kamath [ JIRAUSER25010 ]

            ankit.prabhu Ankit Prabhu added a comment -

            There are BUCKET_NOT_FOUND errors which resulted in a mismatch of events between the source and destination buckets.

            cbcollect_info_ns_1@172.23.105.186_20210420-143009/ns_server.eventing.log:404920:2021-04-20T04:24:22.432-07:00 [Info] eventing-consumer [worker_timers0_0_0:/tmp/127.0.0.1:8091_0_3996818802.sock:96210] [lcb,server L:1108 I:3945003376] <172.23.106.243:11210> (CTX=0x7fb8cc0bace0,memcached,SRV=0x7fb8d01caf70,IX=2) Got socket error LCB_ERR_SOCKET_SHUTDOWN (1025)
            cbcollect_info_ns_1@172.23.105.186_20210420-143009/ns_server.eventing.log:406840:2021-04-20T04:24:28.927-07:00 [Info] eventing-consumer [worker_timers0_0_0:/tmp/127.0.0.1:8091_0_3996818802.sock:96210] [lcb,server L:1108 I:4158424551] <172.23.106.243:11210> (CTX=0x7fb8c40cc410,memcached,SRV=0x7fb8d01de720,IX=2) Got socket error LCB_ERR_SOCKET_SHUTDOWN (1025)
            cbcollect_info_ns_1@172.23.105.186_20210420-143009/ns_server.eventing.log:406849:2021-04-20T04:24:28.949-07:00 [Info] eventing-consumer [worker_timers0_0_0:/tmp/127.0.0.1:8091_0_3996818802.sock:96210] [lcb,server L:961 I:4158424551] <NOHOST:NOPORT> (CTX=(nil),,SRV=0x7fb8d01de720,IX=2) Connection attempt failed. Received LCB_ERR_BUCKET_NOT_FOUND (210) from libcouchbase, received 0 from operating system
            

            Related to MB-44324

            jeelan.poola Jeelan Poola made changes -
            Link This issue depends on MB-44324 [ MB-44324 ]
            lynn.straus Lynn Straus made changes -
            Due Date 27/Apr/21
            jeelan.poola Jeelan Poola made changes -
            Assignee Vinayaka Kamath [ JIRAUSER25010 ] Abhishek Jindal [ abhishek.jindal ]
            jeelan.poola Jeelan Poola made changes -
            Resolution Fixed [ 1 ]
            Status Reopened [ 4 ] Resolved [ 5 ]

            vikas.chaudhary Vikas Chaudhary added a comment -

            Not seen on 7.0.0 build 5046

            vikas.chaudhary Vikas Chaudhary made changes -
            Status Resolved [ 5 ] Closed [ 6 ]
            sujay.gad Sujay Gad made changes -
            Resolution Fixed [ 1 ]
            Status Closed [ 6 ] Reopened [ 4 ]
            abhishek.jindal Abhishek Jindal made changes -
            Due Date 27/Apr/21 07/May/21

            avsej Sergey Avseyev added a comment -

            Could it be related to the recently found issue where the document gets written to the default collection when the original collection/scope has been dropped? Fixed as CCBC-1400.

            abhishek.jindal Abhishek Jindal made changes -
            Attachment timer_mismatch.png [ 139425 ]
            abhishek.jindal Abhishek Jindal made changes -
            Attachment metadata_items.png [ 139428 ]
            abhishek.jindal Abhishek Jindal made changes -
            Attachment metadata_items-1.png [ 139429 ]

            abhishek.jindal Abhishek Jindal added a comment -

            In this case I do not see the collection getting dropped, so I am pretty sure that CCBC-1400 is not the cause here.
            From the logs, the mismatch of 2 docs started somewhere between 2021-05-04T02:45:20 and 2021-05-04T03:05:20 on node 214, vbucket 289:

            Documents on the src collection = 5424, while on the dst collection = 5422.
            At this time vb 289 was owned by eventing node .186. Additionally, at 2021-05-04T02:45:20 two eventing nodes, .182 and .185, were being rebalanced out. This rebalance lasted a minute at most and ended by 2021-05-04T02:46:54.

            By 2021-05-04T02:45:58 we got STREAMEND for this vb. At this time seq no 7382 was the last sequence number sent from the Go producer side to the C++ consumer side:

            2021-05-04T02:45:58.400-07:00 [Info] Consumer::processDCPEvents [worker_timers0_0_0:/tmp/127.0.0.1:8091_0_3656683659.sock:1383] vb: 289 got STREAMEND
            2021-05-04T02:45:58.400-07:00 [Info] Consumer::sendVbFilterData [worker_timers0_0_0:/tmp/127.0.0.1:8091_0_3656683659.sock:1383] vb: 289 seqNo: 7382 sending filter data to C++
            

            The C++ side acknowledged that, out of 7382, seq no 7354 is the highest that has been processed:

            2021-05-04T02:45:58.498-07:00 [Info] eventing-consumer [worker_timers0_0_0:/tmp/127.0.0.1:8091_0_3656683659.sock:1383] vb: 289 seqNo: 7354 skip_ack: 0 sending filter ack to Go
            

            A filter is now set at 7354. The consumer notices that vb 289 should still be owned by the same consumer, so the DCP stream should be reclaimed. Hence, it opens a new stream:

            2021-05-04T02:46:00.818-07:00 [Info] Consumer::dcpRequestStreamHandle [worker_timers0_0_0:/tmp/127.0.0.1:8091_0_3656683659.sock:1383] vb: 289 DCP stream start vbKvAddr: 172.23.97.214:11210 vbuuid: 60405539196205 startSeq: 7354 snapshotStart: 7354 snapshotEnd: 7354 Manifest id: c
            

            This confirms no mutation has been dropped as part of filtering.
            The metadata collection for this handler suggests no timers are remaining to be fired. Note the numbers before and after the markers (see the attached metadata_items.png screenshots):

            The only possibility is that the test may be adding some documents that are duplicates.
            It would help if:

            • Before deploying the handler, "app_log_max_size" is set to a large value such as 500 MB (524288000 bytes) so that the application logs do not rotate.
            • A dummy handler is deployed which listens on the same source collection and just calls log("Doc created/updated", meta.id); in the handler code (a sketch of such a handler is shown after this list).
            • If we observe a mismatch by the end of the test, cbcollect + application logs from all eventing nodes will help narrow this down further.
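            A minimal sketch of the suggested dummy logging handler, assuming it is deployed against the same source collection with no bucket bindings, so its only side effect is one application log line per mutation (the OnDelete variant is an optional addition, not part of the request above):

            // Hypothetical diagnostic handler - performs no bucket or timer operations.
            function OnUpdate(doc, meta) {
                log("Doc created/updated", meta.id);
            }

            function OnDelete(meta) {
                log("Doc deleted", meta.id);
            }

            With "app_log_max_size" raised as suggested, the resulting application log can be diffed against the source collection to spot any mutation the handler never saw.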
            abhishek.jindal Abhishek Jindal made changes -
            Assignee Abhishek Jindal [ abhishek.jindal ] Sujay Gad [ JIRAUSER25279 ]
            abhishek.jindal Abhishek Jindal made changes -
            Due Date 07/May/21 14/May/21
            ritam.sharma Ritam Sharma added a comment -

            QE Update - Test is running with data requested by Dev -
            http://qa.sc.couchbase.com/job/component_systest_launcher_2/317/ - Running for 5 hours, not reproduced on - 7.0.0-5127

            raju Raju Suravarjjala made changes -
            Due Date 14/May/21 10/May/21
            ritam.sharma Ritam Sharma added a comment - - edited

            Update - Build - 7.0.0-5127 - Cannot reproduce the issue - http://qa.sc.couchbase.com/job/component_systest_launcher_2/317/consoleFull - Run log here - https://issues.couchbase.com/secure/attachment/139949/RunLogs.txt
            ritam.sharma Ritam Sharma made changes -
            Attachment RunLogs.txt [ 139949 ]
            ritam.sharma Ritam Sharma made changes -
            Assignee Sujay Gad [ JIRAUSER25279 ] Abhishek Jindal [ abhishek.jindal ]
            abhishek.jindal Abhishek Jindal made changes -
            Due Date 10/May/21 14/May/21

            abhishek.jindal Abhishek Jindal added a comment -

            Resolving this as a dup of MB-45967 as these share the same cause. Once the fix is in for MB-45967, I would request a rerun of the volume test for the timers handler to verify.

            abhishek.jindal Abhishek Jindal made changes -
            Link This issue relates to MB-45967 [ MB-45967 ]
            abhishek.jindal Abhishek Jindal made changes -
            Resolution Duplicate [ 3 ]
            Status Reopened [ 4 ] Resolved [ 5 ]

            mihir.kamdar Mihir Kamdar (Inactive) added a comment -

            Closing all non-fixed issues

            mihir.kamdar Mihir Kamdar (Inactive) made changes -
            Status Resolved [ 5 ] Closed [ 6 ]
            lynn.straus Lynn Straus made changes -
            Fix Version/s 7.0.0 [ 17233 ]
            lynn.straus Lynn Straus made changes -
            Fix Version/s Cheshire-Cat [ 15915 ]

            People

              Assignee: abhishek.jindal Abhishek Jindal
              Reporter: vikas.chaudhary Vikas Chaudhary
              Votes: 0
              Watchers: 13


                Gerrit Reviews

                  There are no open Gerrit changes
