Couchbase Server / MB-53115

Index rollbackAllToZero messages seen


Details


    Description

      Steps to reproduce:

      1. Create 4 buckets 
      2. Create indexes with replicas on each of the 4 buckets.
      3. Run pillowfight to continuously load data (the buckets have 1M, 1M, 1M, and 3M items). Keep loading until the bucket resident ratio (RR) is under 10%.
      4. Run a shell script that runs the request_plus scans continuously.
      5. Run stress-ng with these parameters (adjust the --vm-bytes value to the VM's resources; see also the consolidated sketch after these steps): stress-ng --vm 4 --vm-bytes 1G --metrics-brief --vm-keep --vm-locked -m 4 --aggressive --vm-populate
      6. Once enough stress-ng processes are running, the OOM killer will kick in. This can be verified by checking dmesg ( dmesg -T | egrep -i 'killed process' ).
      7. Because the OOM killer chooses its victim based on oom_score_adj, there is a possibility that the stress-ng processes themselves get spawned and killed instead. To make sure that memcached gets killed, run:

      echo 1000 > /proc/<memcached PID>/oom_score_adj 

      8. Observe that the scans are timing out and that the index has rolled back to 0.
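
      For convenience, a minimal shell sketch consolidating steps 5-7 is shown below. It assumes a single data node and uses pgrep to look up the memcached PID purely for illustration; adjust --vm-bytes to the VM's resources.

      $ # Generate memory pressure (step 5)
      $ stress-ng --vm 4 --vm-bytes 1G --metrics-brief --vm-keep --vm-locked -m 4 --aggressive --vm-populate &
      $ # Bias the OOM killer towards memcached so stress-ng itself is not picked as the victim (step 7)
      $ echo 1000 > /proc/$(pgrep -x memcached)/oom_score_adj
      $ # Confirm the OOM kill happened (step 6)
      $ dmesg -T | egrep -i 'killed process'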

       

      Logs and pcaps are attached. Two instances of rollback were observed. Please use these timestamps for analysis:

      Instance 1

       

      Index node1 ( 172.23.106.159) 
      2022-07-26T03:26:54.738-07:00 [Info] StorageMgr::rollbackAllToZero MAINT_STREAM test8
       
      Index node2 ( 172.23.106.163) 
       
      2022-07-26T03:26:58.186-07:00 [Info] StorageMgr::rollbackAllToZero MAINT_STREAM test8

      Instance 2

       

       

      Index node1 ( 172.23.106.159) 
       
      2022-07-26T05:06:12.658-07:00 [Info] StorageMgr::rollbackAllToZero MAINT_STREAM test8
       
       Index node2 ( 172.23.106.163) 
       
      2022-07-26T05:06:10.805-07:00 [Info] StorageMgr::rollbackAllToZero MAINT_STREAM test8


      Log bundles -> 

      s3://cb-customers-secure/cbse12279oomkil2/2022-07-26/collectinfo-2022-07-26t125548-ns_1@172.23.105.36.zip
      s3://cb-customers-secure/cbse12279oomkil2/2022-07-26/collectinfo-2022-07-26t125548-ns_1@172.23.105.37.zip
      s3://cb-customers-secure/cbse12279oomkil2/2022-07-26/collectinfo-2022-07-26t125548-ns_1@172.23.106.156.zip
      s3://cb-customers-secure/cbse12279oomkil2/2022-07-26/collectinfo-2022-07-26t125548-ns_1@172.23.106.159.zip
      s3://cb-customers-secure/cbse12279oomkil2/2022-07-26/collectinfo-2022-07-26t125548-ns_1@172.23.106.163.zip
      s3://cb-customers-secure/cbse12279oomkil2/2022-07-26/collectinfo-2022-07-26t125548-ns_1@172.23.106.204.zip


      I have the packet capture files too and can attach them if necessary. 

      Attachments


        Activity

          dfinlay Dave Finlay added a comment -

          I think the case where the bucket got flushed and Indexer had to roll back to zero is completely fine.

          This is when the bucket flush starts:

          [ns_server:debug,2022-07-26T03:26:48.768-07:00,ns_1@172.23.105.36:ns_audit<0.28196.0>:ns_audit:handle_call:146]Audit flush_bucket: [{remote,{[{ip,<<"10.100.195.195">>},{port,49358}]}},
                               {sessionid,<<"634890a73e552079485242dc299216a91c34cdda">>},
                               {real_userid,{[{domain,builtin},
                                              {user,<<"<ud>Administrator</ud>">>}]}},
                               {timestamp,<<"2022-07-26T03:26:48.768-07:00">>},
                               {bucket_name,<<"test8">>}]
          

          This is when it finishes:

          [ns_server:debug,2022-07-26T03:26:48.768-07:00,ns_1@172.23.105.36:ns_audit<0.28196.0>:ns_audit:handle_call:146]Audit flush_bucket: [{remote,{[{ip,<<"10.100.195.195">>},{port,49358}]}},
                               {sessionid,<<"634890a73e552079485242dc299216a91c34cdda">>},
                               {real_userid,{[{domain,builtin},
                                              {user,<<"<ud>Administrator</ud>">>}]}},
                               {timestamp,<<"2022-07-26T03:26:48.768-07:00">>},
                               {bucket_name,<<"test8">>}]
          

          Flush empties the bucket by deleting the vbuckets, which results in the creation of new vbuuids. It's not surprising that Indexing rolls back to zero a few seconds after the bucket flush finishes:

          Index node1 ( 172.23.106.159) 
          2022-07-26T03:26:54.738-07:00 [Info] StorageMgr::rollbackAllToZero MAINT_STREAM test8
          
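          As an aside, the new vbuuid created by the flush can be spot-checked with cbstats vbucket-details. This is a hedged sketch: the exact auth flags depend on the cbstats version, and vb 264 is just an example vbucket taken from the DCP logs later in this ticket.

          $ # The uuid field should change across a bucket flush
          $ /opt/couchbase/bin/cbstats localhost:11210 -u Administrator -p password -b test8 vbucket-details 264 | grep -i uuid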

          The second case at 5:06 is more interesting as it is the result of a failover induced by an OOM kill.

          OOM kill:

          2022-07-26T05:02:58.678-07:00, ns_memcached:0:info:message(ns_1@172.23.105.36) - Control connection to memcached on 'ns_1@172.23.105.36' disconnected. Check logs for details.
          2022-07-26T05:03:00.548-07:00, ns_log:0:info:message(ns_1@172.23.105.36) - Service 'memcached' exited with status 137. Restarting. Messages:
          

          Failover:

          2022-07-26T05:05:02.810-07:00, ns_orchestrator:0:info:message(ns_1@172.23.105.36) - Starting failover of nodes ['ns_1@172.23.105.36']. Operation Id = 2139917b8f5387cec83fac633b7e7b5a
          2022-07-26T05:05:05.290-07:00, ns_memcached:0:info:message(ns_1@172.23.105.36) - Shutting down bucket "test8" on 'ns_1@172.23.105.36' for deletion
          2022-07-26T05:05:07.683-07:00, ns_memcached:0:info:message(ns_1@172.23.105.36) - Bucket "test6" loaded on node 'ns_1@172.23.105.36' in 124 seconds.
          2022-07-26T05:05:08.608-07:00, failover:0:info:message(ns_1@172.23.105.36) - Failed over ['ns_1@172.23.105.36']: ok
          2022-07-26T05:05:08.651-07:00, failover:0:info:message(ns_1@172.23.105.36) - Deactivating failed over nodes ['ns_1@172.23.105.36']
          2022-07-26T05:05:08.690-07:00, auto_failover:0:info:message(ns_1@172.23.105.36) - Node ('ns_1@172.23.105.36') was automatically failed over. Reason: The data service did not respond for the duration of the auto-failover threshold. Either none of the buckets have warmed up or there is an issue with the data service.
          

          Rollback to zero:

          Index node1 ( 172.23.106.159) 
          2022-07-26T05:06:12.658-07:00 [Info] StorageMgr::rollbackAllToZero MAINT_STREAM test8
          

          CPU is high on all 3 of the memcached nodes for the last 10 minutes before the failover – however it doesn't seem to be that high prior to that.

          That said, I see quite a lot of slow scheduling for the DCP connection notifier task on the non-IO thread workers during this time.

          $ grep "T0[345].*test8.*Slow scheduling for NON_IO task .DCP connection" memcached.log | grep -o 2022-..-..T..:. | uniq -c
            50 2022-07-26T03:0
            13 2022-07-26T03:1
            14 2022-07-26T03:2
            87 2022-07-26T03:3
            19 2022-07-26T03:4
            65 2022-07-26T03:5
            79 2022-07-26T04:0
            41 2022-07-26T04:1
            13 2022-07-26T04:2
            22 2022-07-26T04:3
            45 2022-07-26T04:4
            27 2022-07-26T04:5
            33 2022-07-26T05:0
          

          Some of these waits are pretty lengthy. Dave - is this evidence for KV being slow to get mutations to indexing? Is there a reason for the slowness?
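
          One way to dig further into these waits would be to dump the per-task scheduling-wait histograms and look at the DCP connection notifier entries. A hedged sketch, assuming the scheduler stat group is available on this build (task and key names may differ by version):

          $ /opt/couchbase/bin/cbstats localhost:11210 -u Administrator -p password -b test8 scheduler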

          drigby Dave Rigby added a comment -

          From DaveF:

          Dave - is this evidence for KV being slow to get mutations to indexing? Is there a reason for the slowness?

          Looking at mortimer for the flush around 03:26, we see that test8 has a bucket flush at [1], followed by the bucket being repopulated with data. Nodes 106.37 and 106.156 hit the high watermark and successfully free memory, dropping back to the low watermark and continuing to oscillate as more data is added, but node 105.36 exceeds the high watermark and never comes back down below it:

          Looking at a breakdown of memory, we see large amounts in CheckpointManager, but not in the disk queue - i.e. the bulk of the memory is outstanding items for DCP stream(s):

          Looking at item counts, we see both replication and GSI have a large number of items remaining:

          Essentially the workload is generating more mutations than KV-Engine can transmit via DCP to the replicas and indexing nodes; hence a backlog forms which consumes most of the bucket memory. To attempt to alleviate that, KV will switch from in-memory streaming to disk backfill, which results in larger snapshots being sent:

          2022-07-26T03:32:46.494248-07:00 INFO 64: (test8) DCP (Producer) eq_dcpq:secidx:proj-test8-MAINT_STREAM_TOPIC_d44295ce804e3e6e97cc718dd4f42cba-3631624410218667233/0 - (vb:264) Handling slow stream; state_ : in-memory, lastReadSeqno : 5279, lastSentSeqno : 3452, vBucketHighSeqno : 24649, isBackfillTaskRunning : False
          2022-07-26T04:03:48.467911-07:00 INFO 64: (test8) DCP (Producer) eq_dcpq:secidx:proj-test8-MAINT_STREAM_TOPIC_d44295ce804e3e6e97cc718dd4f42cba-3631624410218667233/0 - (vb:264) ActiveStream::transitionState: Transitioning from in-memory to backfilling
          2022-07-26T04:03:48.472522-07:00 INFO 64: (test8) DCP (Producer) eq_dcpq:secidx:proj-test8-MAINT_STREAM_TOPIC_d44295ce804e3e6e97cc718dd4f42cba-3631624410218667233/0 - (vb:264) ActiveStream::scheduleBackfill_UNLOCKED register cursor with name "eq_dcpq:secidx:proj-test8-MAINT_STREAM_TOPIC_d44295ce804e3e6e97cc718dd4f42cba-3631624410218667233/0" backfill:true, seqno:5873
          2022-07-26T04:03:48.474445-07:00 INFO 64: (test8) DCP (Producer) eq_dcpq:secidx:proj-test8-MAINT_STREAM_TOPIC_d44295ce804e3e6e97cc718dd4f42cba-3631624410218667233/0 - (vb:264) Scheduling backfill from 5280 to 5872, reschedule flag : True
          
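          To quantify the backlog per DCP connection, a hedged sketch (assuming the dcp stat group is available; the grep keys are approximate and may differ by version):

          $ # Per-stream backlog and backfill state for the indexer and replication producers
          $ /opt/couchbase/bin/cbstats localhost:11210 -u Administrator -p password -b test8 dcp | egrep 'items_remaining|backfill'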

          Note that we have improved this behaviour in later versions, most notably in 7.1, by limiting the amount of memory which can be used by CheckpointManager (for holding outstanding DCP & disk mutations). That wouldn't avoid the switch to backfilling, but it would keep more memory free in the bucket for normal operations and apply backpressure to the client sooner, telling them we are temporarily out of memory.
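
          Relatedly, a hedged way to watch how much bucket memory CheckpointManager is holding (assuming the checkpoint stat group is exposed on this build; key names may vary):

          $ # Per-vbucket checkpoint memory usage
          $ /opt/couchbase/bin/cbstats localhost:11210 -u Administrator -p password -b test8 checkpoint | grep -i mem_usage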

          The later OOM-kill around 05:02 is essentially the same: the rate of ingress mutations is greater than the environment can egress to replication and the indexer, resulting in large disk backfill snapshots being sent to indexing.

          Note this setup only has:

          • One AuxIO thread, which in 6.6.2 is responsible for running the tasks that process in-memory Checkpoints into DCP messages (for replicas and indexing, across all buckets) as well as all disk backfills.
          • Two NonIO threads, which run the ItemCompressor tasks (active compression), item paging, and the notification of DCP connections that data is ready.

          Essentially the system is not well-sized for the workload being run on it, and things are running slow / queues are backing up etc.
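
          For reference, a hedged way to confirm the thread counts in question (assuming the workload stat group is present on this build; the key names are approximate):

          $ # Reader/writer/AuxIO/NonIO thread counts as seen by ep-engine
          $ /opt/couchbase/bin/cbstats localhost:11210 -u Administrator -p password -b test8 workload | egrep 'num_auxio|num_nonio'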


          prachi.sikchi Prachi Sikchi added a comment -

          Varun Velamuri Pavan PB Since this is marked for 7.1.2, can you please also update the due date on this? Thanks!


          varun.velamuri Varun Velamuri added a comment -

          I don't think there is anything wrong with the indexer in this case. Removing the 7.1.2 fix version for now.


          varun.velamuri Varun Velamuri added a comment -

          Pavan PB - Resolving this as not-a-bug, as it turned out to be a sizing issue (see the comment from Dave Rigby above).


          People

            Assignee:
            varun.velamuri Varun Velamuri
            Reporter:
            pavan.pb Pavan PB
            Votes:
            0
            Watchers:
            11

