  Couchbase Server
  MB-50193

Scans failing with timeout and indexes not found with rebalance-in + indexer DGM


Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Duplicate
    • Affects Version: 7.1.0
    • Fix Version: 7.1.0
    • Component: secondary-index

    Description

      Build: 7.1.0-1960

      Test: Rebalance-in (min) 3->4, 1 bucket x 55M x 1KB, 100 indexes, 10K KV ops/sec, Scan Workload concurrency=128, Plasma, s=1 c=10 
      2021-12-21T16:09:42 [INFO] Average RR over all Indexes : 0.3339540786186384

      05:36:15 2021-12-21T16:09:06.663-08:00 [Error] [GsiScanClient:"172.23.110.55:9101"] Range(./opt/couchbase/bin/cbindexperf5002328) response failed `Index scan timed out`
      05:36:15 2021-12-21T16:09:06.663-08:00 [Error] PickRandom: Fail to find indexer for all index partitions. Num partition 3.  Partition with instances 2 
      05:36:15 2021-12-21T16:09:06.768-08:00 [Error] [GsiScanClient:"172.23.110.55:9101"] Range(./opt/couchbase/bin/cbindexperf5002336) response failed `Index scan timed out`
      05:36:15 2021-12-21T16:09:06.768-08:00 [Error] PickRandom: Fail to find indexer for all index partitions. Num partition 3.  Partition with instances 2 
      05:36:15 2021-12-21T16:09:06.967-08:00 [Error] [GsiScanClient:"172.23.110.55:9101"] Range(./opt/couchbase/bin/cbindexperf5002338) response failed `Index scan timed out`
      05:36:15 2021-12-21T16:09:06.967-08:00 [Error] PickRandom: Fail to find indexer for all index partitions. Num partition 3.  Partition with instances 1 
      05:36:15 2021-12-21T16:09:07.288-08:00 [Error] [GsiScanClient:"172.23.110.55:9101"] Range(./opt/couchbase/bin/cbindexperf5002356) response failed `Index scan timed out`
      05:36:15 2021-12-21T16:09:07.288-08:00 [Error] PickRandom: Fail to find indexer for all index partitions. Num partition 3.  Partition with instances 1 
      05:36:16 2021-12-21T16:09:07.814-08:00 [Error] [GsiScanClient:"172.23.110.55:9101"] Range(./opt/couchbase/bin/cbindexperf5002368) response failed `Index scan timed out`
      05:36:16 2021-12-21T16:09:07.814-08:00 [Error] PickRandom: Fail to find indexer for all index partitions. Num partition 3.  Partition with instances 1 
      05:36:16 2021-12-21T16:09:07.878-08:00 [Error] [GsiScanClient:"172.23.110.55:9101"] Range(./opt/couchbase/bin/cbindexperf5001816) response failed `Index scan timed out`
      05:36:16 2021-12-21T16:09:07.878-08:00 [Error] PickRandom: Fail to find indexer for all index partitions. Num partition 3.  Partition with instances 2  
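      The PickRandom errors indicate the scan client could not find a live indexer instance for every partition of the index (3 partitions, instances found for only 1 or 2), so the scan cannot be routed at all. The sketch below illustrates that kind of per-partition selection check; the data shapes, the function name, and the node addresses in the example are assumptions for illustration only, not the product's implementation:

```go
package main

import "fmt"

// pickIndexers tries to choose one indexer node per index partition; if any
// partition has no live instance, the whole pick fails, which is what the
// "Fail to find indexer for all index partitions" errors report above.
// All names and data shapes here are illustrative assumptions.
func pickIndexers(instances map[int][]string, numPartitions int) (map[int]string, error) {
	picked := make(map[int]string, numPartitions)
	for p := 0; p < numPartitions; p++ {
		nodes := instances[p]
		if len(nodes) == 0 {
			return nil, fmt.Errorf("fail to find indexer for all index partitions: num partition %d, partitions with instances %d",
				numPartitions, len(picked))
		}
		picked[p] = nodes[0] // a real picker would randomize / load-balance
	}
	return picked, nil
}

func main() {
	// Partition 2 has no live instance, so the pick fails, mirroring the log lines above.
	instances := map[int][]string{
		0: {"172.23.110.55:9101"},
		1: {"172.23.110.72:9101"},
		2: {},
	}
	if _, err := pickIndexers(instances, 3); err != nil {
		fmt.Println(err)
	}
}
```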

      Test job: http://perf.jenkins.couchbase.com/job/aether/1291/consoleFull 

      cbmonitor: http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=aether_710-1960_rebalance_571a 

      Another interesting observation is that there are more indexes after the rebalance than before. Indexes before rebalance:
      23:09:13 2021-12-21T09:41:53 [INFO] 172.23.110.72 : 215 Indexes
      23:09:13 2021-12-21T09:41:53 [INFO] 172.23.110.55 : 187 Indexes
      23:09:13 2021-12-21T09:41:53 [INFO] 172.23.110.56 : 198 Indexes

      05:36:50 2021-12-21T16:09:41 [INFO] Indexes after rebalance
      05:36:50 2021-12-21T16:09:41 [INFO] 172.23.110.72 : 177 Indexes
      05:36:50 2021-12-21T16:09:41 [INFO] 172.23.110.55 : 159 Indexes
      05:36:50 2021-12-21T16:09:41 [INFO] 172.23.110.56 : 156 Indexes
      05:36:50 2021-12-21T16:09:41 [INFO] 172.23.110.71 : 110 Indexes

       

      600 indexes before, 602 after.

      Attachments

        Issue Links


          Activity

            kevin.cherkauer Kevin Cherkauer added a comment -

            You can grep for the following message, tagged "Rebalancer::selectSmartToBuildTokensLOCKED:", to see if it did any replica repairs during the rebalance ("RR shared" + "RR" counts):

            "%v Selected %v tokens (%v stream shares, %v low load, %v high load):"+
            " RR shared %v, RR %v, Partn shared %v, Coll shared %v, Other %v"

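            As a rough illustration of that check, a small Go filter over an indexer log could look like the sketch below; the log path and the idea of matching on the tag plus "Selected" are assumptions for illustration, not taken from the product source:

```go
// A hedged sketch: filter indexer log lines that carry the smart token
// selection summary so the "RR shared" / "RR" counts can be read off.
package main

import (
	"bufio"
	"fmt"
	"log"
	"os"
	"strings"
)

func main() {
	f, err := os.Open("/opt/couchbase/var/lib/couchbase/logs/indexer.log") // assumed default log location
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	sc := bufio.NewScanner(f)
	sc.Buffer(make([]byte, 0, 1024*1024), 1024*1024) // indexer log lines can be long
	for sc.Scan() {
		line := sc.Text()
		if strings.Contains(line, "Rebalancer::selectSmartToBuildTokensLOCKED") &&
			strings.Contains(line, "Selected") {
			fmt.Println(line)
		}
	}
	if err := sc.Err(); err != nil {
		log.Fatal(err)
	}
}
```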

            yogendra.acharya Yogendra Acharya added a comment -

            From the indexer logs on node 110.55 we can see there are a lot of these messages:

            2021-12-21T09:56:14.232-08:00 [Warn] Indexer::MutationQueue Waiting for Node Alloc for 6000 Milliseconds KeyspaceId bucket-1 Vbucket 226 

            Vbuckets 226 and 771 show the node alloc wait times going all the way up to 300000 milliseconds, and as a result forced allocation happens:

            2021-12-21T16:14:09.691-08:00 [Warn] Indexer::MutationQueue Waiting for Node Alloc for 300000 Milliseconds KeyspaceId bucket-1 Vbucket 771
            2021-12-21T16:14:09.692-08:00 [Warn] Indexer::MutationQueue Max Wait Period for Node Alloc Expired 300001. Forcing Alloc. KeyspaceId bucket-1 Vbucket 771
            2021-12-21T16:14:09.692-08:00 [Info] DATP[->dataport ":9105"] DATP -> Indexer 94.295779% blocked
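            A rough sketch of the wait-then-force pattern these messages describe (fixed-interval retries up to a cap, after which the allocation is forced) is shown below; the function names, the intervals, and the allocator stub are illustrative assumptions, not the indexer's actual implementation:

```go
package main

import (
	"fmt"
	"time"
)

// tryAlloc stands in for the real queue-node allocation attempt; here it is a
// stub that always fails so the wait/force path is exercised.
func tryAlloc() bool { return false }

// allocWithBoundedWait retries an allocation at a fixed interval, reporting the
// cumulative wait, and gives up waiting once maxWait is exceeded, at which
// point the real system forces the allocation ("Forcing Alloc").
func allocWithBoundedWait(vbucket int, interval, maxWait time.Duration) {
	waited := time.Duration(0)
	for !tryAlloc() {
		if waited >= maxWait {
			fmt.Printf("Max Wait Period for Node Alloc Expired %v. Forcing Alloc. Vbucket %d\n", waited, vbucket)
			return // the real system would force the allocation here
		}
		time.Sleep(interval)
		waited += interval
		fmt.Printf("Waiting for Node Alloc for %v Vbucket %d\n", waited, vbucket)
	}
}

func main() {
	// Intervals scaled down for the example; the logs show 6000 ms steps up to 300000 ms.
	allocWithBoundedWait(226, 50*time.Millisecond, 200*time.Millisecond)
}
```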

            It appears this led to new mutations not getting snapshotted, which caused the session-consistent scan failures:

            2021-12-21T09:57:57.947-08:00 [Info] SCAN##3375943 Error in getRequestedIndexSnapshot Index scan timed out
            2021-12-21T09:57:57.947-08:00 [Info] SCAN##3375943::logSnapInfoAtTimeout Vbucket seqno in snapTs is lagging reqTs at timeout. InstId: 8488280716476285149, snapTs.Seqnos[23]: 124255, reqTs.Seqnos[23]: 124256
            2021-12-21T09:57:57.947-08:00 [Info] SCAN##3375943::logSnapInfoAtTimeout No snapshot has been generated since last 60 seconds, snapTs.Seqnos: [123501 123334...... 
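            In other words, the session-consistent scan needs a snapshot whose per-vbucket seqnos have caught up to the requested timestamp, and the snapshot request gives up after a deadline if no such snapshot appears. The sketch below is a simplified picture of that idea; all types and names are illustrative assumptions, not the indexer's API:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// covers reports whether the snapshot timestamp has caught up to the
// requested timestamp for every vbucket.
func covers(snapTs, reqTs []uint64) bool {
	for i := range reqTs {
		if snapTs[i] < reqTs[i] {
			return false // e.g. snapTs[23]=124255 lagging reqTs[23]=124256
		}
	}
	return true
}

// waitForSnapshot polls the latest snapshot until it covers reqTs or the
// scan timeout expires, mimicking the "Index scan timed out" behaviour.
func waitForSnapshot(latest func() []uint64, reqTs []uint64, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for {
		if covers(latest(), reqTs) {
			return nil
		}
		if time.Now().After(deadline) {
			return errors.New("Index scan timed out")
		}
		time.Sleep(10 * time.Millisecond)
	}
}

func main() {
	snap := []uint64{124255}
	req := []uint64{124256}
	err := waitForSnapshot(func() []uint64 { return snap }, req, 100*time.Millisecond)
	fmt.Println(err) // the snapshot never advances, so the scan times out
}
```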

            We need to see whether the lack of available free memory is the only cause of the issue or whether something else is at work here, as index builds started by the rebalance were in progress on the same bucket.

            yogendra.acharya Yogendra Acharya added a comment - edited

            We can see that for INIT_STREAM on keyspace bucket-1:scope-1:collection-2, initFlushTs is lagging mergeTs (MAINT_STREAM); because of this we force-commit a snapshot:

            2021-12-21T09:56:00.758-08:00 [Info] Timekeeper::checkInitStreamReadyToMerge Stream INIT_STREAM KeyspaceId bucket-1:scope-1:collection-2 Forcing commiting snapshot with mergeTs

            But even after this, the merge of INIT_STREAM and MAINT_STREAM cannot be seen; in fact, the timekeeper is now stuck in a loop checking whether the streams can be merged for this keyspace. This message appears every 5 seconds:

            2021-12-21T09:56:02.304-08:00 [Info] Timekeeper::generateNewStabilityTS INIT_STREAM bucket-1:scope-1:collection-2 Check pending stream merge.

            With the force commit done above, both MAINT_STREAM and INIT_STREAM have been marked flushEnabled = false in the timekeeper. We need to see whether this could cause the mutation queues to stall, and also to find out why the timekeeper is stuck in the loop.

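            One way to picture the loop described above is the toy model below: if flush stays disabled on both streams, initFlushTs can never advance to minMergeTs, so the periodic merge check keeps failing. This is only a hedged reading of the comment (whether disabled flush is in fact what prevents the catch-up is exactly the open question raised above); all names and the reduction to single counters are assumptions:

```go
package main

import "fmt"

// streamState is a toy stand-in for the timekeeper's per-keyspace state:
// a single flush timestamp, a single merge threshold, and a flush switch.
type streamState struct {
	initFlushTs  uint64
	minMergeTs   uint64
	flushEnabled bool
}

// tick advances the flush timestamp only while flushing is enabled.
func (s *streamState) tick() {
	if s.flushEnabled {
		s.initFlushTs++
	}
}

// readyToMerge models the merge condition: the INIT_STREAM flush timestamp
// must have caught up to the merge threshold.
func (s *streamState) readyToMerge() bool {
	return s.initFlushTs >= s.minMergeTs
}

func main() {
	s := &streamState{initFlushTs: 100, minMergeTs: 105, flushEnabled: false}
	for i := 0; i < 3; i++ { // stands in for the 5-second recheck loop
		s.tick()
		fmt.Printf("check %d: pending stream merge (initFlushTs=%d, minMergeTs=%d)\n",
			i, s.initFlushTs, s.minMergeTs)
	}
	fmt.Println("ready to merge:", s.readyToMerge()) // never becomes true with flush disabled
}
```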

            yogendra.acharya Yogendra Acharya added a comment -

            So initFlushTs is smaller than minMergeTs as well as the bucket sequence numbers from KV, which is why we are failing to merge the streams. minMergeTs comes from the KV sender as a result of ADD_INDX_LIST_TO_STREAM on the MAINT_STREAM. We will need to check the content of minMergeTs, but it is not printed anywhere. (Bucket sequence numbers being higher than initFlushTs is expected and not a problem.)

            I will try to add some logs and get a repro so that we can check what minMergeTs is, which should provide some clue about a possible fix.


            deepkaran.salooja Deepkaran Salooja added a comment -

            Yogendra Acharya, can you please check if it is the same issue as MB-50165?


            yogendra.acharya Yogendra Acharya added a comment -

            Deepkaran Salooja, yes, this is a duplicate of MB-50165. I do not have minMergeTs printed in the logs, but the other logs from checkInitStream show that the ready-to-merge check failed due to initFlushTs being smaller than minMergeTs, the forced snapshot, and both streams having flush disabled. Marking this as a duplicate.


            vikas.chaudhary Vikas Chaudhary added a comment -

            Not seeing this on 7.1.0-2021.


            People

              Assignee: vikas.chaudhary Vikas Chaudhary
              Reporter: vikas.chaudhary Vikas Chaudhary
              Votes: 0
              Watchers: 4

              Dates

                Created:
                Updated:
                Resolved:

                Gerrit Reviews

                  There are no open Gerrit changes
