  Couchbase Server / MB-46323

[System Test] Index rebalance took 23h to complete


Details

    Description

      Build : 7.0.0-5169
      Test : -test tests/2i/cheshirecat/test_idx_clusterops_cheshire_cat_recovery.yml -scope tests/2i/cheshirecat/scope_idx_cheshire_cat_dgm.yml
      Scale : 2
      Iteration : 1st

      This is the new GSI component test with more recovery steps. After the steady-state phase, a rebalance operation is started to add a new indexer node, 172.23.96.31, to the cluster. A few minutes into this rebalance, the indexer process on 172.23.97.77 is killed, and the rebalance fails as expected. The rebalance is automatically retried a couple of minutes later. The retried rebalance has now been hung for about 22 hours because one index is stuck in the Moving state.
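
      Since the failed rebalance is retried automatically, it may help to confirm the cluster's retry-rebalance settings. The snippet below is only a minimal sketch, assuming the /settings/retryRebalance REST endpoint (available in recent server versions) and the same orchestrator node and credentials used by the test console.

      import requests

      # Assumption: 172.23.104.16 is the orchestrator node the test console talks to,
      # and the credentials match those used in the console commands further below.
      resp = requests.get(
          "http://172.23.104.16:8091/settings/retryRebalance",
          auth=("Administrator", "password"),
          timeout=30,
      )
      resp.raise_for_status()

      # Typical payload (values illustrative): {"enabled": true, "afterTimePeriod": 300, "maxAttempts": 1}
      print(resp.json())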

      Details of the index stuck in the Moving state:

      {
               "bucket" : "bucket2",
               "collection" : "coll_9",
               "completion" : 100,
               "definition" : "CREATE INDEX `idx1_YXvO` ON `bucket2`.`scope_1`.`coll_9`(`country`,(distinct (array ((`r`.`ratings`).`Check in / front desk`) for `r` in `reviews` end)),array_count(`public_likes`),array_count(`reviews`) DESC,`type`,`phone`,`price`,`email`,`address`,`name`,`url`) WITH {  \"defer_build\":true, \"nodes\":[ \"172.23.96.30:8091\",\"172.23.97.77:8091\",\"172.23.97.82:8091\",\"172.23.97.83:8091\" ], \"num_replica\":2 }",
               "defnId" : 11843842764277554498,
               "hosts" : [
                  "172.23.96.30:8091",
                  "172.23.97.82:8091"
               ],
               "indexName" : "idx1_YXvO",
               "indexType" : "plasma",
               "instId" : 12561991181710981895,
               "lastScanTime" : "Sun May 16 13:06:50 PDT 2021",
               "name" : "idx1_YXvO",
               "numPartition" : 2,
               "numReplica" : 2,
               "partitionMap" : {
                  "172.23.96.30:8091" : [
                     0
                  ],
                  "172.23.97.82:8091" : [
                     0
                  ]
               },
               "partitioned" : false,
               "progress" : 100,
               "replicaId" : 0,
               "scheduled" : false,
               "scope" : "scope_1",
               "secExprs" : [
                  "`country`",
                  "(distinct (array ((`r`.`ratings`).`Check in / front desk`) for `r` in `reviews` end))",
                  "array_count(`public_likes`)",
                  "array_count(`reviews`)",
                  "`type`",
                  "`phone`",
                  "`price`",
                  "`email`",
                  "`address`",
                  "`name`",
                  "`url`"
               ],
               "stale" : false,
               "status" : "Moving"
            }
      
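      For reference, here is a minimal sketch of how one might scan a getIndexStatus payload (like the attached output) for indexes left in the Moving state. It assumes the indexer admin REST port 9102 serves /getIndexStatus and that the response wraps entries like the one above in a "status" array; the host, port, and credentials are assumptions to adjust for the cluster under test.

      import requests

      INDEXER_NODE = "172.23.96.30"          # any indexer node in the cluster (assumption)
      AUTH = ("Administrator", "password")   # assumption: same credentials as the test console

      resp = requests.get(
          f"http://{INDEXER_NODE}:9102/getIndexStatus",
          auth=AUTH,
          timeout=30,
      )
      resp.raise_for_status()

      # Report every index entry whose status is still "Moving".
      for idx in resp.json().get("status", []):
          if idx.get("status") == "Moving":
              print(
                  f'{idx.get("bucket")}.{idx.get("scope")}.{idx.get("collection")}.{idx.get("name")} '
                  f'instId={idx.get("instId")} hosts={idx.get("hosts")} completion={idx.get("completion")}%'
              )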

      The rebalance was initiated at 2021-05-15T17:36:26. The following is from the test console:

      [2021-05-15T17:36:26-07:00, sequoiatools/couchbase-cli:7.0:68fafa] server-add -c 172.23.104.16:8091 --server-add https://172.23.96.31 -u Administrator -p password --server-add-username Administrator --server-add-password password --services index
      [2021-05-15T17:36:36-07:00, sequoiatools/couchbase-cli:7.0:6951cf] rebalance -c 172.23.104.16:8091 -u Administrator -p password
      [2021-05-15T17:36:41-07:00, sequoiatools/cmd:e19b37] 60
      [2021-05-15T17:37:47-07:00, sequoiatools/cmd:622ca9] 300
      [pull] vijayviji/sshpass
      [2021-05-15T17:43:21-07:00, vijayviji/sshpass:fbd7e7] sshpass -p couchbase ssh -o StrictHostKeyChecking=no root@172.23.97.77 kill -SIGKILL $(pgrep indexer)
       
      Error occurred on container - sequoiatools/couchbase-cli:7.0:[rebalance -c 172.23.104.16:8091 -u Administrator -p password]
       
      docker logs 6951cf
      docker start 6951cf
       
      *Unable to display progress bar on this os
      ERROR: Rebalance failed. See logs for detailed reason. You can try again.
      [2021-05-15T17:43:26-07:00, sequoiatools/cmd:cb2101] 420
      [2021-05-15T17:50:32-07:00, appropriate/curl:e82955] -s -u Administrator:password 172.23.104.16:8091/pools/default/rebalanceProgress
      
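      Since the retried rebalance showed no visible progress for many hours, a simple watchdog around the same /pools/default/rebalanceProgress endpoint used above can make the hang easier to spot. This is only an illustrative sketch; the poll interval and stall threshold are arbitrary assumptions.

      import time
      import requests

      ORCHESTRATOR = "172.23.104.16"       # node used by the test console
      AUTH = ("Administrator", "password")
      POLL_INTERVAL = 60                   # seconds between polls (assumption)
      STALL_AFTER = 3600                   # flag a stall after 1h without change (assumption)

      last_progress, last_change = None, time.time()

      while True:
          resp = requests.get(
              f"http://{ORCHESTRATOR}:8091/pools/default/rebalanceProgress",
              auth=AUTH,
              timeout=30,
          )
          resp.raise_for_status()
          progress = resp.json()

          if progress.get("status") == "none":
              print("No rebalance is currently running.")
              break

          if progress != last_progress:
              last_progress, last_change = progress, time.time()
          elif time.time() - last_change > STALL_AFTER:
              print(f"Rebalance progress unchanged for over {STALL_AFTER}s: {progress}")
              break

          time.sleep(POLL_INTERVAL)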

      This issue could be similar to MB-46319, but the builds are different, and so are the tests.

      Indexer nodes in the cluster : 172.23.121.165, 172.23.96.30, 172.23.96.31, 172.23.97.77, 172.23.97.82, 172.23.97.83

      The latest getIndexStatus output is attached. Also, the logs are from ~2 AM on 5/16. Let me know if you need logs from before or after this time.

      Attachments

        1. cpu.svg
          144 kB
        2. profile001.svg
          134 kB

        Issue Links


          Activity

            People

              deepkaran.salooja Deepkaran Salooja
              mihir.kamdar Mihir Kamdar (Inactive)
              Votes: 0
              Watchers: 9

              Dates

                Created:
                Updated:
                Resolved:
