  1. Couchbase Server
  2. MB-46323

[System Test] Index rebalance took 23h to complete

Details

    Description

      Build : 7.0.0-5169
      Test : -test tests/2i/cheshirecat/test_idx_clusterops_cheshire_cat_recovery.yml -scope tests/2i/cheshirecat/scope_idx_cheshire_cat_dgm.yml
      Scale : 2
      Iteration : 1st

      This is the new GSI component test, which has more recovery steps. After the steady-state phase, a rebalance is started to add a new indexer node, 172.23.96.31, to the cluster. A few minutes into this rebalance, the indexer process on 172.23.97.77 is killed, and the rebalance fails as expected. The rebalance is then retried automatically after a couple of minutes. The retried rebalance has now been hung for about 22 hours because one index is stuck in the Moving state.

      Details of the index stuck in the Moving state:

      {
               "bucket" : "bucket2",
               "collection" : "coll_9",
               "completion" : 100,
               "definition" : "CREATE INDEX `idx1_YXvO` ON `bucket2`.`scope_1`.`coll_9`(`country`,(distinct (array ((`r`.`ratings`).`Check in / front desk`) for `r` in `reviews` end)),array_count(`public_likes`),array_count(`reviews`) DESC,`type`,`phone`,`price`,`email`,`address`,`name`,`url`) WITH {  \"defer_build\":true, \"nodes\":[ \"172.23.96.30:8091\",\"172.23.97.77:8091\",\"172.23.97.82:8091\",\"172.23.97.83:8091\" ], \"num_replica\":2 }",
               "defnId" : 11843842764277554498,
               "hosts" : [
                  "172.23.96.30:8091",
                  "172.23.97.82:8091"
               ],
               "indexName" : "idx1_YXvO",
               "indexType" : "plasma",
               "instId" : 12561991181710981895,
               "lastScanTime" : "Sun May 16 13:06:50 PDT 2021",
               "name" : "idx1_YXvO",
               "numPartition" : 2,
               "numReplica" : 2,
               "partitionMap" : {
                  "172.23.96.30:8091" : [
                     0
                  ],
                  "172.23.97.82:8091" : [
                     0
                  ]
               },
               "partitioned" : false,
               "progress" : 100,
               "replicaId" : 0,
               "scheduled" : false,
               "scope" : "scope_1",
               "secExprs" : [
                  "`country`",
                  "(distinct (array ((`r`.`ratings`).`Check in / front desk`) for `r` in `reviews` end))",
                  "array_count(`public_likes`)",
                  "array_count(`reviews`)",
                  "`type`",
                  "`phone`",
                  "`price`",
                  "`email`",
                  "`address`",
                  "`name`",
                  "`url`"
               ],
               "stale" : false,
               "status" : "Moving"
            }
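To pick out entries like the one above from a full getIndexStatus dump, a small filter over the parsed JSON can help. This is only a minimal sketch: it assumes the per-index entries are available as a list of dicts shaped like the entry above, the function name `stuck_indexes` is hypothetical, and the second sample entry (`idx_hypothetical_ready`) is invented purely for illustration.

```python
import json


def stuck_indexes(index_statuses, stuck_states=("Moving",)):
    """Return (name, keyspace, hosts) for every entry whose status is in stuck_states."""
    found = []
    for entry in index_statuses:
        if entry.get("status") in stuck_states:
            # Build bucket.scope.collection from the fields seen in the report.
            keyspace = ".".join(
                entry.get(key, "?") for key in ("bucket", "scope", "collection")
            )
            found.append((entry.get("name"), keyspace, entry.get("hosts", [])))
    return found


# First entry: fields copied from the report (trimmed).
# Second entry: hypothetical, added only to show that non-Moving entries are skipped.
sample = json.loads("""
[
  {"name": "idx1_YXvO", "bucket": "bucket2", "scope": "scope_1",
   "collection": "coll_9", "status": "Moving", "progress": 100,
   "hosts": ["172.23.96.30:8091", "172.23.97.82:8091"]},
  {"name": "idx_hypothetical_ready", "bucket": "bucket2", "scope": "scope_1",
   "collection": "coll_9", "status": "Ready", "progress": 100,
   "hosts": ["172.23.97.83:8091"]}
]
""")

for name, keyspace, hosts in stuck_indexes(sample):
    print(f"{name} on {keyspace} is stuck; hosts: {', '.join(hosts)}")
```

Passing a different `stuck_states` tuple would let the same filter be reused for other statuses of interest.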
      

      The rebalance was initiated at 2021-05-15T17:36:26. The following is from the test console:

      [2021-05-15T17:36:26-07:00, sequoiatools/couchbase-cli:7.0:68fafa] server-add -c 172.23.104.16:8091 --server-add https://172.23.96.31 -u Administrator -p password --server-add-username Administrator --server-add-password password --services index
      [2021-05-15T17:36:36-07:00, sequoiatools/couchbase-cli:7.0:6951cf] rebalance -c 172.23.104.16:8091 -u Administrator -p password
      [2021-05-15T17:36:41-07:00, sequoiatools/cmd:e19b37] 60
      [2021-05-15T17:37:47-07:00, sequoiatools/cmd:622ca9] 300
      [pull] vijayviji/sshpass
      [2021-05-15T17:43:21-07:00, vijayviji/sshpass:fbd7e7] sshpass -p couchbase ssh -o StrictHostKeyChecking=no root@172.23.97.77 kill -SIGKILL $(pgrep indexer)
       
      Error occurred on container - sequoiatools/couchbase-cli:7.0:[rebalance -c 172.23.104.16:8091 -u Administrator -p password]
       
      docker logs 6951cf
      docker start 6951cf
       
      *Unable to display progress bar on this os
      ERROR: Rebalance failed. See logs for detailed reason. You can try again.
      [2021-05-15T17:43:26-07:00, sequoiatools/cmd:cb2101] 420
      [2021-05-15T17:50:32-07:00, appropriate/curl:e82955] -s -u Administrator:password 172.23.104.16:8091/pools/default/rebalanceProgress
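The timing of the steps in the console output above can be worked out mechanically. The following is a rough sketch, not part of the test framework: it assumes each timestamped console line has the form `[<ISO timestamp>, <container>] <command>`, and the helper names `parse_console` and `first_event` are hypothetical.

```python
import re
from datetime import datetime

# Matches lines such as "[2021-05-15T17:36:36-07:00, image:tag:id] command ..."
LINE_RE = re.compile(
    r"^\[(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}[+-]\d{2}:\d{2}), "
    r"(?P<container>[^\]]+)\] (?P<cmd>.*)$"
)


def parse_console(lines):
    """Return (timestamp, container, command) tuples; lines without a timestamp are skipped."""
    events = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            events.append(
                (datetime.fromisoformat(match["ts"]), match["container"], match["cmd"])
            )
    return events


def first_event(events, predicate):
    """Return the first event whose command satisfies predicate, or None."""
    return next((event for event in events if predicate(event[2])), None)


console = [
    "[2021-05-15T17:36:36-07:00, sequoiatools/couchbase-cli:7.0:6951cf] rebalance -c 172.23.104.16:8091 -u Administrator -p password",
    "[pull] vijayviji/sshpass",
    "[2021-05-15T17:43:21-07:00, vijayviji/sshpass:fbd7e7] sshpass -p couchbase ssh -o StrictHostKeyChecking=no root@172.23.97.77 kill -SIGKILL $(pgrep indexer)",
]

events = parse_console(console)
rebalance = first_event(events, lambda cmd: cmd.startswith("rebalance "))
kill = first_event(events, lambda cmd: "kill -SIGKILL" in cmd)
delta = kill[0] - rebalance[0]
print(f"indexer killed {delta} after the rebalance command was issued")
```

For the lines in this report, the gap between the rebalance command and the indexer kill comes out to 6 minutes 45 seconds.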
      

      This issue could be similar to MB-46319, but the builds are different, and so are the tests.

      Indexer nodes in the cluster : 172.23.121.165, 172.23.96.30, 172.23.96.31, 172.23.97.77, 172.23.97.82, 172.23.97.83

      The latest getIndexStatus output is attached. Also, the logs are from ~2 AM on 5/16. Let me know if you need logs from before or after this time.

      Attachments

        1. cpu.svg
          144 kB
        2. profile001.svg
          134 kB

            People

              deepkaran.salooja Deepkaran Salooja
              mihir.kamdar Mihir Kamdar (Inactive)