Couchbase Server
MB-31096

[System Test] Rebalance hung on index node while rebalancing cluster after adding a data node


Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version: 5.1.2
    • Fix Version: 5.1.2
    • Component: secondary-index
    • Labels: centos cluster

    Description

      Build : 5.1.2-6026
      Test : -test tests/integration/test_XattrsAllFeatures.yml -scope tests/integration/scope_XattrsReplicaIndex.yml
      Scale : 2
      Iteration : 1st

      The test has a step that adds a data node to the cluster and performs a rebalance. This rebalance operation is stuck on the indexer nodes and has made no progress for ~8 hours now.

      The cluster is live at http://172.23.108.103:8091 if you need to take a look.
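
      For reference, the add-node-and-rebalance step boils down to two ns_server REST calls on port 8091. The sketch below is illustrative only; the credentials are placeholders, not the actual test configuration:

          import requests

          CLUSTER = "http://172.23.108.103:8091"
          AUTH = ("Administrator", "password")  # placeholder credentials

          # Add the new data (kv) node to the cluster.
          requests.post(f"{CLUSTER}/controller/addNode", auth=AUTH,
                        data={"hostname": "172.23.98.135", "user": AUTH[0],
                              "password": AUTH[1], "services": "kv"}).raise_for_status()

          # Rebalance across all currently known nodes (no ejections).
          nodes = requests.get(f"{CLUSTER}/pools/default", auth=AUTH).json()["nodes"]
          known = ",".join(n["otpNode"] for n in nodes)
          requests.post(f"{CLUSTER}/controller/rebalance", auth=AUTH,
                        data={"knownNodes": known, "ejectedNodes": ""}).raise_for_status()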

      Attachments


        Activity

          The last rebalance, which added the data node 172.23.98.135, didn't do anything as the previous rebalance was already in progress.

           {log_entry,{1535,525023,220620},
                      'ns_1@172.23.98.135',ns_cluster,3,
                      [<<"Node ns_1@172.23.98.135 joined cluster">>],
                      [],info,
                      {{2018,8,28},{23,43,43}}},
           {log_entry,{1535,525033,418257},
                      'ns_1@172.23.108.103',ns_orchestrator,3,
                      [<<"Not rebalancing because rebalance is already in progress.~n">>],
                      [],info,
                      {{2018,8,28},{23,43,53}}},
          

          The previous rebalance, which ejected the indexer node 172.23.99.21, is still running.

           {log_entry,{1535,523583,895668},
                      'ns_1@172.23.108.103',ns_orchestrator,4,
                      [<<"Starting rebalance, KeepNodes = ['ns_1@172.23.106.188','ns_1@172.23.108.103',\n                                 'ns_1@172.23.108.104','ns_1@172.23.96.145',\n 'ns_1@172.23.96.148','ns_1@172.23.96.168',\n                               'ns_1@172.23.96.56','ns_1@172.23.97.238',\n'ns_1@172.23.97.239','ns_1@172.23.97.242',\n                                 'ns_1@172.23.99.25'], EjectNodes = ['ns_1@172.23.99.21'], Failed over and being ejected nodes = []; no delta recovery nodes\n">>],
                      [],info,
                      {{2018,8,28},{23,19,43}}},
          

          On node 172.23.96.56, we see that the rebalancer is waiting for the index build to finish.

          2018-08-29T06:53:53.028-07:00 [Info] Rebalancer::waitForIndexBuild Index default:default_result: State INDEX_STATE_ACTIVE Pending 6.203065e+06 EstTime 1192
          2018-08-29T06:53:53.028-07:00 [Info] Rebalancer::waitForIndexBuild Index default:default_claims (replica 2): State INDEX_STATE_ACTIVE Pending 6.203065e+06 EstTime 1192
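
          The same backlog can also be read directly from the indexer, independent of the rebalancer log. A rough sketch below, assuming the default indexer HTTP stats port (9102), placeholder credentials, and stats keys of the form <bucket>:<index>:num_docs_pending:

            import requests

            # Print the pending mutation count for every index on this indexer node.
            stats = requests.get("http://172.23.96.56:9102/stats",
                                 auth=("Administrator", "password")).json()
            for key, value in stats.items():
                if key.endswith(":num_docs_pending"):
                    print(key, value)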
          

          There are ~6M pending items, and roughly 2.5k sets per second are happening on this bucket.

          The CPU on 172.23.99.21 is maxed out (see ). There are large scans running on this node (roughly selecting > 1M items per second). The index drain rate is hovering between 0 and 20k per second. With the CPU constantly saturated, there is very little chance of the indexer catching up to the 6M pending mutations.
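
          As a rough sanity check on these numbers (illustrative arithmetic only; the midpoint drain rate is an assumption):

            pending  = 6.203065e6   # mutations the rebalancer is waiting on
            est_time = 1192         # rebalancer's EstTime, in seconds
            print(pending / est_time)   # ~5.2k mutations/sec drain assumed by that estimate

            incoming = 2.5e3        # new sets/sec arriving on the bucket
            drain    = 10e3         # midpoint of the observed 0 - 20k/sec drain rate
            print(pending / (drain - incoming))   # ~830 s at the midpoint; the backlog
                                                  # grows whenever drain drops below 2.5k/sec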

          We should reduce the scan range of the queries (selecting > 1M rows per second is going to consume a lot of CPU).

          Also, the system test should let the previous rebalance finish before it triggers the next one.
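
          For the test-side fix, something along these lines (a sketch only, with placeholder credentials) would block until the in-flight rebalance completes before the next one is issued, by polling the cluster tasks endpoint:

            import time
            import requests

            CLUSTER = "http://172.23.108.103:8091"
            AUTH = ("Administrator", "password")  # placeholder credentials

            def wait_for_rebalance(timeout_s=8 * 3600, poll_s=30):
                """Poll /pools/default/tasks until no rebalance task is running."""
                deadline = time.time() + timeout_s
                while time.time() < deadline:
                    tasks = requests.get(f"{CLUSTER}/pools/default/tasks", auth=AUTH).json()
                    reb = next((t for t in tasks if t.get("type") == "rebalance"), None)
                    if reb is None or reb.get("status") == "notRunning":
                        return
                    time.sleep(poll_s)
                raise TimeoutError("previous rebalance still running")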

          deepkaran.salooja Deepkaran Salooja added a comment

          The test was indeed faulty. Modified the longevity test that we had used in Vulcan and removed the Vulcan-specific features. The test is running fine now. Closing this bug.

          mihir.kamdar Mihir Kamdar (Inactive) added a comment

          People

            Assignee: Mihir Kamdar (Inactive)
            Reporter: Mihir Kamdar (Inactive)
            Votes: 0
            Watchers: 4

