Couchbase Server / MB-24242

Rebalance does not fail immediately when there are node failures

Details

    Description

      1. Create a cluster with a few nodes and buckets.
      2. Start a rebalance after either adding a server to the cluster or removing one from it.
      3. While the rebalance is in progress, inject a failure into one of the nodes, for example a network failure (stopping the network on the node or enabling a firewall on it) or stopping memcached on the node.

      We detect the failure on the node immediately. However, the rebalance does not stop even when one or more nodes have failed; it appears stuck at whatever percentage it had reached before the node failure. Since we detect a node failure quickly, we should stop the rebalance immediately or after a fixed amount of time. Because the rebalance is not stopped immediately, autofailover of the failed node is delayed until the rebalance has exited or has been stopped. This causes more downtime than users would want.
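      The "stuck at whatever percentage" symptom can be checked mechanically from a series of progress readings. The following is only a minimal sketch, not part of Couchbase Server: the function name `is_rebalance_stuck`, the sample format, and the `stall_seconds` threshold are all hypothetical, and how the readings are collected is left out.

```python
from typing import Sequence, Tuple


def is_rebalance_stuck(
    samples: Sequence[Tuple[float, float]],
    stall_seconds: float,
    epsilon: float = 1e-6,
) -> bool:
    """Return True if progress has not changed for at least stall_seconds.

    samples: (timestamp_seconds, progress_percent) pairs, oldest first.
    This is a hypothetical helper for observing the symptom described
    above; it does not talk to a cluster.
    """
    if not samples:
        return False

    last_time, last_progress = samples[-1]

    # Walk backwards to find the earliest sample that reports the same
    # progress as the most recent one.
    stalled_since = last_time
    for timestamp, progress in reversed(samples):
        if abs(progress - last_progress) > epsilon:
            break
        stalled_since = timestamp

    return last_time - stalled_since >= stall_seconds
```

      For example, readings of 35% at 20 s, 30 s, 40 s and 50 s would count as stuck with a 30-second threshold, while a series that is still increasing would not.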

      Note that we do stop the rebalance immediately for node failures such as stopping Couchbase Server or killing ns_server, but when the failure is network-related, the rebalance fails only after a long wait.
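      The behaviour requested here, stopping the rebalance immediately or after a fixed amount of time regardless of the kind of failure, can be sketched as a small decision function. This is an illustration under stated assumptions, not the ns_server implementation: `Action`, `decide`, and `grace_seconds` are hypothetical names, and a grace period of 0 models the "stop immediately" case.

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    CONTINUE = "continue"
    STOP_REBALANCE = "stop_rebalance"


def decide(
    node_down_since: Optional[float],
    now: float,
    grace_seconds: float,
) -> Action:
    """Decide whether a running rebalance should be stopped.

    node_down_since: time (seconds) at which a node failure was first
    detected, or None if all nodes are healthy. Hypothetical input; the
    failure type (process or network) is deliberately not considered, so
    both are handled with the same timeout.
    """
    if node_down_since is None:
        return Action.CONTINUE

    # Stop once the failure has persisted for the whole grace period.
    if now - node_down_since >= grace_seconds:
        return Action.STOP_REBALANCE

    return Action.CONTINUE
```

      Once the rebalance has been stopped this way, autofailover of the failed node is no longer blocked by it.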

            People

              Balakumaran.Gopal Balakumaran Gopal
              bharath.gp Bharath G P
              Votes: 0
              Watchers: 8
