Couchbase Server / MB-24242

Rebalance does not fail immediately when there are node failures

Details

    Description

      1. Create a cluster with a few nodes and buckets.
      2. Start a rebalance on the cluster after adding a server to or removing a server from the cluster.
      3. While the rebalance is in progress, inject a failure into one of the nodes (for example, a network failure such as stopping the network or enabling a firewall on the node, or stopping memcached on the node).

      The failure on the node is detected immediately. However, the rebalance does not stop when one or more nodes fail; it appears stuck at whatever percentage it had reached before the failure. Since node failures are detected quickly, the rebalance should be stopped immediately or after a bounded amount of time. Because it is not, auto-failover of the failed node is delayed until the rebalance exits or is stopped, which causes more downtime than users would want.

      Note that the rebalance does stop immediately for node failures such as stopping Couchbase Server or killing ns_server, but when the failure is network-related, the rebalance fails only after a long wait.
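
The "stuck at whatever percentage" symptom described above can be checked for mechanically. The following is a minimal sketch, not ns_server's logic: it assumes progress samples have already been collected as (timestamp in seconds, percent complete) pairs, for example by polling the cluster, and both the function name `is_rebalance_stuck` and the `stall_seconds` threshold are hypothetical.

```python
from typing import List, Tuple


def is_rebalance_stuck(samples: List[Tuple[float, float]], stall_seconds: float) -> bool:
    """Return True if reported progress has not changed for at least stall_seconds.

    samples: (timestamp_seconds, percent_complete) pairs in chronological order.
    """
    if not samples:
        return False
    last_time, last_percent = samples[-1]
    # Walk backwards to find the earliest sample in the trailing run
    # that reports the same percentage as the latest sample.
    stalled_since = last_time
    for timestamp, percent in reversed(samples):
        if percent != last_percent:
            break
        stalled_since = timestamp
    return (last_time - stalled_since) >= stall_seconds
```

A monitoring script could call this after each poll and stop the rebalance itself once the function returns True; the right threshold depends on how long a healthy rebalance can legitimately report no progress.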

          Activity

            Abhijeeth Nuthan added a comment:
            The documents above have been updated and all the server-side changes are done. Passing it over to the UI team to add the canAbortRebalance setting to the auto-failover settings page.
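
For illustration, here is a sketch of how a client might build a request that turns this setting on. Only the `canAbortRebalance` name comes from this ticket. The `/settings/autoFailover` path, port 8091, and the `enabled` and `timeout` parameters are assumptions based on the Couchbase REST API and should be checked against the documentation for the build in use. The helper name is hypothetical, and authentication is deliberately left out.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_auto_failover_request(host: str, timeout_s: int, can_abort_rebalance: bool) -> Request:
    """Build (but do not send) a POST request for the auto-failover settings."""
    body = urlencode({
        "enabled": "true",
        "timeout": str(timeout_s),
        # Setting name taken from this ticket.
        "canAbortRebalance": "true" if can_abort_rebalance else "false",
    })
    # Assumed endpoint and port; verify against the REST API documentation.
    url = f"http://{host}:8091/settings/autoFailover"
    return Request(url, data=body.encode("ascii"), method="POST")
```

The request object can be passed to `urllib.request.urlopen` once credentials are attached; it is built separately here so that the parameters can be inspected without a running cluster.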

            Couchbase Build Team added a comment:
            Build couchbase-server-6.5.0-1057 contains ns_server commit 93b00af with commit message:
            MB-24242: Feature for auto-failover aborting ...

            Abhijeeth Nuthan added a comment:
            The solution on the ns_server side is incomplete. We discovered that, at the beginning of delta recovery, we wait for buckets to be warmed up in memcached. However, the janitor cleanup that marks a bucket as warmed up runs only when the rebalance of that bucket starts. Until the bucket is warmed up, client traffic is not enabled on it and the bucket is not ready. On a cluster with multiple buckets, the auto-failover logic therefore considers the node being recovered to be down until the rebalance of the last bucket starts and that bucket becomes ready.

            The issue is easy to reproduce on a cluster with more than one bucket. Setup: a 3-node cluster with auto-failover enabled and a 5s timeout.

            1. Load all 3 sample buckets.
            2. Hard/graceful failover nodeA.
            3. Mark nodeA for delta recovery.
            4. Start rebalance.
            5. Auto-failover of nodeA will abort the delta recovery of nodeA. This is undesired.
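
The timing problem in this comment can be illustrated with a toy model. This is a sketch under stated assumptions, not ns_server code: buckets are rebalanced one after another, a bucket becomes ready on the recovering node only when its own rebalance starts, and the node counts as down until every bucket is ready. The function names and the per-bucket durations are hypothetical.

```python
from typing import List


def time_until_node_ready(bucket_rebalance_seconds: List[float]) -> float:
    """Seconds from rebalance start until the last bucket's rebalance starts."""
    # The last bucket starts only after all earlier buckets have finished,
    # so its own duration does not contribute to the wait.
    return sum(bucket_rebalance_seconds[:-1])


def auto_failover_would_trigger(bucket_rebalance_seconds: List[float],
                                timeout_seconds: float) -> bool:
    """True if the node stays 'down' for longer than the auto-failover timeout."""
    return time_until_node_ready(bucket_rebalance_seconds) > timeout_seconds
```

With a single bucket the wait is zero, so the model never trips the timeout. With several buckets and a 5s timeout, any realistic duration for the earlier buckets exceeds it, which matches the observation that the issue needs more than one bucket to reproduce.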

            Couchbase Build Team added a comment:
            Build couchbase-server-6.5.0-1467 contains ns_server commit 15388ae with commit message:
            MB-24242, MB-31366: Set relevant vBuckets to ...

            Balakumaran Gopal added a comment:
            Reran the following test on Enterprise Edition 6.5.0 build 4890 (IPv4):

            ./testrunner -i /tmp/win10-bucket-ops.ini -p  -t failover.AutoFailoverTests.AutoFailoverTests.test_autofailover_during_rebalance,timeout=5,num_node_failures=1,nodes_in=1,nodes_out=0,failover_action=firewall,nodes_init=3

            See for more details.
            Marking this closed.

            People

              Balakumaran.Gopal Balakumaran Gopal
              bharath.gp Bharath G P