- Create a cluster with few nodes and buckets in it.
- Start a rebalance on the cluster after either adding a server or removing a server from the cluster
- While the rebalance is in progress, inject failure into one of the node (Failure like network failure i.e. stopping network on a node, enabling firewall on the node, stopping memcached on the node).
We detect the failure on the node immediately. But the rebalance of node does not stop even when there are failures on node(s) and is shown to be struck at whatever percentage it had completed before the node failure. Since we detect a node failure fast, we should stop the rebalance immediately or after a certain amount of time. Since we do not stop the rebalance immediately, the autofailover of a failed node is delayed till rebalance has exited or has been stopped. This causes more down time than users would want.
Note that we stop rebalance immediately in node failures like stopping the couchbase server, killing nsserver etc but when the failures are related to network, the rebalance is failed only after long wait period.