Rebalance does not fail immediately when there are node failures

Description

  1. Create a cluster with a few nodes and buckets in it.

  2. Start a rebalance on the cluster after either adding a server to or removing a server from the cluster.

  3. While the rebalance is in progress, inject a failure into one of the nodes (for example a network failure such as stopping the network on a node or enabling the firewall on the node, or stopping memcached on the node).

We detect the failure on the node immediately. However, the rebalance does not stop when one or more nodes fail; it appears stuck at whatever percentage it had completed before the node failure. Since we detect a node failure quickly, we should stop the rebalance immediately or after a certain amount of time. Because we do not, auto-failover of the failed node is delayed until the rebalance has exited or has been stopped. This causes more downtime than users would want.

Note that we do stop the rebalance immediately for node failures such as stopping the Couchbase server or killing ns_server, but when the failure is network-related, the rebalance fails only after a long wait.
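
The behaviour requested above can be sketched as a small decision function. This is only a toy model written for illustration, not ns_server code: the names ClusterState, next_action, failover_timeout and can_abort_rebalance are all hypothetical, and the action strings are labels chosen for this example.

```python
from dataclasses import dataclass


@dataclass
class ClusterState:
    node_down_seconds: float   # how long the node has been seen as down
    rebalance_running: bool    # whether a rebalance is in progress


def next_action(state: ClusterState, failover_timeout: float,
                can_abort_rebalance: bool) -> str:
    """Return the action a hypothetical orchestrator would take."""
    # The node has not been down long enough to act on it yet.
    if state.node_down_seconds < failover_timeout:
        return "wait"
    # No rebalance in the way: the node can be failed over directly.
    if not state.rebalance_running:
        return "auto_failover"
    # Requested behaviour: stop the rebalance so failover can proceed.
    if can_abort_rebalance:
        return "abort_rebalance_then_auto_failover"
    # Reported behaviour: failover waits until the rebalance exits.
    return "blocked_by_rebalance"
```

With the flag off, a node that has been down past the timeout during a rebalance yields "blocked_by_rebalance", which corresponds to the extra downtime described in this ticket.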

Components

Affects versions

Fix versions

Environment

None

Release Notes Description

None

Attachments

1
  • 28 Nov 2019, 09:20 AM

Activity

Balakumaran Gopal November 28, 2019 at 9:21 AM

Reran the following test on Enterprise Edition 6.5.0 build 4890 (IPv4):

./testrunner -i /tmp/win10-bucket-ops.ini -p -t failover.AutoFailoverTests.AutoFailoverTests.test_autofailover_during_rebalance,timeout=5,num_node_failures=1,nodes_in=1,nodes_out=0,failover_action=firewall,nodes_init=3
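
To make the test parameters in the command above easier to read, the value passed to -t can be split into the test path and its key=value settings. The helper below is a hypothetical sketch for reading the argument; it is not how testrunner itself parses it.

```python
def parse_test_spec(spec: str):
    """Split 'module.Class.test,key=value,...' into (test_path, params)."""
    # The first comma-separated field is the test path, the rest are settings.
    test_path, *pairs = spec.split(",")
    # Split each setting on the first '=' only; values stay as strings.
    params = dict(pair.split("=", 1) for pair in pairs)
    return test_path, params


spec = ("failover.AutoFailoverTests.AutoFailoverTests."
        "test_autofailover_during_rebalance,timeout=5,num_node_failures=1,"
        "nodes_in=1,nodes_out=0,failover_action=firewall,nodes_init=3")
test_path, params = parse_test_spec(spec)
```

Read this way, the rerun uses a 3-node initial cluster, adds one node, and applies a firewall failure to one node with a timeout of 5.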

See

for more details.
Marking this closed.

CB robot October 23, 2018 at 12:36 AM

Build couchbase-server-6.5.0-1467 contains ns_server commit 15388ae with commit message:
MB-24242, MB-31366: Set relevant vBuckets to ...

Abhijeeth Nuthan August 29, 2018 at 11:13 PM

The solution on the ns_server side is incomplete. We discovered that, at the beginning of delta recovery, we wait for the buckets to be warmed up in memcached. However, the janitor cleanup that marks a bucket as warmed up runs only when the rebalance of that bucket starts. Until a bucket is warmed up, client traffic is not enabled on it and the bucket is not ready. On a cluster with multiple buckets, the auto-failover logic therefore considers the node being recovered to be down until the rebalance of the last bucket starts and the last bucket becomes ready.

The issue is easy to reproduce on a cluster with more than one bucket. Use a 3-node cluster and enable auto-failover with a 5s timeout.

  1. Load all 3 sample buckets.

  2. Hard/Graceful failover nodeA.

  3. Mark nodeA for delta recovery. 

  4. Start rebalance.

  5. Auto-failover of nodeA will abort the delta recovery of nodeA. This is undesired.
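
The timing problem in the steps above can be illustrated with a toy model. It assumes buckets are rebalanced one after another, so the last bucket becomes ready only once all earlier buckets have finished; the function names and the per-bucket durations are made up for the example and are not measurements.

```python
def node_ready_at(bucket_rebalance_secs):
    """Seconds until the last bucket's rebalance starts.

    Assumption: buckets are rebalanced sequentially, and a bucket is
    marked ready only when its own rebalance starts.
    """
    # The last bucket starts after every earlier bucket has finished.
    return sum(bucket_rebalance_secs[:-1])


def delta_recovery_aborted(bucket_rebalance_secs, failover_timeout):
    """True if the node stays 'down' longer than the auto-failover timeout."""
    return node_ready_at(bucket_rebalance_secs) > failover_timeout


# Three buckets at a hypothetical 10s each, with a 5s auto-failover timeout.
three_buckets = delta_recovery_aborted([10, 10, 10], failover_timeout=5)
# A single bucket starts rebalancing right away in this model.
one_bucket = delta_recovery_aborted([10], failover_timeout=5)
```

In this model a single bucket never trips the timeout, while any extra bucket that takes longer than the timeout does, which is consistent with the issue needing more than one bucket to reproduce.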

CB robot July 10, 2018 at 3:32 AM

Build couchbase-server-6.5.0-1057 contains ns_server commit 93b00af with commit message:
MB-24242: Feature for auto-failover aborting ...

Abhijeeth Nuthan July 9, 2018 at 8:29 PM

The documents above have been updated and all the server-side changes are done. Passing it over to the UI team for changes to the auto-failover settings page to include the canAbortRebalance setting.
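
As a rough sketch of how the canAbortRebalance setting might be supplied over REST, the snippet below only builds a request object and does not send it. The only name taken from this ticket is canAbortRebalance; the /settings/autoFailover path, port 8091, and the enabled and timeout parameters are assumptions that should be checked against the Couchbase REST documentation, and build_autofailover_request is a hypothetical helper. Authentication is omitted.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_autofailover_request(host: str, timeout_secs: int,
                               can_abort_rebalance: bool) -> Request:
    """Build (but do not send) a POST carrying auto-failover settings."""
    # Form-encoded body; parameter names other than canAbortRebalance
    # are assumptions, not taken from this ticket.
    body = urlencode({
        "enabled": "true",
        "timeout": str(timeout_secs),
        "canAbortRebalance": "true" if can_abort_rebalance else "false",
    }).encode("ascii")
    # Assumed endpoint and port for the cluster manager.
    url = f"http://{host}:8091/settings/autoFailover"
    return Request(url, data=body, method="POST")


req = build_autofailover_request("127.0.0.1", 5, True)
```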

Fixed
Details

Assignee

Balakumaran Gopal

Reporter

Priority

Created May 5, 2017 at 6:50 AM
Updated November 28, 2019 at 9:21 AM
Resolved January 21, 2019 at 8:55 PM