Rebalance does not fail immediately when there are node failures
Activity
Balakumaran Gopal November 28, 2019 at 9:21 AM
Reran the following test on Enterprise Edition 6.5.0 build 4890:
./testrunner -i /tmp/win10-bucket-ops.ini -p -t failover.AutoFailoverTests.AutoFailoverTests.test_autofailover_during_rebalance,timeout=5,num_node_failures=1,nodes_in=1,nodes_out=0,failover_action=firewall,nodes_init=3
See the attached logs for more details.
Marking this closed.
CB robot October 23, 2018 at 12:36 AM
Build couchbase-server-6.5.0-1467 contains ns_server commit 15388ae with commit message:
https://couchbasecloud.atlassian.net/browse/MB-24242#icft=MB-24242, https://couchbasecloud.atlassian.net/browse/MB-31366#icft=MB-31366: Set relevant vBuckets to ...
Abhijeeth Nuthan August 29, 2018 at 11:13 PM
The solution from the ns_server side is incomplete. It was discovered that we wait for buckets to be warmed up in memcached at the beginning of delta recovery, but the janitor cleanup on a bucket, which marks the bucket as warmed up, is run only when rebalance of that bucket starts. Until the bucket is warmed up, client traffic is not enabled on it and the bucket is not ready. On a cluster with multiple buckets, the node being recovered will therefore be considered down by the auto-failover logic until rebalance of the last bucket starts and that bucket becomes ready.
The issue is easy to reproduce on a cluster with more than one bucket:
1. Create a 3-node cluster and enable auto-failover with a 5s timeout.
2. Load all 3 sample buckets.
3. Hard or graceful failover nodeA.
4. Mark nodeA for delta recovery.
5. Start rebalance.
Auto-failover of nodeA will then abort the delta recovery of nodeA. This is undesired.
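For reference, the repro steps above can be sketched with couchbase-cli. This is an illustrative sketch, not the exact commands from the test: node names (nodeA/nodeB), credentials, and the admin port are placeholder assumptions, and exact flag names may vary by release.

```shell
# Assumed cluster: nodeA, nodeB, nodeC on port 8091, Administrator/password.
# 1. Enable auto-failover with a 5s timeout.
couchbase-cli setting-autofailover -c nodeB:8091 -u Administrator -p password \
  --enable-auto-failover 1 --auto-failover-timeout 5

# 2. Load all 3 sample buckets (via the UI or cbdocloader).

# 3. Hard failover nodeA (omit --hard for a graceful failover).
couchbase-cli failover -c nodeB:8091 -u Administrator -p password \
  --server-failover nodeA:8091 --hard

# 4. Mark nodeA for delta recovery.
couchbase-cli recovery -c nodeB:8091 -u Administrator -p password \
  --server-recovery nodeA:8091 --recovery-type delta

# 5. Start the rebalance. With multiple buckets, auto-failover can fire
#    before nodeA's last bucket warms up, aborting the delta recovery.
couchbase-cli rebalance -c nodeB:8091 -u Administrator -p password
```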
CB robot July 10, 2018 at 3:32 AM
Build couchbase-server-6.5.0-1057 contains ns_server commit 93b00af with commit message:
https://couchbasecloud.atlassian.net/browse/MB-24242#icft=MB-24242: Feature for auto-failover aborting ...
Abhijeeth Nuthan July 9, 2018 at 8:29 PM
The documents above have been updated and all the server side changes are done. Passing it over to the UI team for changes to auto-failover settings page to include canAbortRebalance setting.
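For context, the canAbortRebalance setting mentioned above is exposed through the auto-failover settings REST endpoint. The following curl calls are a hedged sketch: host, credentials, and the timeout value are placeholder assumptions.

```shell
# Enable auto-failover and allow it to abort an in-flight rebalance.
# Endpoint and parameter names assume the 6.5 auto-failover settings API.
curl -s -u Administrator:password \
  -X POST http://localhost:8091/settings/autoFailover \
  -d 'enabled=true' -d 'timeout=5' -d 'canAbortRebalance=true'

# Read the settings back to confirm the flag took effect.
curl -s -u Administrator:password http://localhost:8091/settings/autoFailover
```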
1. Create a cluster with a few nodes and buckets.
2. Start a rebalance after adding a server to, or removing a server from, the cluster.
3. While the rebalance is in progress, inject a failure into one of the nodes (a network-style failure, e.g. stopping the network on the node, enabling a firewall on the node, or stopping memcached on the node).
We detect the failure on the node immediately, but the rebalance does not stop even when there are failures on the node(s); it is shown as stuck at whatever percentage it had completed before the node failure. Since we detect node failures quickly, we should stop the rebalance immediately, or after a bounded amount of time. Because we do not stop the rebalance immediately, auto-failover of the failed node is delayed until the rebalance has exited or been stopped. This causes more downtime than users would want.
Note that we do stop the rebalance immediately for node failures such as stopping the Couchbase server or killing ns_server, but when the failures are network-related, the rebalance fails only after a long wait period.
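The firewall-style failure injection described above can be approximated with iptables on the target node. This is a sketch under assumptions: the port list is based on Couchbase's standard admin/data ports, and the rules must be run as root.

```shell
# Block inbound traffic to the main Couchbase ports on the target node
# (8091-8094 for REST/views/query, 11209-11210 for memcached).
for port in 8091 8092 8093 8094 11209 11210; do
  iptables -A INPUT -p tcp --dport "$port" -j DROP
done

# Remove the same rules to restore connectivity after the test.
for port in 8091 8092 8093 8094 11209 11210; do
  iptables -D INPUT -p tcp --dport "$port" -j DROP
done
```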