Couchbase Server / MB-53559

[CBBS] Rebalance fails if previous leader was failed over


Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: 7.2.0
    • Affects Version/s: 7.1.0
    • Component/s: tools
    • Triage: Untriaged

    Description

      What's the problem?

      When there is only one backup service node in a cluster and it is rebalanced out, it deletes itself as the /cbbs/leader in metakv. However, if that node is failed over instead, the entry persists, which causes problems when backup service nodes are later added.
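      For illustration, here is a minimal Go sketch of that asymmetry. Everything in it (the metakvStore type, the function names) is hypothetical and only stands in for the real cbbs/metakv code:

      package main

      import "fmt"

      // metakvStore stands in for the real metakv client; a plain map is
      // enough to show the behaviour.
      type metakvStore map[string]string

      const leaderKey = "/cbbs/leader"

      // rebalanceOut is the graceful path: the departing leader deletes its
      // own /cbbs/leader entry before leaving the cluster.
      func rebalanceOut(kv metakvStore, nodeUUID string) {
          if kv[leaderKey] == nodeUUID {
              delete(kv, leaderKey)
          }
      }

      // failover is the ungraceful path: the node is already gone, so no
      // cleanup code runs and the stale leader entry persists.
      func failover(kv metakvStore, nodeUUID string) {}

      func main() {
          kv := metakvStore{leaderKey: "14bc1e6bfcd881d16b3c717d34b780da"}
          failover(kv, "14bc1e6bfcd881d16b3c717d34b780da")
          fmt.Println("leader after failover:", kv[leaderKey]) // stale UUID remains
      }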

      What happens?

      When a new backup service node is added, it tries to ask the old leader whether it has given up leadership. As that node was failed over, this will most likely fail, for example because the node is unreachable:

      ERROR (Rebalance) Could not confirm old leader stepped down {"err": "exhausted retry count after 5 attempts: could not confirm if old leader '14bc1e6bfcd881d16b3c717d34b780da' had stepped down: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp <ip>: connect: connection refused\""} 

      Any further attempt to rebalance will also fail: /cbbs/leader in metakv has now been set to this second backup service node, but because the first rebalance failed, that node is not in /cbbs/nodes, so we will see:

      ERROR (Rebalance) Could not confirm old leader stepped down {"err": "could not get old leader node address"} 
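      Both errors come out of the same step-down confirmation during rebalance. Below is a hedged Go sketch of that control flow, continuing the hypothetical metakvStore and leaderKey from the sketch above; the function names and the /cbbs/nodes lookup are inferred from the log messages, not taken from the actual cbbs source:

      import (
          "errors"
          "fmt"
      )

      // dialAndConfirm stands in for the gRPC "have you stepped down?" call;
      // against a failed-over node it is refused every time.
      func dialAndConfirm(addr string) error {
          return fmt.Errorf("dial tcp %s: connect: connection refused", addr)
      }

      // confirmOldLeaderSteppedDown sketches the rebalance step behind both
      // errors. nodes maps node UUID -> address, as /cbbs/nodes would.
      func confirmOldLeaderSteppedDown(kv metakvStore, nodes map[string]string) error {
          oldLeader, ok := kv[leaderKey]
          if !ok {
              return nil // no previous leader, nothing to confirm
          }

          // Second failure mode (repro step 9): the UUID is in /cbbs/leader
          // but the node never made it into /cbbs/nodes, so there is no
          // address to dial.
          addr, ok := nodes[oldLeader]
          if !ok {
              return errors.New("could not get old leader node address")
          }

          // First failure mode (repro step 6): the address exists, but the
          // node was failed over, so every attempt is refused until the
          // retries run out.
          var lastErr error
          for attempt := 1; attempt <= 5; attempt++ {
              if lastErr = dialAndConfirm(addr); lastErr == nil {
                  return nil
              }
          }
          return fmt.Errorf("exhausted retry count after 5 attempts: could not confirm if old leader '%s' had stepped down: %w", oldLeader, lastErr)
      }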

      What should happen?

      The design document states:

      If the previous leader node is still part of the cluster and it fails to confirm it has stepped down it will fail the rebalance.

      Actually checking whether the old leader node is still part of the cluster, and skipping the confirmation if it is not, should fix this. I believe it is also a sound change in its own right.
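      A hedged sketch of that fix, continuing the hypothetical names from the sketches above: consult the cluster manager's current node list before insisting on a step-down confirmation.

      // confirmOldLeaderSteppedDownFixed adds the proposed check: if the old
      // leader is no longer part of the cluster (e.g. it was failed over),
      // the stale /cbbs/leader entry is harmless and the step is skipped.
      func confirmOldLeaderSteppedDownFixed(kv metakvStore, nodes map[string]string,
          clusterNodes map[string]bool) error {

          oldLeader, ok := kv[leaderKey]
          if !ok {
              return nil
          }

          // New check: a failed-over node has no leadership left to hand
          // over, so there is nothing to confirm.
          if !clusterNodes[oldLeader] {
              return nil
          }

          addr, ok := nodes[oldLeader]
          if !ok {
              return errors.New("could not get old leader node address")
          }
          return dialAndConfirm(addr) // retry loop as in the previous sketch
      }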

      What's the workaround?

      Deleting /cbbs/leader from metakv and then triggering a rebalance should fix this:

      curl -i http://localhost:8091/diag/eval -d 'metakv:delete(<<"/cbbs/leader">>).' -u <username>:<password> 
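      (/diag/eval evaluates the supplied Erlang expression on the cluster manager, so metakv:delete removes the stale leader entry; the next rebalance then has no old leader to confirm against and can elect a new one.)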

      Full reproduction

      1. VAGRANT_NODES=4 vagrant up
      2. Set up node1 with data
      3. Add node2 with only the backup service & rebalance
      4. sudo systemctl stop couchbase-server on node2
      5. Failover node2
      6. Add node3 & rebalance
        1. This will fail because node2 cannot be reached (i.e. it does find the node's address in metakv)
      7. sudo systemctl stop couchbase-server on node3
      8. Failover node3, which will fail and offer to force a failover. Accept the forced failover
      9. Add node4 with only the backup service & rebalance
        1. This will fail because node3's address cannot be found in metakv


            People

              gilad.kalchheim Gilad Kalchheim
              Matt.Hall Matt Hall
              Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue
