Details
- Bug
- Resolution: Fixed
- Critical
- 7.1.0
- Untriaged
- 1
- Unknown
Description
What's the problem?
When there is only one backup service node in a cluster and it is rebalanced out, it deletes itself as the /cbbs/leader in metakv. However, if that node is failed over instead, the /cbbs/leader entry persists, which causes problems when backup service nodes are added later.
What happens?
When a new backup service node is added, it will try to ask the old one whether it has given up leadership. As that node was failed over, this will most likely fail, for example because the node is not reachable:
ERROR (Rebalance) Could not confirm old leader stepped down {"err": "exhausted retry count after 5 attempts: could not confirm if old leader '14bc1e6bfcd881d16b3c717d34b780da' had stepped down: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp <ip>: connect: connection refused\""}
Any further attempt to rebalance will also fail. /cbbs/leader in metakv is now set to this second backup service node, but because the first rebalance failed, that node was never added to /cbbs/nodes, so we will see:
ERROR (Rebalance) Could not confirm old leader stepped down {"err": "could not get old leader node address"}
What should happen?
The design document states:
If the previous leader node is still part of the cluster and it fails to confirm it has stepped down it will fail the rebalance.
Actually checking whether the leader node is still part of the cluster, and skipping the confirmation if it is not, should fix this. I believe this is also a sound change in its own right.
What's the workaround?
Deleting /cbbs/leader from metakv and then triggering a rebalance should fix this:
curl -i http://localhost:8091/diag/eval -d 'metakv:delete(<<"/cbbs/leader">>).' -u <username>:<password>
Full reproduction
- VAGRANT_NODES=4 vagrant up
- Setup node1 with data
- Add node2 with only backup & rebalance
- sudo systemctl stop couchbase-server on node2
- Failover node2
- Add node3 & rebalance
- This will fail due to not being able to connect to node2 (i.e. it does find node2's address in metakv)
- sudo systemctl stop couchbase-server on node3
- Failover node3, which will fail and offer to force a failover; accept the forced failover
- Add node4 with only backup & rebalance
- This will fail due to not being able to find node3's address in metakv
Issue Links
- duplicates MB-51355 Backup: Rebalance failed with reason "could not confirm self removed: could not remove self: exhausted retry count after 5 attempts: node status is not out" (Closed)