Details
- Bug
- Resolution: Fixed
- Critical
- 5.5.0
- Untriaged
- 1
- Unknown
Description
A customer uses a custom automation script to perform node failover based on events. The script uses `couchbase-cli` to fail over the node, which by default triggers an "unsafe" failover (this behavior was changed in 6.6 via MB-39220).
However, even when the failover is "unsafe", the cluster manager waits 2000 ms for quorum before proceeding with the failover. During failover we call janitor:cleanup for each bucket, which goes through leader_activities and waits for quorum again, so the total quorum wait time is proportional to the number of buckets.
When the failover is specified as "unsafe", is it expected that the cluster manager waits for quorum? (Especially on a two-node cluster: if one node is down, the surviving node can never obtain quorum.)
Aliaksey looked at the logs and said:
It's an interesting corner case that we should probably address. In the meantime, a workaround for them is to run the following via /diag/eval:

```erlang
ns_config:set({timeout,{leader_activities,unsafe_preconditions_timeout}}, 0).
```
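Since /diag/eval accepts Erlang expressions over the cluster's REST interface, the workaround can be applied with a single request. A sketch, assuming the default admin port; the host and the `Administrator:password` credentials are placeholders for the customer's actual cluster:

```shell
# Apply the workaround by POSTing the expression to the node's /diag/eval endpoint.
# Replace host and credentials with values for your cluster.
curl -u Administrator:password -X POST http://127.0.0.1:8091/diag/eval \
  --data 'ns_config:set({timeout,{leader_activities,unsafe_preconditions_timeout}}, 0).'
```

Note that /diag/eval executes arbitrary code on the node, so this should only be run by an administrator as a temporary measure until a proper fix (e.g. MB-50209) is available.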
Issue Links
- relates to MB-50209 "Allow remote modification of `unsafe_preconditions_timeout`" (Resolved)