Couchbase Server / MB-33628

Allow automatic failover of multiple nodes simultaneously


Details

    Description

      Currently, there is a constraint that only one node in the cluster may be down at a given time for auto-failover to trigger (or only one entire server group, which triggers server group failover).

      This is to prevent a split-brain situation similar to the following:

      • 4 node cluster, 2 replicas
      • Network partition between 2 pairs of nodes, so A and B can see each other and C and D can see each other

      Under the old orchestration mechanism (pre-5.5), one node in each partition would declare itself the orchestrator and, unless another node challenged it, would be able to perform the auto-failover.
      Clearly, in the scenario above each partition would have its own orchestrator which, without the constraint mentioned initially, would fail over the two nodes in the other partition, creating two versions of the same cluster (parallel universe?) and completely breaking the Couchbase CP model.

      In 5.5+, orchestration is based on leases and quorums of nodes in the cluster. For example, performing an auto-failover requires a majority quorum, i.e. floor(number_nodes / 2) + 1 nodes.

      This means that in the hypothetical scenario above, neither partition would have the necessary quorum to perform an auto-failover, as each partition has only 2 of the 4 nodes, one short of the quorum of 3.
      In fact, for any possible partition, at most one partition's orchestrator can ever hold the necessary quorum, since two strict majorities of the same cluster would have to overlap.
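      The quorum rule described above can be sketched as follows. This is an illustrative sketch only, not the actual ns_server implementation; the function names are assumptions:

      ```python
      # Hypothetical sketch of the majority-quorum rule: a partition may
      # orchestrate (and thus auto-failover) only if it holds a strict
      # majority of the nodes in the full cluster.

      def majority_quorum(cluster_size: int) -> int:
          """Smallest node count that forms a strict majority."""
          return cluster_size // 2 + 1

      def can_orchestrate(partition_size: int, cluster_size: int) -> bool:
          """True if this partition holds the majority quorum."""
          return partition_size >= majority_quorum(cluster_size)

      # 4-node cluster split 2-2: the quorum is 3, so neither side can act.
      assert majority_quorum(4) == 3
      assert not can_orchestrate(2, 4)
      ```

      Because any two strict majorities of the same cluster must share a node, at most one side of any partition can pass this check.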

      The benefit of removing the one-node constraint is clear in the next hypothetical scenario:

      • 10 node cluster, 2 replicas
      • 2 nodes become partitioned off from the other 8, meaning there's an 8-2 split.

      With the current constraint, no auto-failover would take place here, even though technically it would be safe to do so. Removing that constraint and relying purely on the quorum mechanic for orchestration means that these two nodes could be safely failed over.
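      Extending the earlier sketch, the 8-2 split can be checked by combining the quorum test with a replica-count check. Again, the names here are illustrative assumptions, not Couchbase internals:

      ```python
      # Hypothetical safety check: the surviving partition may auto-failover
      # the unreachable nodes only if (a) it holds the majority quorum and
      # (b) there are enough replicas to cover the lost nodes.

      def majority_quorum(cluster_size: int) -> int:
          return cluster_size // 2 + 1

      def safe_to_auto_failover(partition_size: int, cluster_size: int,
                                num_down: int, num_replicas: int) -> bool:
          has_quorum = partition_size >= majority_quorum(cluster_size)
          replicas_cover_loss = num_down <= num_replicas
          return has_quorum and replicas_cover_loss

      # 10 nodes, 2 replicas, 8-2 split: the 8-node side holds the quorum
      # of 6 and 2 replicas cover the 2 lost nodes, so failover is safe.
      assert safe_to_auto_failover(8, 10, num_down=2, num_replicas=2)
      # The 2-node side cannot act: no quorum, and 8 lost nodes anyway.
      assert not safe_to_auto_failover(2, 10, num_down=8, num_replicas=2)
      ```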

      I think the use of the quorum means it should now be possible to fail over multiple nodes that are down at the same time (assuming other constraints, like replica count, are met) without needing to worry about split-brain.

      Attachments

        Issue Links


          Activity

            People

              artem Artem Stemkovski
              matt.carabine Matt Carabine (Inactive)
              Votes:
              1
              Watchers:
              17

              Dates

                Created:
                Updated:
                Resolved:
