Couchbase Server / MB-33628

Allow automatic failover of multiple nodes simultaneously


Details

    Description

      Currently, there is a constraint that only one node in the cluster may be down at a given time for auto-failover to trigger (or only one entire server group, which triggers server group failover).

      This is to prevent a split-brain situation similar to the following:

      • 4 node cluster, 2 replicas
      • Network partition between 2 pairs of nodes, so A and B can see each other and C and D can see each other

      Under the old orchestration mechanism (pre-5.5), one node in each partition would declare itself the orchestrator and, unless another node challenged it, would be able to perform the auto-failover.
      Clearly, in the scenario above each partition would have its own orchestrator which, without the constraint mentioned initially, would fail over the two nodes in the other partition, creating two versions of the same cluster (parallel universe?) and completely breaking the Couchbase CP model.

      In 5.5+, orchestration is based on leases and quorums of nodes in the cluster. For example, performing an auto-failover requires a majority quorum, i.e. floor(number_nodes / 2) + 1 nodes.

      This means that in the hypothetical scenario above, neither partition would have the necessary quorum to perform an auto-failover, as each partition has only 2 of the 4 nodes, one short of the quorum of 3.
      In fact, for any possible partition, at most one partition's orchestrator can ever hold the necessary quorum, since two strict majorities of the same cluster would have to overlap.
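      The quorum rule described above can be sketched as follows. This is an illustrative sketch only, not the actual ns_server implementation; the function names are assumptions:

      ```python
      # Hypothetical sketch of the majority-quorum rule: a partition may
      # orchestrate (and thus auto-failover) only if it holds a strict
      # majority of the nodes in the full cluster.

      def majority_quorum(cluster_size: int) -> int:
          """Smallest node count that forms a strict majority."""
          return cluster_size // 2 + 1

      def can_orchestrate(partition_size: int, cluster_size: int) -> bool:
          """True if this partition holds the majority quorum."""
          return partition_size >= majority_quorum(cluster_size)

      # 4-node cluster split 2-2: the quorum is 3, so neither side can act.
      assert majority_quorum(4) == 3
      assert not can_orchestrate(2, 4)
      ```

      Because any two strict majorities of the same cluster must share a node, at most one side of any partition can pass this check.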

      The benefit of removing the one-node constraint is clear in the next hypothetical scenario:

      • 10 node cluster, 2 replicas
      • 2 nodes become partitioned off from the other 8, meaning there's an 8-2 split.

      With the current constraint, no auto-failover would take place here, even though technically it would be safe to do so. Removing that constraint and relying purely on the quorum mechanic for orchestration means that these two nodes could be safely failed over.
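      Extending the earlier sketch, the 8-2 split can be checked by combining the quorum test with a replica-count check. Again, the names here are illustrative assumptions, not Couchbase internals:

      ```python
      # Hypothetical safety check: the surviving partition may auto-failover
      # the unreachable nodes only if (a) it holds the majority quorum and
      # (b) there are enough replicas to cover the lost nodes.

      def majority_quorum(cluster_size: int) -> int:
          return cluster_size // 2 + 1

      def safe_to_auto_failover(partition_size: int, cluster_size: int,
                                num_down: int, num_replicas: int) -> bool:
          has_quorum = partition_size >= majority_quorum(cluster_size)
          replicas_cover_loss = num_down <= num_replicas
          return has_quorum and replicas_cover_loss

      # 10 nodes, 2 replicas, 8-2 split: the 8-node side holds the quorum
      # of 6 and 2 replicas cover the 2 lost nodes, so failover is safe.
      assert safe_to_auto_failover(8, 10, num_down=2, num_replicas=2)
      # The 2-node side cannot act: no quorum, and 8 lost nodes anyway.
      assert not safe_to_auto_failover(2, 10, num_down=8, num_replicas=2)
      ```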

      I think the use of the quorum means it should now be possible to fail over multiple nodes that are down at the same time (assuming other constraints, like replica count, are met) without needing to worry about split-brain.

      Attachments

        Issue Links


          Activity

            People

              artem Artem Stemkovski
              matt.carabine Matt Carabine (Inactive)
              Votes:
              1
              Watchers:
              17

              Dates

                Created:
                Updated:
                Resolved:
