  1. Couchbase Server
  2. MB-40375

Hard/unsafe failover checks preconditions more than once

Details

    • Untriaged
    • 1
    • Unknown

    Description

      A customer uses a custom automation script to fail over nodes in response to events. The script calls `couchbase-cli` to fail over the node, which by default triggers an "unsafe" failover (this behavior changed in 6.6 via MB-39220).

      However, even when the failover is "unsafe", the cluster manager waits 2000 ms for quorum before proceeding with the failover. Then, during the failover, we call janitor:cleanup for each bucket, which goes through leader_activities and waits for quorum again. As a result, the total quorum wait time is proportional to the number of buckets.
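
To make the scaling concrete, here is a rough model of the delay described above. It is only a sketch: it assumes exactly one quorum wait before the failover starts plus one per bucket, and that every wait runs to the full timeout because quorum is never reached (as on a 2-node cluster with one node down). The function name and parameters are hypothetical and are not part of ns_server.

```python
def total_quorum_wait_ms(num_buckets: int, timeout_ms: int = 2000) -> int:
    """Estimate the time spent waiting for quorum during one unsafe failover.

    Assumption: one wait up front plus one wait per bucket (from the
    per-bucket janitor cleanup), each lasting the full timeout.
    """
    if num_buckets < 0 or timeout_ms < 0:
        raise ValueError("num_buckets and timeout_ms must be non-negative")
    return timeout_ms * (1 + num_buckets)


if __name__ == "__main__":
    for buckets in (1, 5, 10):
        print(buckets, "bucket(s):", total_quorum_wait_ms(buckets), "ms")
    # A timeout of 0 removes the modelled wait entirely.
    print("timeout 0:", total_quorum_wait_ms(10, timeout_ms=0), "ms")
```

Under these assumptions, a cluster with 10 buckets would spend about 22 seconds of a single failover just waiting for a quorum it cannot reach.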

      When the failover is specified as "unsafe", is the cluster manager expected to wait for quorum at all? (Especially on a 2-node cluster: if one node is down, the other node cannot get quorum.)

      Aliaksey looked at the logs and said:

      It's an interesting corner case that we should probably address. In the meantime, a workaround for them is to run the following via /diag/eval:

      ns_config:set({timeout,{leader_activities,unsafe_preconditions_timeout}}, 0).
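
For an automation script, the workaround could be sent to /diag/eval along these lines. This is a minimal sketch, not a supported tool: the helper name, base URL, port 8091, and credentials are placeholder assumptions, and it assumes /diag/eval accepts the Erlang expression as a POST body with HTTP basic authentication. The code only builds the request; nothing is sent unless the commented-out `urlopen` call is enabled.

```python
import base64
import urllib.request

# The expression from the workaround above, sent verbatim as the POST body.
WORKAROUND_EXPR = (
    "ns_config:set({timeout,{leader_activities,"
    "unsafe_preconditions_timeout}}, 0)."
)


def build_diag_eval_request(base_url, user, password, expr=WORKAROUND_EXPR):
    """Build (but do not send) a POST request to <base_url>/diag/eval."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/diag/eval",
        data=expr.encode("utf-8"),
        method="POST",
        headers={"Authorization": "Basic " + token},
    )


if __name__ == "__main__":
    # Placeholder address and credentials; replace with real values.
    req = build_diag_eval_request("http://127.0.0.1:8091", "Administrator", "password")
    print(req.get_method(), req.full_url)
    # To apply the workaround for real, uncomment the next line:
    # urllib.request.urlopen(req, timeout=10)
```

Because /diag/eval executes arbitrary code on the cluster manager, a script like this is best run from the node itself with administrator credentials.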
      

            People

              sumedh.basarkod Sumedh Basarkod (Inactive)
              steve.watanabe Steve Watanabe
              Votes: 0
              Watchers: 7
