  1. Couchbase Kubernetes
  2. K8S-3471

Operator should not fall back unconditionally to hard failover if graceful failover doesn't work


Details

    • Bug
    • Resolution: Fixed
    • Critical
    • 2.6.4
    • 2.6.4
    • operator
    • 10 - Path of Vengeance
    • 4

    Description

      In 2.6.4 the Operator unconditionally falls back to hard failover if graceful failover doesn't start. I believe this logic is essentially an attempt to deal with the case that non-KV nodes don't support graceful failover and need to be hard failed over.

      In any case, this approach is problematic. In particular, graceful failover can fail for a number of reasons other than the node in question not being a KV node. The probable full list of reasons is:

      Reason | Meaning | REST API Status | Error Text
      ------ | ------- | --------------- | ----------
      last_node | Invalid attempt to failover the last KV node | 400 | "Last active node cannot be failed over."
      not_graceful | Failing over the specified nodes will result in the loss of some vbuckets | 400 | "Failover cannot be done gracefully (would lose vbuckets)."
      unknown_node | Request to failover a node that is not part of the cluster | 400 | "Unknown server given."
      inactive_node | Request to failover a node that is not active | 400 | "Inactive server given."
      in_progress | Rebalance or graceful failover is currently running | 503 | "Rebalance running."
      config_sync_failed | Cluster was unable to synchronize configuration ahead of starting graceful failover | 500 | "Failed to synchronize config to other nodes"
      non_kv_node | Attempt to failover a non-KV node | 400 | "Failover cannot be done gracefully for a node without data service. Use hard failover."

      ns-server responds with the specified response code and error text in each of these cases.

      Only in the non_kv_node case is it safe to fall back to hard failover. In every other case we shouldn't be doing this.
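
To make the distinction concrete, here is a minimal Go sketch of a check that treats only the non_kv_node response as safe for a hard failover retry. The helper name `hardFailoverFallbackSafe` is hypothetical and is not taken from the Operator code; the status and error text are the ones listed in the table above. Matching on error text is brittle, so this is an illustration of the rule rather than a recommended implementation.

```go
package main

import (
	"net/http"
	"strings"
)

// nonKVNodeText is the error text listed above for the non_kv_node case.
const nonKVNodeText = "Failover cannot be done gracefully for a node without data service. Use hard failover."

// hardFailoverFallbackSafe reports whether a rejected graceful failover
// request matches the non_kv_node case, the only case in which retrying
// as a hard failover is safe. Hypothetical helper, not Operator code.
func hardFailoverFallbackSafe(status int, body string) bool {
	return status == http.StatusBadRequest && strings.Contains(body, nonKVNodeText)
}
```

With this check, a 400 carrying the not_graceful text or a 503 "Rebalance running." would both be reported as unsafe for fallback, even though the first shares its status code with non_kv_node.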

      Overall, I think we should change the logic to:

      1. check the services running on each node ahead of time
      2. run graceful failover if the node is running KV and hard failover otherwise
      3. error out of the upgrade step if the failover fails

      And avoid the fallback logic altogether.

       


            People

              usamah.jassat Usamah Jassat
              dfinlay Dave Finlay