Details
- Bug
- Resolution: Fixed
- Critical
- 2.6.4
- 10 - Path of Vengeance
- 4
Description
In 2.6.4 the Operator unconditionally falls back to hard failover if graceful failover doesn't start. I believe this logic is essentially an attempt to handle the fact that non-KV nodes don't support graceful failover and need to be hard failed over.
In any case, this approach is problematic. In particular, graceful failover can fail to start for a number of reasons other than the node in question not being a KV node. The probable full list of reasons is:
Reason | Meaning | REST API Status | Error Text |
---|---|---|---|
last_node | Invalid attempt to failover the last KV node | 400 | "Last active node cannot be failed over." |
not_graceful | Failing over the specified nodes will result in the loss of some vbuckets | 400 | "Failover cannot be done gracefully (would lose vbuckets)." |
unknown_node | Request to failover a node that is not part of the cluster | 400 | "Unknown server given." |
inactive_node | Request to failover a node that is not active | 400 | "Inactive server given." |
in_progress | Rebalance or graceful failover is currently running | 503 | "Rebalance running." |
config_sync_failed | Cluster was unable to synchronize configuration ahead of starting graceful failover | 500 | "Failed to synchronize config to other nodes" |
non_kv_node | Attempt to failover a non-KV node | 400 | "Failover cannot be done gracefully for a node without data service. Use hard failover." |
ns-server responds with the specified response code and error text in each of these cases.
Only in the non_kv_node case is it safe to fall back to hard failover. In every other case we shouldn't be doing this.
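To illustrate the distinction, here is a minimal Go sketch of a check that treats only the non_kv_node response as safe for hard failover. The helper name `safeToHardFailover` and the substring match on the error text are assumptions for illustration, not the Operator's actual code; the status codes and error strings are taken from the table above.

```go
package main

import (
	"fmt"
	"strings"
)

// nonKVNodeText is the start of the error text listed above for non_kv_node.
const nonKVNodeText = "Failover cannot be done gracefully for a node without data service."

// safeToHardFailover is a hypothetical helper. It reports whether a failed
// graceful failover request matches the non_kv_node case, the only case in
// which falling back to hard failover is safe.
func safeToHardFailover(status int, body string) bool {
	return status == 400 && strings.Contains(body, nonKVNodeText)
}

func main() {
	// non_kv_node: hard failover is acceptable.
	fmt.Println(safeToHardFailover(400, "Failover cannot be done gracefully for a node without data service. Use hard failover."))
	// not_graceful: same status code, but a hard failover would lose vbuckets.
	fmt.Println(safeToHardFailover(400, "Failover cannot be done gracefully (would lose vbuckets)."))
	// in_progress: a rebalance or graceful failover is already running.
	fmt.Println(safeToHardFailover(503, "Rebalance running."))
}
```

Note that the status code alone is not enough: five of the seven cases return 400, so any such check has to look at the error text as well, which is fragile and is one more argument for avoiding the fallback.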
Overall, I think we should change the logic to:
- check the services running on each node ahead of time
- run graceful failover if the node is running KV and hard failover otherwise
- error out of the upgrade step if the failover fails
And avoid the fallback logic altogether.
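The proposed logic could be sketched in Go roughly as follows. Everything here is hypothetical: the `Node` type, the `run` callback standing in for the real failover call, and the assumption that the data service appears as `"kv"` in a node's service list. It is meant only to show the shape of the change (decide the mode up front, surface any error, never fall back), not the Operator's implementation.

```go
package main

import (
	"errors"
	"fmt"
)

// FailoverMode is the kind of failover to request.
type FailoverMode string

const (
	Graceful FailoverMode = "graceful"
	Hard     FailoverMode = "hard"
)

// Node is a hypothetical view of a cluster node and the services it runs.
type Node struct {
	Name     string
	Services []string
}

// errRebalanceRunning is a stand-in for the in_progress error from the table.
var errRebalanceRunning = errors.New("Rebalance running.")

// chooseFailoverMode picks graceful failover for nodes running KV
// (assumed to be listed as "kv") and hard failover otherwise.
func chooseFailoverMode(n Node) FailoverMode {
	for _, s := range n.Services {
		if s == "kv" {
			return Graceful
		}
	}
	return Hard
}

// failoverNode runs exactly one failover attempt in the chosen mode.
// On failure it returns the error to the caller; there is no fallback.
func failoverNode(n Node, run func(Node, FailoverMode) error) error {
	mode := chooseFailoverMode(n)
	if err := run(n, mode); err != nil {
		return fmt.Errorf("%s failover of %s failed: %w", mode, n.Name, err)
	}
	return nil
}

func main() {
	data := Node{Name: "node-0", Services: []string{"kv", "index"}}
	query := Node{Name: "node-1", Services: []string{"n1ql"}}

	// A fake runner that fails graceful failover as if a rebalance were running.
	run := func(n Node, m FailoverMode) error {
		if m == Graceful {
			return errRebalanceRunning
		}
		return nil
	}

	fmt.Println(failoverNode(data, run))  // error is surfaced, no hard failover attempted
	fmt.Println(failoverNode(query, run)) // non-KV node is hard failed over directly
}
```

With this structure the upgrade step errors out on any failover failure, and the caller can decide whether to retry.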
Issue Links
- relates to: K8S-3446 Cloud Native 2.6.4 - Release Notes (Closed)
For Gerrit Dashboard: K8S-3471
# | Subject | Branch | Project | Status | CR | V |
---|---|---|---|---|---|---|
210240,3 | K8S-3471: Retry graceful failover when possible | 2.6.x | couchbase-operator | MERGED | +2 | +1 |