Details

Type: Bug
Resolution: Fixed
Priority: Critical
Fix Version/s: 2.6.4
Affects Version/s: 2.6.4
Component/s: operator
Labels:
- releasenote

Sprint:
10 - Path of Vengeance
Story Points:
4

Description

In 2.6.4 the Operator unconditionally falls back to hard failover if graceful failover doesn't start. I believe this logic is essentially an attempt to deal with the case that non-KV nodes don't support graceful failover and need to be hard failed over.

In any case, this approach is problematic. In particular, graceful failover can fail to start for a number of reasons other than the node in question not being a KV node. The probable full list of reasons is:

Reason	Meaning	REST API Status	Error Text
last_node	Invalid attempt to failover the last KV node	400	"Last active node cannot be failed over."
not_graceful	Failing over the specified nodes will result in the loss of some vbuckets	400	"Failover cannot be done gracefully (would lose vbuckets)."
unknown_node	Request to failover a node that is not part of the cluster	400	"Unknown server given."
inactive_node	Request to failover a node that is not active	400	"Inactive server given."
in_progress	Rebalance or graceful failover is currently running	503	"Rebalance running."
config_sync_failed	Cluster was unable to synchronize configuration ahead of starting graceful failover	500	"Failed to synchronize config to other nodes"
non_kv_node	Attempt to failover a non-KV node	400	"Failover cannot be done gracefully for a node without data service. Use hard failover."

ns-server responds with the specified response code and error text in each of these cases.

Only in the non_kv_node case is it safe to fall back to hard failover. In every other case we shouldn't be doing this.
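To illustrate why the status code alone can't separate these cases (five of the seven reasons return 400), here is a minimal Python sketch that classifies a failed graceful failover response using the table above. The function and constant names are hypothetical; only the status codes and error texts are taken from the table, and matching on the exact response body is an assumption about how the response would be inspected.

```python
from typing import Optional

# (REST status, error text) pairs copied from the table above, keyed by reason.
GRACEFUL_FAILOVER_ERRORS = {
    "last_node": (400, "Last active node cannot be failed over."),
    "not_graceful": (400, "Failover cannot be done gracefully (would lose vbuckets)."),
    "unknown_node": (400, "Unknown server given."),
    "inactive_node": (400, "Inactive server given."),
    "in_progress": (503, "Rebalance running."),
    "config_sync_failed": (500, "Failed to synchronize config to other nodes"),
    "non_kv_node": (
        400,
        "Failover cannot be done gracefully for a node without data service. "
        "Use hard failover.",
    ),
}


def classify_failure(status: int, body: str) -> Optional[str]:
    """Return the reason name for a failed graceful failover, or None if unrecognised."""
    # Assumption: the body is the bare error text, possibly wrapped in quotes.
    text = body.strip().strip('"')
    for reason, (code, message) in GRACEFUL_FAILOVER_ERRORS.items():
        if status == code and text == message:
            return reason
    return None


def hard_failover_is_safe(reason: Optional[str]) -> bool:
    """Only the non_kv_node case may fall back to hard failover."""
    return reason == "non_kv_node"
```

An unrecognised response classifies as None and is therefore never treated as safe for hard failover.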

Overall, I think we should change the logic to:

- check the services running on each node ahead of time
- run graceful failover if the node is running KV and hard failover otherwise
- error out of the upgrade step if the failover fails

And avoid the fallback logic altogether.
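The proposed flow could look roughly like the following Python sketch. It is not the Operator's code: `Node`, `FailoverError`, `choose_failover_mode`, and `failover_node` are hypothetical names, the `"kv"` service identifier is an assumption, and the two failover actions are passed in as plain callables standing in for the real REST calls.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet


@dataclass(frozen=True)
class Node:
    hostname: str
    services: FrozenSet[str]  # assumed service names, e.g. {"kv", "index"}


class FailoverError(Exception):
    """Raised so the caller can error out of the upgrade step."""


def choose_failover_mode(node: Node) -> str:
    # Decide ahead of time from the node's services, not from a failed attempt.
    return "graceful" if "kv" in node.services else "hard"


def failover_node(
    node: Node,
    graceful: Callable[[Node], None],
    hard: Callable[[Node], None],
) -> str:
    """Run exactly one kind of failover; never fall back from one to the other."""
    mode = choose_failover_mode(node)
    action = graceful if mode == "graceful" else hard
    try:
        action(node)
    except Exception as err:
        raise FailoverError(
            f"{mode} failover of {node.hostname} failed: {err}"
        ) from err
    return mode
```

The point of the design is that a failed graceful failover surfaces as an error to the upgrade step, and the hard failover callable is never invoked for a KV node.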

Issue Links

relates to: K8S-3446 Cloud Native 2.6.4 - Release Notes (Closed)

People

Assignee: Usamah Jassat

Reporter: Dave Finlay

Votes: 0

Watchers: 6

Dates

Created: 11/May/24 3:49 PM

Updated: 25/Jun/24 7:52 AM

Resolved: 22/May/24 1:51 AM

Gerrit Reviews

There are no open Gerrit changes

There is 1 closed Gerrit change:

K8S-3471: Retry graceful failover when possible

Operator should not fallback unconditionally to hard failover if graceful failover doesn't work
