There was some offline discussion with Artem on this as he reached out to me for PM input:
Hi Shivani,
Here, in https://issues.couchbase.com/browse/MB-49795, 2 KV nodes are down and it is safe to fail them over. But the autofailover limit only allows one node to be failed over.
Therefore, the current logic picks one of the 2 nodes and tries to fail it over. The safety check passes, but afterwards we try to fetch the max replica numbers from the remaining KV nodes (to ensure durability) and fail because one of the KV nodes is down.
That results in a repeated failover error until one of the nodes comes back up.
I thought about the whole situation and came to the conclusion that the simplest thing we can do is to not even try to auto-fail over a partial list of KV nodes.
So if we have a group of KV nodes that we consider unhealthy, but there isn't enough of the limit left to fail the group over as a whole, we just notify the administrator and do nothing.
Do you agree?
Thanks,
Artem
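To make the failure mode Artem describes concrete, here is a minimal, self-contained sketch of the flow. All names (safety_check, fetch_max_replica_seqnos, the node names, the data shapes) are hypothetical illustrations for this discussion, not the actual ns_server (Erlang) code:

```python
# Hypothetical sketch of the current partial-failover flow from MB-49795.
AUTOFAILOVER_LIMIT = 1
ALL_KV_NODES = {"ns_1@a", "ns_1@b", "ns_1@c", "ns_1@d"}
DOWN_NODES = {"ns_1@a", "ns_1@b"}  # 2 KV nodes down, limit allows only 1


def safety_check(candidates):
    """Stand-in for the real check: is at least one copy of every
    vbucket still available after failing over these nodes?"""
    return True  # in the MB-49795 scenario this check passes


def fetch_max_replica_seqnos(nodes):
    """Durability step: query each remaining KV node for its highest
    replica seqnos. Raises if any queried node is down."""
    for node in nodes:
        if node in DOWN_NODES:
            raise ConnectionError(f"{node} is down")
    return {}


def try_auto_failover():
    # Current logic: pick only as many down nodes as the limit allows.
    candidates = set(sorted(DOWN_NODES)[:AUTOFAILOVER_LIMIT])
    if not safety_check(candidates):
        return "unsafe"
    # The other down node is still among the "remaining" nodes, so the
    # durability query fails and the whole failover errors out.
    fetch_max_replica_seqnos(ALL_KV_NODES - candidates)
    return "failed over"


try:
    try_auto_failover()
except ConnectionError as e:
    print("failover error, will retry:", e)  # repeats until a node comes up
```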
This was my initial response:
Hi Artem,
I don’t think I agree.
We should fail over as many nodes as it is safe to, up to the max count.
What I don’t understand is the following:
>>The safety check passes, but afterwards we try to fetch the max replica numbers from the remaining KV nodes (to ensure durability) and fail because one of the KV nodes is down.
Why is the durability factor checked when failing over? That should never be the case. We know a majority may not be achievable after failover. As long as there is one data copy for all vbuckets, we should fail over (which is probably what the safety check is). Is this durability check something new you have added?
Also, we should fix the parameter, UI and error messages to say ‘max number of auto-failover nodes’ rather than ‘max number of auto-failover events’. Let me know if you would like me to file this bug.
Thanks
--Shivani
But then Artem explained further (and we also discussed in a meeting):
Hi Shivani,
For durability-aware failover we need to promote the replicas with the highest seqnos. To do that, we need to find out which replicas have the highest seqnos, so for each chain that lost its master partition we need to query the seqnos of its replicas from the replica nodes. If one such node is down, the failover will fail. In most cases, durability-aware failover needs all other KV nodes to be available and responding in order to succeed, which is not the case if we fail over just a portion of the down KV nodes.
Ticket for the autofailover limit label: https://issues.couchbase.com/browse/MB-49563. The change should already be in place.
Thanks,
Artem
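To make Artem's explanation concrete, here is a hedged sketch of the replica-promotion step: for each vbucket chain that lost its active copy, the replica with the highest seqno must be promoted, which requires querying every replica node in the chain. The function and data shapes are illustrative assumptions, not the real ns_server implementation:

```python
# Hypothetical sketch: why durability-aware failover needs the other KV nodes.
def pick_replica_to_promote(chain, seqno_of, down_nodes):
    """chain: replica nodes for one vbucket whose active copy was lost.
    Returns the replica node holding the highest seqno."""
    best_node, best_seqno = None, -1
    for node in chain:
        if node in down_nodes:
            # We cannot read this replica's seqno, so we cannot prove any
            # other replica is at least as up to date: failover must fail
            # rather than risk rolling back durable (majority-acked) writes.
            raise RuntimeError(f"replica node {node} is down")
        if seqno_of[node] > best_seqno:
            best_node, best_seqno = node, seqno_of[node]
    return best_node


# Example: a vbucket's replicas live on b and c; b is down, so promotion
# (and therefore the failover) fails.
try:
    pick_replica_to_promote(["ns_1@b", "ns_1@c"],
                            {"ns_1@c": 100}, down_nodes={"ns_1@b"})
except RuntimeError as e:
    print("durability-aware failover aborted:", e)
```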
So basically, there is a risk of losing previously completed durable writes if we pick one of the nodes to fail over and leave the other one as is. Hence, the decision we came to is the following:
If failing over all of these nodes would exceed the auto-failover node limit, then do not fail over any of them. In other words, all-or-nothing behavior for multi-node concurrent failover.
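A minimal sketch of the agreed all-or-nothing policy (function and parameter names are illustrative assumptions, not the actual implementation):

```python
# Hypothetical sketch of the all-or-nothing decision: fail over the whole
# unhealthy group or nothing, and alert the administrator in the latter case.
def decide_failover(unhealthy_nodes, limit_remaining):
    if len(unhealthy_nodes) > limit_remaining:
        return [], "notify administrator: group exceeds auto-failover limit"
    return sorted(unhealthy_nodes), None


nodes, warning = decide_failover({"ns_1@a", "ns_1@b"}, limit_remaining=1)
print(nodes, warning)  # [] plus a warning: nothing is failed over
```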
Artem Stemkovski was also going to double-check with Aliaksey Artamonau and Dave Finlay.
Additionally I made the following request:
As for the autofailover limit label, we should fix it in all places (not just the UI). E.g., the error message says the following:
Could not auto-failover more nodes (['ns_1@172.23.100.14']). Maximum number of auto-failover events (1) has been reached
The error message should say 'maximum number of auto-failover nodes has been reached' and not use the word 'events'. I did not file a bug for fixing the error messages, as Artem said he would take care of fixing them. Let me know if a bug should be filed.
I did file a DOC bug for the same: DOC-9489
This seems to be unrelated to the System Events feature. Meni Hillel: I am not sure, but probably Artem should take a look since it relates to multi-node failover?