Details
Bug
Resolution: Won't Do
Major
None
2.7.20
1
Description
If a client uses the Node Health Failure Detector and has configured its MCA with IP addresses, downloading the cluster map after a cluster switch may put the detector into the alert state before the Coordinator has closed the grace period. This can prevent the detector from re-alerting when nodes actually fail, because it is already in the alert state.
Sequence of events:
1 - Nodes fail on Cluster 1; the detector goes into the alert state (RED).
2 - The Coordinator enters the grace period.
3 - The Coordinator switches to Cluster 2 and resets the detector alert state to GREEN.
4 - The cluster map is received; the client adds nodes using DNS names and disconnects from the IP addresses.
5 - The detector picks up the disconnects and goes into the alert state (RED).
6 - The Coordinator is still in the grace period, so it ignores the alert and leaves the detector in the RED state.
7 - When a node later does fail, the detector picks up the failure but is already in the RED state, so no state change is sent to the Coordinator.
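The sequence above can be replayed with a small state-machine model. This is only an illustrative sketch, not the SDK's implementation: the `Coordinator` and `Detector` classes, their methods, and the reason strings are hypothetical names invented for this example. The model assumes the detector notifies the Coordinator only on a GREEN to RED transition, and that the Coordinator drops alerts received during the grace period without resetting the detector.

```python
from enum import Enum


class State(Enum):
    GREEN = "GREEN"
    RED = "RED"


class Coordinator:
    """Hypothetical coordinator: ignores alerts while in its grace period."""

    def __init__(self):
        self.in_grace_period = False
        self.active_cluster = "cluster1"
        self.handled_alerts = []

    def on_alert(self, reason):
        if self.in_grace_period:
            # Alert is dropped; the detector is NOT reset and stays RED.
            return
        self.handled_alerts.append(reason)
        self.in_grace_period = True  # step 2: grace period begins

    def switch_cluster(self, detector, target):
        # Step 3: switch clusters and reset the detector to GREEN.
        self.active_cluster = target
        detector.state = State.GREEN

    def close_grace_period(self):
        self.in_grace_period = False


class Detector:
    """Hypothetical detector: notifies only on a GREEN -> RED transition."""

    def __init__(self, coordinator):
        self.state = State.GREEN
        self.coordinator = coordinator

    def on_disconnect(self, reason):
        """Return True if a state change was sent to the coordinator."""
        if self.state is State.RED:
            return False  # already alerting, so nothing is sent
        self.state = State.RED
        self.coordinator.on_alert(reason)
        return True


def replay_sequence():
    coordinator = Coordinator()
    detector = Detector(coordinator)

    # Steps 1-2: real failure on Cluster 1, alert handled, grace period starts.
    detector.on_disconnect("nodes failed on cluster1")
    # Step 3: switch to Cluster 2, detector reset to GREEN.
    coordinator.switch_cluster(detector, "cluster2")
    # Steps 4-6: cluster map swaps IP endpoints for DNS names; the resulting
    # disconnects turn the detector RED, but the alert is ignored.
    detector.on_disconnect("IP endpoints replaced by DNS names")
    coordinator.close_grace_period()
    # Step 7: a real failure occurs, but the detector is already RED.
    sent = detector.on_disconnect("real node failure on cluster2")
    return detector.state, sent, coordinator.handled_alerts
```

In this model, `replay_sequence()` ends with the detector still RED and the final failure never reaching the Coordinator, which mirrors the behaviour described in steps 5 to 7.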
Attached a sample from the SDK debug logs showing the sequence.