Description
The following steps lead to a scenario where nodes are marked as Ready but are actually down, which leaves the reconcile loop failing continuously:
- Start a 3-node cluster
- Delete 2 nodes, 0001 and 0002
This leaves only node 0000 as ready:
Members:
  Index: 3
  Ready:
    Name: cb-example-0000
  Unready:
    Name: cb-example-0001
    Name: cb-example-0002
- Wait for node 0001 to be started.
- Delete node 0000 while 0001 is being started
At this point 0001 is the only node that is actually ready, but 0000 is still the only node marked as Ready.
This causes reconcile to fail because we use the Ready members as API clients. The fix is to resync the Ready members after reconcile fails.
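The idea behind the fix can be sketched as follows. This is a minimal illustration, not the operator's actual code: `PodPhase`, `resyncReady`, `reconcileWith`, `reconcileOnce` and `errNoReachableMember` are hypothetical names, and pod state is modelled as a plain map instead of being read from the Kubernetes API. It assumes that "resync" means rebuilding the Ready set from the members whose Pods are currently running.

```go
package main

import (
	"errors"
	"sort"
)

// PodPhase is a hypothetical stand-in for the Kubernetes pod phase string.
type PodPhase string

const PodRunning PodPhase = "Running"

// errNoReachableMember is a hypothetical error for a Ready set with no live Pod.
var errNoReachableMember = errors.New("no ready member has a running pod")

// resyncReady rebuilds the Ready set from every known member whose Pod is
// running. A member with no entry in phases (deleted Pod) is never ready.
func resyncReady(all []string, phases map[string]PodPhase) []string {
	var ready []string
	for _, name := range all {
		if phases[name] == PodRunning {
			ready = append(ready, name)
		}
	}
	sort.Strings(ready)
	return ready
}

// reconcileWith simulates a reconcile pass that needs at least one Ready
// member with a running Pod to act as the API client.
func reconcileWith(ready []string, phases map[string]PodPhase) error {
	for _, name := range ready {
		if phases[name] == PodRunning {
			return nil
		}
	}
	return errNoReachableMember
}

// reconcileOnce runs one pass. If reconcile fails, it records the error in
// rerr and returns a resynced Ready set so the next pass can use live members.
func reconcileOnce(ready, all []string, phases map[string]PodPhase) ([]string, error) {
	rerr := reconcileWith(ready, phases)
	if rerr != nil {
		return resyncReady(all, phases), rerr
	}
	return ready, nil
}
```

With the state from this report (0000 deleted, 0001 running, 0002 not yet running), a first pass using the stale Ready set `[cb-example-0000]` fails and returns `[cb-example-0001]`, and the next pass using that set succeeds.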
Attachments
For Gerrit Dashboard: K8S-605
# | Subject | Branch | Project | Status | CR | V |
---|---|---|---|---|---|---|
99730,5 | K8S-605: Set rerr when reconcile fails | master | couchbase-operator | MERGED | -2 | +1 |
99817,6 | K8S-605: Ensure Pods for readyMembers are running | master | couchbase-operator | MERGED | +2 | +1 |