Details
- Bug
- Resolution: Fixed
- Major
- None
Description
The fix here is to make sure the node is actually part of the cluster before attempting to reconcile it as a failed add node.
There is a window of time during which the operator adds a new member along with its pod, but the pod has not yet been added to the Couchbase cluster. If that pod is deleted before it joins the Couchbase cluster, the attempt to join fails and the member is added to the list of failed nodes:
time="2017-12-19T22:39:32Z" level=info msg="failed add nodes: test-couchbase-87srh-0002" cluster-name=test-couchbase-87srh module=cluster
Next, the operator attempts to cancel the failed add node, but this fails because the node was never part of the cluster to begin with:
time="2017-12-19T22:42:32Z" level=warning msg="add node: failed with error Hostname test-couchbase-87srh-0002.test-couchbase-87srh.default.svc is not part of the cluster ...retrying" cluster-name=test-couchbase-87srh module=retryutil
...
time="2017-12-19T22:42:32Z" level=error msg="Unable to removed a failed pending add node: still failing after 36 retries" cluster-name=test-couchbase-87srh module=cluster
The attempt to cancel the failed add node then retries forever.
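A minimal sketch of the guard described above, in Go. All names here (isClusterMember, reconcileFailedAdd) are hypothetical illustrations, not the operator's actual API: before trying to cancel a failed add node, the reconciler first checks whether the node's hostname is in the cluster's node list; if it never joined, it skips the cancel path entirely instead of retrying against a node the cluster has never seen.

```go
package main

import "fmt"

// isClusterMember reports whether hostname appears in the cluster's
// current node list. Hypothetical helper for illustration only.
func isClusterMember(clusterNodes []string, hostname string) bool {
	for _, n := range clusterNodes {
		if n == hostname {
			return true
		}
	}
	return false
}

// reconcileFailedAdd sketches the fix: only attempt to cancel a failed
// add node when the node actually joined the cluster. If the pod was
// deleted before joining, we skip the cancel so the retry loop in the
// log above never starts.
func reconcileFailedAdd(clusterNodes []string, hostname string) string {
	if !isClusterMember(clusterNodes, hostname) {
		return "skip cancel: " + hostname + " never joined the cluster"
	}
	return "cancel pending add node: " + hostname
}

func main() {
	nodes := []string{"test-couchbase-87srh-0000", "test-couchbase-87srh-0001"}
	// The failed member from the bug report is not in the node list,
	// so the guard takes the skip path instead of retrying forever.
	fmt.Println(reconcileFailedAdd(nodes, "test-couchbase-87srh-0002"))
}
```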
—
Repro:
E2E_TEST=TestNodeRecoveryKilledNewMember make test-indv