Details
- Type: Bug
- Resolution: Fixed
- Priority: Major
- Fix Version: 0.7.0
Description
We may need to modify the order of operations when handling a rebalance failure on scale down, prioritizing the removal of any down nodes first.
Currently, a failure while scaling down causes the node being removed to become a 'dead member', which the operator then removes. I.e., here the rebalance fails at 0% and is followed by the member being removed:
time="2018-01-11T23:26:20Z" level=info msg="Rebalance progress: 0.000000" cluster-name=test-couchbase-6q66p module=retryutil
time="2018-01-11T23:26:30Z" level=info msg="removing dead member \"test-couchbase-6q66p-0004\"" cluster-name=test-couchbase-6q66p module=cluster
time="2018-01-11T23:26:30Z" level=info msg="deleted pod (test-couchbase-6q66p-0004)" cluster-name=test-couchbase-6q66p module=cluster
time="2018-01-11T23:26:30Z" level=info msg="removed member (test-couchbase-6q66p-0004)" cluster-name=test-couchbase-6q66p module=cluster
At this point the pod is removed, but the node is still part of the Couchbase cluster, so the operator tracks it as an unmanaged node while the down node still needs to be removed:
time="2018-01-11T23:26:38Z" level=info msg="running members: test-couchbase-6q66p-0003,test-couchbase-6q66p-0000,test-couchbase-6q66p-0001,test-couchbase-6q66p-0002" cluster-name=test-couchbase-6q66p module=cluster
time="2018-01-11T23:26:38Z" level=info msg="cluster membership: test-couchbase-6q66p-0000,test-couchbase-6q66p-0001,test-couchbase-6q66p-0002,test-couchbase-6q66p-0003" cluster-name=test-couchbase-6q66p module=cluster
time="2018-01-11T23:26:38Z" level=info msg="down nodes: test-couchbase-6q66p-0000" cluster-name=test-couchbase-6q66p module=cluster
time="2018-01-11T23:26:38Z" level=info msg="unmanaged nodes: [test-couchbase-6q66p-0004.test-couchbase-6q66p.default.svc:8091]" cluster-name=test-couchbase-6q66p module=cluster
From this point on, the operator reports that the cluster is still rebalancing, which blocks any further reconciling:
time="2018-01-11T23:32:24Z" level=info msg="Skipping reconcile loop because the cluster is currently rebalancing" cluster-name=test-couchbase-6q66p module=cluster
However, the cluster isn't actually rebalancing, which may be a server issue (TBD). But we can probably avoid getting into this state altogether by taking different actions when the first rebalance fails.