Details
- Type: Bug
- Resolution: Fixed
- Priority: Major
- Fix Version: 0.7.0
Description
We may need to modify the order of operations when handling a rebalance failure on scale down, prioritizing the removal of any down nodes first.
Currently, a failure while scaling down causes the node being removed to become a 'dead member', which the operator then removes. I.e., here the rebalance fails at 0% and is followed by the member being removed:
time="2018-01-11T23:26:20Z" level=info msg="Rebalance progress: 0.000000" cluster-name=test-couchbase-6q66p module=retryutil
time="2018-01-11T23:26:30Z" level=info msg="removing dead member \"test-couchbase-6q66p-0004\"" cluster-name=test-couchbase-6q66p module=cluster
time="2018-01-11T23:26:30Z" level=info msg="deleted pod (test-couchbase-6q66p-0004)" cluster-name=test-couchbase-6q66p module=cluster
time="2018-01-11T23:26:30Z" level=info msg="removed member (test-couchbase-6q66p-0004)" cluster-name=test-couchbase-6q66p module=cluster
At this point the pod is removed, but the node is still part of the Couchbase cluster, so the operator tracks it as an unmanaged node while the down node still needs to be removed:
time="2018-01-11T23:26:38Z" level=info msg="running members: test-couchbase-6q66p-0003,test-couchbase-6q66p-0000,test-couchbase-6q66p-0001,test-couchbase-6q66p-0002" cluster-name=test-couchbase-6q66p module=cluster
time="2018-01-11T23:26:38Z" level=info msg="cluster membership: test-couchbase-6q66p-0000,test-couchbase-6q66p-0001,test-couchbase-6q66p-0002,test-couchbase-6q66p-0003" cluster-name=test-couchbase-6q66p module=cluster
time="2018-01-11T23:26:38Z" level=info msg="down nodes: test-couchbase-6q66p-0000" cluster-name=test-couchbase-6q66p module=cluster
time="2018-01-11T23:26:38Z" level=info msg="unmanaged nodes: [test-couchbase-6q66p-0004.test-couchbase-6q66p.default.svc:8091]" cluster-name=test-couchbase-6q66p module=cluster
From this point on, the operator reports that the cluster is still rebalancing, which blocks any further reconciling:
time="2018-01-11T23:32:24Z" level=info msg="Skipping reconcile loop because the cluster is currently rebalancing" cluster-name=test-couchbase-6q66p module=cluster
However, the cluster isn't actually rebalancing, which may be a server issue (TBD). But we can probably avoid getting into this state altogether by taking different actions when the first rebalance fails.