Description
TestResizeClusterWithBucket
If a rebalance is attempted and the first attempt fails, it is retried. Even when the retried rebalance completes successfully, only a rebalance incomplete event is seen. It would make more sense to report rebalance complete in this scenario and to report rebalance incomplete only when all retries have been exhausted. Otherwise, tests will fail like this:
Expected events to be:
|
Type: Normal | Reason: NewMemberAdded | Message: New member test-couchbase-mjwqv-0000 added to cluster |
Type: Normal | Reason: BucketCreated | Message: A new bucket `default` was created |
Type: Normal | Reason: NewMemberAdded | Message: New member test-couchbase-mjwqv-0001 added to cluster |
Type: Normal | Reason: RebalanceStarted | Message: A rebalance has been started to balance data across the cluster
|
Type: Normal | Reason: RebalanceCompleted | Message: A rebalance has completed
|
Type: Normal | Reason: NewMemberAdded | Message: New member test-couchbase-mjwqv-0002 added to cluster |
Type: Normal | Reason: RebalanceStarted | Message: A rebalance has been started to balance data across the cluster
|
Type: Normal | Reason: RebalanceCompleted | Message: A rebalance has completed
|
Type: Normal | Reason: RebalanceStarted | Message: A rebalance has been started to balance data across the cluster
|
Type: Normal | Reason: MemberRemoved | Message: Existing member test-couchbase-mjwqv-0002 removed from the cluster |
Type: Normal | Reason: RebalanceCompleted | Message: A rebalance has completed
|
Type: Normal | Reason: RebalanceStarted | Message: A rebalance has been started to balance data across the cluster
|
Type: Normal | Reason: MemberRemoved | Message: Existing member test-couchbase-mjwqv-0001 removed from the cluster |
Type: Normal | Reason: RebalanceCompleted | Message: A rebalance has completed
|
|
but got:
|
Type: Normal | Reason: NewMemberAdded | Message: New member test-couchbase-mjwqv-0000 added to cluster |
Type: Normal | Reason: BucketCreated | Message: A new bucket `default` was created |
Type: Normal | Reason: NewMemberAdded | Message: New member test-couchbase-mjwqv-0001 added to cluster |
Type: Normal | Reason: RebalanceStarted | Message: A rebalance has been started to balance data across the cluster
|
Type: Normal | Reason: RebalanceCompleted | Message: A rebalance has completed
|
Type: Normal | Reason: NewMemberAdded | Message: New member test-couchbase-mjwqv-0002 added to cluster |
Type: Normal | Reason: RebalanceStarted | Message: A rebalance has been started to balance data across the cluster
|
Type: Normal | Reason: RebalanceCompleted | Message: A rebalance has completed
|
Type: Normal | Reason: RebalanceStarted | Message: A rebalance has been started to balance data across the cluster
|
Type: Normal | Reason: MemberRemoved | Message: Existing member test-couchbase-mjwqv-0002 removed from the cluster |
Type: Normal | Reason: RebalanceCompleted | Message: A rebalance has completed
|
Type: Normal | Reason: RebalanceStarted | Message: A rebalance has been started to balance data across the cluster
|
Type: Normal | Reason: RebalanceIncomplete | Message: A rebalance is incomplete
|
Type: Normal | Reason: MemberRemoved | Message: Existing member test-couchbase-mjwqv-0001 removed from the cluster |
This looks like the rebalance failed, but the cluster status before the events are checked is "balanced", "healthy", and "ready".
From the logs, this issue is caused by the rebalance status loop exiting early because the server returned an error. In this case, the logs show the following message while the rebalance status loop is running:
The loop exits, and when we then check whether a rebalance is still needed, ns_server reports that it is, because the rebalance is still in progress.
The fix is to do better error checking in the rebalance status loop. If the call to check the status fails, we need to check another node. Even if all nodes fail, we should keep retrying for a certain amount of time (maybe 60 seconds) before giving up. Below is a link to the code that needs to be improved.
https://github.com/couchbase/gocbmgr/blob/master/api.go#L132
Note that we may be unable to check the status for the full 60 seconds. In that case we should either skip raising an event or raise an event for rebalance status unknown.
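A minimal sketch of the retry behaviour proposed above, for discussion only. It is not the gocbmgr code: the names waitForRebalance, StatusFunc, Outcome, NotRunning, Unknown and errUnreachable are all hypothetical, and the status call is passed in as a function so the sketch does not assume any real API. On an error it moves on to the next node, and it only gives up once no node has answered for the whole grace period (60 seconds in the proposal).

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Outcome is a hypothetical result type for the status loop.
type Outcome int

const (
	// NotRunning: some node answered and reported no rebalance in progress.
	NotRunning Outcome = iota
	// Unknown: no node answered for the whole grace period.
	Unknown
)

// errUnreachable stands in for any error returned by a status call.
var errUnreachable = errors.New("node unreachable")

// StatusFunc is a hypothetical hook that asks one node whether a
// rebalance is still running.
type StatusFunc func(node string) (running bool, err error)

// waitForRebalance polls until a node reports that the rebalance is no
// longer running. A failed status call does not end the loop: the next
// node is tried instead. Unknown is returned only when every node has
// failed continuously for at least the grace period.
func waitForRebalance(nodes []string, status StatusFunc, grace, interval time.Duration) Outcome {
	lastAnswer := time.Now()
	for {
		answered := false
		for _, node := range nodes {
			running, err := status(node)
			if err != nil {
				continue // this node failed, try the next one
			}
			answered = true
			lastAnswer = time.Now()
			if !running {
				return NotRunning
			}
			break // still running, poll again after the interval
		}
		if !answered && time.Since(lastAnswer) >= grace {
			return Unknown
		}
		time.Sleep(interval)
	}
}

func main() {
	calls := 0
	// Simulated cluster: node-0 always errors, node-1 reports the
	// rebalance as running for its first few answers.
	flaky := func(node string) (bool, error) {
		calls++
		if node == "node-0" {
			return false, errUnreachable
		}
		return calls < 6, nil
	}
	outcome := waitForRebalance([]string{"node-0", "node-1"}, flaky, 60*time.Second, 10*time.Millisecond)
	fmt.Println("rebalance no longer running:", outcome == NotRunning)
}
```

With something like this, the caller would raise the rebalance completed or incomplete event only after a NotRunning result and the follow-up check of whether a rebalance is still needed. On Unknown it would either skip the event or raise a rebalance status unknown event. The sketch does not bound how long a rebalance that keeps reporting as running may take.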