Details
- Type: Bug
- Resolution: Fixed
- Priority: Major
- Fix Version: Cheshire-Cat
- Environment: Enterprise Edition 7.0.0 build 4321
- Triage: Untriaged
- Instances: 1
- Regression: Unknown
Description
Observed a rebalance failure where a worker was terminated abnormally and the old leader address could not be fetched.
Steps to reproduce:
Initial setup:
The initial cluster setup is unknown, but perhaps it can be extracted from the logs (as they're quite short).
The cluster setup when the rebalance failure happens:
3 node cluster. Data service only on node 1. All services apart from analytics on nodes 2 and 3.
I believe the test code attempted to add nodes 2 and 3 to node 1 and perform a rebalance (see the sketch below).
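Since the actual test code is not available, here is a minimal sketch of how such a rebalance could be driven against the standard ns_server REST endpoints (/controller/addNode and /controller/rebalance). The credentials and the exact service list are assumptions, not taken from the test:
{code:go}
package main

import (
	"fmt"
	"log"
	"net/http"
	"net/url"
	"strings"
)

// post sends a form-encoded request to the ns_server REST API with basic auth.
func post(endpoint string, form url.Values) error {
	req, err := http.NewRequest("POST", endpoint, strings.NewReader(form.Encode()))
	if err != nil {
		return err
	}
	req.SetBasicAuth("Administrator", "password") // assumed credentials
	req.Header.Set("Content-Type", "application/x-www-form-urlencoded")
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("%s returned %s", endpoint, resp.Status)
	}
	return nil
}

func main() {
	base := "http://10.112.210.101:8091" // node 1, data service only

	// Add nodes 2 and 3 with every service except data and analytics.
	for _, host := range []string{"10.112.210.102", "10.112.210.103"} {
		err := post(base+"/controller/addNode", url.Values{
			"hostname": {host},
			"user":     {"Administrator"},
			"password": {"password"},
			"services": {"index,n1ql,fts,eventing,backup"},
		})
		if err != nil {
			log.Fatal(err)
		}
	}

	// Rebalance across all three nodes; knownNodes takes otpNode names.
	err := post(base+"/controller/rebalance", url.Values{
		"knownNodes":   {"ns_1@10.112.210.101,ns_1@10.112.210.102,ns_1@10.112.210.103"},
		"ejectedNodes": {""},
	})
	if err != nil {
		log.Fatal(err)
	}
}
{code}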
What happens:
The rebalance fails with the following extracts from the logs:
From the backup service logs on node 10.112.210.103 (cbcollect_info_ns_1@10.112.210.103_20210128-104243/ns_server.backup_service.log):

2021-01-28T10:37:10.665Z INFO (Rebalance) Got old leader {"leader": "af16ada7d4774db2b75b6b7d8613f6bb"}
2021-01-28T10:37:10.668Z INFO (Rebalance) Got current nodes {"#nodes": 0}
2021-01-28T10:37:10.668Z INFO (Rebalance) Setting self as leader
2021-01-28T10:37:10.672Z INFO (Rebalance) Checking that old leader stepped down
2021-01-28T10:37:10.672Z ERROR (Rebalance) Could not confirm old leader stepped down {"err": "could not get old leader node address"}
2021-01-28T10:37:10.672Z INFO (Rebalance) Rebalance done {"err": "could not get old leader node address", "state": {}, "cancelled": false}
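This sequence suggests a leader handover during rebalance: the service looks up the old leader's UUID, sees no current nodes, takes over leadership itself, and then fails to confirm the old leader stepped down because its UUID can no longer be resolved to a node address (consistent with the related MB-53559, where the previous leader had been failed over). A minimal sketch of that failing check; all identifiers here are hypothetical, not taken from the actual backup service code:
{code:go}
package main

import (
	"errors"
	"fmt"
)

var errNoAddress = errors.New("could not get old leader node address")

// cluster is a toy stand-in for the service's membership view, mapping
// node UUIDs to reachable addresses.
type cluster struct {
	addresses map[string]string // node UUID -> address
}

// addressOf resolves a node UUID to an address; it fails when that node has
// left the cluster (e.g. it was failed over before the rebalance).
func (c *cluster) addressOf(uuid string) (string, error) {
	addr, ok := c.addresses[uuid]
	if !ok {
		return "", errNoAddress
	}
	return addr, nil
}

// confirmOldLeaderSteppedDown mirrors the failing step in the log: after
// setting itself as leader, the rebalancer tries to contact the previous
// leader to verify it relinquished the role.
func confirmOldLeaderSteppedDown(c *cluster, oldLeaderUUID string) error {
	addr, err := c.addressOf(oldLeaderUUID)
	if err != nil {
		return fmt.Errorf("could not confirm old leader stepped down: %w", err)
	}
	_ = addr // contacting addr for the step-down check is elided here
	return nil
}

func main() {
	// With no current nodes known ({"#nodes": 0} in the log), resolving
	// the old leader's UUID must fail.
	c := &cluster{addresses: map[string]string{}}
	fmt.Println(confirmOldLeaderSteppedDown(c, "af16ada7d4774db2b75b6b7d8613f6bb"))
}
{code}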
From the ns_server error logs on node 10.112.210.101 (cbcollect_info_ns_1@10.112.210.101_20210128-104244/ns_server.error.log):

[ns_server:error,2021-01-28T10:37:12.257Z,ns_1@10.112.210.101:service_rebalancer-backup<0.1795.1>:service_rebalancer:run_rebalance_worker:125]Worker terminated abnormally: {'EXIT',<0.1810.1>,
    {rebalance_failed,
        {service_error,
            <<"could not get old leader node address">>}}}

[user:error,2021-01-28T10:37:12.261Z,ns_1@10.112.210.101:<0.16734.0>:ns_orchestrator:log_rebalance_completion:1402]Rebalance exited with reason {service_rebalance_failed,backup,
    {worker_died,
        {'EXIT',<0.1810.1>,
            {rebalance_failed,
                {service_error,
                    <<"could not get old leader node address">>}}}}}.
Rebalance Operation Id = 8eff1d74c5f02dcb8c99ab6ad942c5f7

[ns_server:error,2021-01-28T10:38:35.952Z,ns_1@10.112.210.101:service_rebalancer-backup<0.2546.1>:service_rebalancer:run_rebalance_worker:125]Worker terminated abnormally: {'EXIT',<0.2561.1>,
    {rebalance_failed,
        {service_error,
            <<"could not confirm self removed: could not remove self: node status is not out">>}}}

[user:error,2021-01-28T10:38:35.955Z,ns_1@10.112.210.101:<0.16734.0>:ns_orchestrator:log_rebalance_completion:1402]Rebalance exited with reason {service_rebalance_failed,backup,
    {worker_died,
        {'EXIT',<0.2561.1>,
            {rebalance_failed,
                {service_error,
                    <<"could not confirm self removed: could not remove self: node status is not out">>}}}}}.
Rebalance Operation Id = cef5852ca9c67688e2a736950464792b

[ns_server:error,2021-01-28T10:39:55.905Z,ns_1@10.112.210.101:service_rebalancer-backup<0.6605.1>:service_rebalancer:run_rebalance_worker:125]Worker terminated abnormally: {'EXIT',<0.6618.1>,
    {rebalance_failed,
        {service_error,
            <<"could not confirm self removed: could not remove self: node status is not out">>}}}

[user:error,2021-01-28T10:39:55.907Z,ns_1@10.112.210.101:<0.16734.0>:ns_orchestrator:log_rebalance_completion:1402]Rebalance exited with reason {service_rebalance_failed,backup,
    {worker_died,
        {'EXIT',<0.6618.1>,
            {rebalance_failed,
                {service_error,
                    <<"could not confirm self removed: could not remove self: node status is not out">>}}}}}.
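The later retries fail at a different step: the service cannot confirm its own removal because the node's status is never marked "out". A toy illustration of that kind of status gate; the states and names are assumptions for illustration only, not the real backup service state machine:
{code:go}
package main

import (
	"errors"
	"fmt"
)

// nodeStatus models the per-node rebalance state. The real state machine is
// not shown in this ticket, so these values are assumed.
type nodeStatus string

const (
	statusIn  nodeStatus = "in"
	statusOut nodeStatus = "out"
)

// removeSelf only completes once the leader has marked this node "out".
// If leadership was never handed over cleanly, the node can remain "in"
// indefinitely and every retry reports the error seen above.
func removeSelf(status nodeStatus) error {
	if status != statusOut {
		return errors.New("could not remove self: node status is not out")
	}
	return nil // safe to drop local state and leave the service topology
}

func main() {
	if err := removeSelf(statusIn); err != nil {
		fmt.Println("could not confirm self removed:", err)
	}
}
{code}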
What I expected to happen:
I expected the rebalance to succeed.
Logs:
[^collectinfo-2021-01-28T104244-ns_1@10.112.210.101.zip]
[^collectinfo-2021-01-28T104244-ns_1@10.112.210.102.zip]
[^collectinfo-2021-01-28T104244-ns_1@10.112.210.103.zip]
Issue Links
- relates to: MB-53559 [CBBS] Rebalance fails if previous leader was failed over (Closed)