Couchbase Server / MB-43924

Backup Service - Rebalance failure (leader address could not be fetched)


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Versions: Cheshire-Cat, 7.0.0
    • Component: tools
    • Environment: Enterprise Edition 7.0.0 build 4321

    Description

      Description:

      Observed a rebalance failure where a worker was terminated abnormally and the old leader address could not be fetched.

      Steps to reproduce:

      Initial setup:

      The initial cluster setup is unknown, but perhaps it can be extracted from the logs (as they're quite short).

      The cluster setup when the rebalance failure happens:

      3 node cluster. Data service only on node 1. All services apart from analytics on nodes 2 and 3.

      I believe the test code attempted to add nodes 2 and 3 to node 1 and perform a rebalance.
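
      For reference, this kind of add-and-rebalance can be driven through the ns_server REST API on port 8091 (POST /controller/addNode, then POST /controller/rebalance). Below is a minimal Go sketch of that sequence; the host names, credentials and service list are placeholders, not values taken from the test code:

      package main

      import (
          "fmt"
          "net/http"
          "net/url"
          "strings"
      )

      // post sends an authenticated form POST to an ns_server endpoint.
      func post(endpoint string, form url.Values) error {
          req, err := http.NewRequest("POST", endpoint, strings.NewReader(form.Encode()))
          if err != nil {
              return err
          }
          req.SetBasicAuth("Administrator", "password") // placeholder credentials
          req.Header.Set("Content-Type", "application/x-www-form-urlencoded")

          resp, err := http.DefaultClient.Do(req)
          if err != nil {
              return err
          }
          defer resp.Body.Close()
          if resp.StatusCode != http.StatusOK {
              return fmt.Errorf("%s returned %s", endpoint, resp.Status)
          }
          return nil
      }

      func main() {
          base := "http://10.112.210.101:8091" // placeholder address for node 1

          // Add a node running every service except analytics.
          if err := post(base+"/controller/addNode", url.Values{
              "hostname": {"10.112.210.102"}, // placeholder address for node 2
              "user":     {"Administrator"},
              "password": {"password"},
              "services": {"kv,index,n1ql,fts,eventing,backup"},
          }); err != nil {
              panic(err)
          }

          // Rebalance; knownNodes lists every node's otpNode name, normally
          // read from GET /pools/default rather than hard-coded.
          if err := post(base+"/controller/rebalance", url.Values{
              "knownNodes":   {"ns_1@10.112.210.101,ns_1@10.112.210.102"},
              "ejectedNodes": {""},
          }); err != nil {
              panic(err)
          }
      }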

      What happens: 

      The rebalance fails with the following extracts from the logs:

       

      From the backup service logs on node 10.112.210.103:

      cbcollect_info_ns_1@10.112.210.103_20210128-104243/ns_server.backup_service.log

      2021-01-28T10:37:10.665Z    INFO    (Rebalance) Got old leader  {"leader": "af16ada7d4774db2b75b6b7d8613f6bb"}
      2021-01-28T10:37:10.668Z    INFO    (Rebalance) Got current nodes   {"#nodes": 0}
      2021-01-28T10:37:10.668Z    INFO    (Rebalance) Setting self as leader
      2021-01-28T10:37:10.672Z    INFO    (Rebalance) Checking that old leader stepped down
      2021-01-28T10:37:10.672Z    ERROR   (Rebalance) Could not confirm old leader stepped down   {"err": "could not get old leader node address"}
      2021-01-28T10:37:10.672Z    INFO    (Rebalance) Rebalance done  {"err": "could not get old leader node address", "state": {}, "cancelled": false}
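
      For context, the sequence above corresponds to a leadership hand-off roughly like the following sketch. The Store interface, NodeID type and function names are illustrative only, not the actual cbbs implementation:

      package rebalance

      import (
          "errors"
          "fmt"
      )

      // Illustrative types only; the real cbbs code differs.
      type NodeID string

      // Store models the slice of metakv state the hand-off reads and writes.
      type Store interface {
          GetLeaderID() (NodeID, error)
          GetNodes() ([]NodeID, error)
          SetLeaderID(NodeID) error
          GetNodeAddress(NodeID) (string, error)
      }

      // takeOverLeadership mirrors the sequence in the log extract above.
      func takeOverLeadership(s Store, self NodeID) error {
          oldLeader, err := s.GetLeaderID() // "Got old leader"
          if err != nil {
              return err
          }

          nodes, err := s.GetNodes() // "Got current nodes" (0 in this failure)
          if err != nil {
              return err
          }
          fmt.Printf("current nodes: %d\n", len(nodes))

          if err := s.SetLeaderID(self); err != nil { // "Setting self as leader"
              return err
          }

          // "Checking that old leader stepped down" needs the old leader's
          // address; the removed node's entry is gone from metakv, so the
          // lookup fails and the rebalance is aborted with this error.
          if _, err := s.GetNodeAddress(oldLeader); err != nil {
              return errors.New("could not get old leader node address")
          }
          return nil
      }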
      

       

      From the ns_server error logs on node 10.112.210.101:

      cbcollect_info_ns_1@10.112.210.101_20210128-104244/ns_server.error.log

      [ns_server:error,2021-01-28T10:37:12.257Z,ns_1@10.112.210.101:service_rebalancer-backup<0.1795.1>:service_rebalancer:run_rebalance_worker:125]Worker terminated abnormally: {'EXIT',<0.1810.1>,
                                     {rebalance_failed,
                                      {service_error,
                                       <<"could not get old leader node address">>}}}
      [user:error,2021-01-28T10:37:12.261Z,ns_1@10.112.210.101:<0.16734.0>:ns_orchestrator:log_rebalance_completion:1402]Rebalance exited with reason {service_rebalance_failed,backup,
                                    {worker_died,
                                     {'EXIT',<0.1810.1>,
                                      {rebalance_failed,
                                       {service_error,
                                        <<"could not get old leader node address">>}}}}}.
      Rebalance Operation Id = 8eff1d74c5f02dcb8c99ab6ad942c5f7
      [ns_server:error,2021-01-28T10:38:35.952Z,ns_1@10.112.210.101:service_rebalancer-backup<0.2546.1>:service_rebalancer:run_rebalance_worker:125]Worker terminated abnormally: {'EXIT',<0.2561.1>,
                                     {rebalance_failed,
                                      {service_error,
                                       <<"could not confirm self removed: could not remove self: node status is not out">>}}}
      [user:error,2021-01-28T10:38:35.955Z,ns_1@10.112.210.101:<0.16734.0>:ns_orchestrator:log_rebalance_completion:1402]Rebalance exited with reason {service_rebalance_failed,backup,
                                    {worker_died,
                                     {'EXIT',<0.2561.1>,
                                      {rebalance_failed,
                                       {service_error,
                                        <<"could not confirm self removed: could not remove self: node status is not out">>}}}}}.
      Rebalance Operation Id = cef5852ca9c67688e2a736950464792b
      [ns_server:error,2021-01-28T10:39:55.905Z,ns_1@10.112.210.101:service_rebalancer-backup<0.6605.1>:service_rebalancer:run_rebalance_worker:125]Worker terminated abnormally: {'EXIT',<0.6618.1>,
                                     {rebalance_failed,
                                      {service_error,
                                       <<"could not confirm self removed: could not remove self: node status is not out">>}}}
      [user:error,2021-01-28T10:39:55.907Z,ns_1@10.112.210.101:<0.16734.0>:ns_orchestrator:log_rebalance_completion:1402]Rebalance exited with reason {service_rebalance_failed,backup,
                                    {worker_died,
                                     {'EXIT',<0.6618.1>,
                                      {rebalance_failed,
                                       {service_error,
                                        <<"could not confirm self removed: could not remove self: node status is not out">>}}}}}.
      

      What I expected to happen:

      I expected the rebalance to succeed.

      Logs:

      [^collectinfo-2021-01-28T104244-ns_1@10.112.210.101.zip]
      [^collectinfo-2021-01-28T104244-ns_1@10.112.210.102.zip]
      [^collectinfo-2021-01-28T104244-ns_1@10.112.210.103.zip]

      Attachments


        Activity

          carlos.gonzalez Carlos Gonzalez Betancort (Inactive) added a comment (edited):

          Steps to reproduce this issue:

          1. Start 3 Couchbase nodes.
          2. Add node0 with the Data Service and node1 with the Backup Service, then rebalance.
          3. Remove node1 and rebalance.
          4. Add node2 with the Backup Service and rebalance.

          This will give the error above. The issue is that when the last Backup Service node is removed, the metakv LeaderID entry is not cleaned up, so when a new node is added it tries to communicate with the previous leader, which is no longer part of the cluster (see the sketch after this comment).

          Fortunately this should be easy to fix. Note that this behaviour makes for a very good test.
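
          A minimal sketch of the cleanup described in the comment above: when the node being ejected is the last Backup Service node, the leader ID is removed from metakv as well. The package, types and helper names are illustrative, not the actual cbbs code:

          package cleanup

          // Illustrative type only; the real cbbs code differs.
          type NodeID string

          // metaKV is the slice of metakv operations the ejection path needs.
          type metaKV interface {
              RemoveNode(NodeID) error
              GetNodes() ([]NodeID, error)
              DeleteLeaderID() error
          }

          // removeSelf removes this node from the service's metakv state when
          // it is rebalanced out.
          func removeSelf(s metaKV, self NodeID) error {
              if err := s.RemoveNode(self); err != nil {
                  return err
              }

              nodes, err := s.GetNodes()
              if err != nil {
                  return err
              }

              // The missing step before the fix: if this was the last Backup
              // Service node, delete the leader ID as well, so it is not left
              // pointing at a node that is no longer in the cluster.
              if len(nodes) == 0 {
                  return s.DeleteLeaderID()
              }
              return nil
          }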

          asad.zaidi Asad Zaidi (Inactive) added a comment:

          ACK, sounds like a good test.

          build-team Couchbase Build Team added a comment:

          Build couchbase-server-7.0.0-4332 contains cbbs commit 052d601 with commit message:
          MB-43924 Remove leader ID if no nodes remain

          asad.zaidi Asad Zaidi (Inactive) added a comment:

          Closing, followed the steps to reproduce and the rebalance succeeded.

          People

            Assignee: carlos.gonzalez Carlos Gonzalez Betancort (Inactive)
            Reporter: asad.zaidi Asad Zaidi (Inactive)
            Votes: 0
            Watchers: 3
