Couchbase Server / MB-43924

Backup Service - Rebalance failure (leader address could not be fetched)


Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version: 7.0.0
    • Affects Version: Cheshire-Cat
    • Component: tools
    • Environment: Enterprise Edition 7.0.0 build 4321

    Description


      Observed a rebalance failure where a worker was terminated abnormally and the old leader address could not be fetched.

      Steps to reproduce:

      Initial setup:

      The initial cluster setup is unknown, but it can likely be extracted from the logs (they're quite short).

      The cluster setup when the rebalance failure happens:

      A 3-node cluster: data service only on node 1; all services apart from analytics on nodes 2 and 3.

      I believe the test code attempted to add nodes 2 and 3 to node 1 and then perform a rebalance.

      What happens: 

      The rebalance fails with the following extracts from the logs:

       

      From the backup service logs on node 10.112.210.103:

      cbcollect_info_ns_1@10.112.210.103_20210128-104243/ns_server.backup_service.log

      2021-01-28T10:37:10.665Z    INFO    (Rebalance) Got old leader  {"leader": "af16ada7d4774db2b75b6b7d8613f6bb"}
      2021-01-28T10:37:10.668Z    INFO    (Rebalance) Got current nodes   {"#nodes": 0}
      2021-01-28T10:37:10.668Z    INFO    (Rebalance) Setting self as leader
      2021-01-28T10:37:10.672Z    INFO    (Rebalance) Checking that old leader stepped down
      2021-01-28T10:37:10.672Z    ERROR   (Rebalance) Could not confirm old leader stepped down   {"err": "could not get old leader node address"}
      2021-01-28T10:37:10.672Z    INFO    (Rebalance) Rebalance done  {"err": "could not get old leader node address", "state": {}, "cancelled": false}
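      The failure sequence in the log above can be sketched as follows. This is a hypothetical Go sketch, not the backup service's actual implementation (the real types and internals are not shown in this ticket): after setting itself as leader, the new leader tries to resolve the old leader's UUID against the current node list, and with "#nodes": 0 that lookup cannot succeed.

```go
package main

import (
	"errors"
	"fmt"
)

// node is a hypothetical stand-in for a cluster node as tracked by
// the backup service's metadata store.
type node struct {
	UUID    string
	Address string
}

// errNoOldLeaderAddress matches the error string seen in the log.
var errNoOldLeaderAddress = errors.New("could not get old leader node address")

// confirmOldLeaderSteppedDown mirrors the logged sequence: the new
// leader resolves the old leader's UUID to an address so it can check
// that it stepped down. With an empty node list ("#nodes": 0) the
// lookup cannot succeed, which is the failure reported here.
func confirmOldLeaderSteppedDown(oldLeaderUUID string, nodes []node) error {
	for _, n := range nodes {
		if n.UUID == oldLeaderUUID {
			// The real service would now contact the old leader's
			// endpoint; this sketch only shows the address lookup.
			return nil
		}
	}
	return errNoOldLeaderAddress
}

func main() {
	// Reproduce the ticket's scenario: a known old-leader UUID but
	// zero current nodes to resolve it against.
	err := confirmOldLeaderSteppedDown("af16ada7d4774db2b75b6b7d8613f6bb", nil)
	fmt.Println(err) // prints: could not get old leader node address
}
```

      In other words, the new leader's step-down check is starved of the data it needs: the node list it consults is empty at the moment the check runs.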
      

       

      From the ns_server error log on node 10.112.210.101:

      cbcollect_info_ns_1@10.112.210.101_20210128-104244/ns_server.error.log

      [ns_server:error,2021-01-28T10:37:12.257Z,ns_1@10.112.210.101:service_rebalancer-backup<0.1795.1>:service_rebalancer:run_rebalance_worker:125]Worker terminated abnormally: {'EXIT',<0.1810.1>,
                                     {rebalance_failed,
                                      {service_error,
                                       <<"could not get old leader node address">>}}}
      [user:error,2021-01-28T10:37:12.261Z,ns_1@10.112.210.101:<0.16734.0>:ns_orchestrator:log_rebalance_completion:1402]Rebalance exited with reason {service_rebalance_failed,backup,
                                    {worker_died,
                                     {'EXIT',<0.1810.1>,
                                      {rebalance_failed,
                                       {service_error,
                                        <<"could not get old leader node address">>}}}}}.
      Rebalance Operation Id = 8eff1d74c5f02dcb8c99ab6ad942c5f7
      [ns_server:error,2021-01-28T10:38:35.952Z,ns_1@10.112.210.101:service_rebalancer-backup<0.2546.1>:service_rebalancer:run_rebalance_worker:125]Worker terminated abnormally: {'EXIT',<0.2561.1>,
                                     {rebalance_failed,
                                      {service_error,
                                       <<"could not confirm self removed: could not remove self: node status is not out">>}}}
      [user:error,2021-01-28T10:38:35.955Z,ns_1@10.112.210.101:<0.16734.0>:ns_orchestrator:log_rebalance_completion:1402]Rebalance exited with reason {service_rebalance_failed,backup,
                                    {worker_died,
                                     {'EXIT',<0.2561.1>,
                                      {rebalance_failed,
                                       {service_error,
                                        <<"could not confirm self removed: could not remove self: node status is not out">>}}}}}.
      Rebalance Operation Id = cef5852ca9c67688e2a736950464792b
      [ns_server:error,2021-01-28T10:39:55.905Z,ns_1@10.112.210.101:service_rebalancer-backup<0.6605.1>:service_rebalancer:run_rebalance_worker:125]Worker terminated abnormally: {'EXIT',<0.6618.1>,
                                     {rebalance_failed,
                                      {service_error,
                                       <<"could not confirm self removed: could not remove self: node status is not out">>}}}
      [user:error,2021-01-28T10:39:55.907Z,ns_1@10.112.210.101:<0.16734.0>:ns_orchestrator:log_rebalance_completion:1402]Rebalance exited with reason {service_rebalance_failed,backup,
                                    {worker_died,
                                     {'EXIT',<0.6618.1>,
                                      {rebalance_failed,
                                       {service_error,
                                        <<"could not confirm self removed: could not remove self: node status is not out">>}}}}}.
      

      What I expected to happen: 

      I expected the rebalance to succeed.

      Logs:

      [^collectinfo-2021-01-28T104244-ns_1@10.112.210.101.zip]
      [^collectinfo-2021-01-28T104244-ns_1@10.112.210.102.zip]
      [^collectinfo-2021-01-28T104244-ns_1@10.112.210.103.zip]


      People

      carlos.gonzalez Carlos Gonzalez Betancort (Inactive)
      asad.zaidi Asad Zaidi (Inactive)