Couchbase Server / MB-43924

Backup Service - Rebalance failure (leader address could not be fetched)


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Versions: Cheshire-Cat, 7.0.0
    • Component: tools
    • Environment: Enterprise Edition 7.0.0 build 4321

    Description

      Description:

      Observed a rebalance failure where a worker was terminated abnormally and the old leader address could not be fetched.

      Steps to reproduce:

      Initial setup:

      The initial cluster setup is unknown, but perhaps it can be extracted from the logs (as they're quite short).

      The cluster setup when the rebalance failure happens:

      3 node cluster. Data service only on node 1. All services apart from analytics on nodes 2 and 3.

      I believe the test code attempted to add nodes 2 and 3 to node 1 and perform a rebalance.
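
      For reference, this kind of add-and-rebalance can be driven through the ns_server REST API on port 8091 (POST /controller/addNode, then POST /controller/rebalance). Below is a minimal Go sketch of that sequence; the host names, credentials and service list are placeholders, not values taken from the test code:

      package main

      import (
          "fmt"
          "net/http"
          "net/url"
          "strings"
      )

      // post sends an authenticated form POST to an ns_server endpoint.
      func post(endpoint string, form url.Values) error {
          req, err := http.NewRequest("POST", endpoint, strings.NewReader(form.Encode()))
          if err != nil {
              return err
          }
          req.SetBasicAuth("Administrator", "password") // placeholder credentials
          req.Header.Set("Content-Type", "application/x-www-form-urlencoded")

          resp, err := http.DefaultClient.Do(req)
          if err != nil {
              return err
          }
          defer resp.Body.Close()
          if resp.StatusCode != http.StatusOK {
              return fmt.Errorf("%s returned %s", endpoint, resp.Status)
          }
          return nil
      }

      func main() {
          base := "http://10.112.210.101:8091" // placeholder address for node 1

          // Add a node running every service except analytics.
          if err := post(base+"/controller/addNode", url.Values{
              "hostname": {"10.112.210.102"}, // placeholder address for node 2
              "user":     {"Administrator"},
              "password": {"password"},
              "services": {"kv,index,n1ql,fts,eventing,backup"},
          }); err != nil {
              panic(err)
          }

          // Rebalance; knownNodes lists every node's otpNode name, normally
          // read from GET /pools/default rather than hard-coded.
          if err := post(base+"/controller/rebalance", url.Values{
              "knownNodes":   {"ns_1@10.112.210.101,ns_1@10.112.210.102"},
              "ejectedNodes": {""},
          }); err != nil {
              panic(err)
          }
      }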

      What happens: 

      The rebalance fails with the following extracts from the logs:

       

      From the backup service logs on node 10.112.210.103:

      cbcollect_info_ns_1@10.112.210.103_20210128-104243/ns_server.backup_service.log

      2021-01-28T10:37:10.665Z    INFO    (Rebalance) Got old leader  {"leader": "af16ada7d4774db2b75b6b7d8613f6bb"}
      2021-01-28T10:37:10.668Z    INFO    (Rebalance) Got current nodes   {"#nodes": 0}
      2021-01-28T10:37:10.668Z    INFO    (Rebalance) Setting self as leader
      2021-01-28T10:37:10.672Z    INFO    (Rebalance) Checking that old leader stepped down
      2021-01-28T10:37:10.672Z    ERROR   (Rebalance) Could not confirm old leader stepped down   {"err": "could not get old leader node address"}
      2021-01-28T10:37:10.672Z    INFO    (Rebalance) Rebalance done  {"err": "could not get old leader node address", "state": {}, "cancelled": false}
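
      For context, the sequence above corresponds to a leadership hand-off roughly like the following sketch. The Store interface, NodeID type and function names are illustrative only, not the actual cbbs implementation:

      package rebalance

      import (
          "errors"
          "fmt"
      )

      // Illustrative types only; the real cbbs code differs.
      type NodeID string

      // Store models the slice of metakv state the hand-off reads and writes.
      type Store interface {
          GetLeaderID() (NodeID, error)
          GetNodes() ([]NodeID, error)
          SetLeaderID(NodeID) error
          GetNodeAddress(NodeID) (string, error)
      }

      // takeOverLeadership mirrors the sequence in the log extract above.
      func takeOverLeadership(s Store, self NodeID) error {
          oldLeader, err := s.GetLeaderID() // "Got old leader"
          if err != nil {
              return err
          }

          nodes, err := s.GetNodes() // "Got current nodes" (0 in this failure)
          if err != nil {
              return err
          }
          fmt.Printf("current nodes: %d\n", len(nodes))

          if err := s.SetLeaderID(self); err != nil { // "Setting self as leader"
              return err
          }

          // "Checking that old leader stepped down" needs the old leader's
          // address; the removed node's entry is gone from metakv, so the
          // lookup fails and the rebalance is aborted with this error.
          if _, err := s.GetNodeAddress(oldLeader); err != nil {
              return errors.New("could not get old leader node address")
          }
          return nil
      }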
      

       

      From the ns_server error logs on node 10.112.210.101:

      cbcollect_info_ns_1@10.112.210.101_20210128-104244/ns_server.error.log

      [ns_server:error,2021-01-28T10:37:12.257Z,ns_1@10.112.210.101:service_rebalancer-backup<0.1795.1>:service_rebalancer:run_rebalance_worker:125]Worker terminated abnormally: {'EXIT',<0.1810.1>,
                                     {rebalance_failed,
                                      {service_error,
                                       <<"could not get old leader node address">>}}}
      [user:error,2021-01-28T10:37:12.261Z,ns_1@10.112.210.101:<0.16734.0>:ns_orchestrator:log_rebalance_completion:1402]Rebalance exited with reason {service_rebalance_failed,backup,
                                    {worker_died,
                                     {'EXIT',<0.1810.1>,
                                      {rebalance_failed,
                                       {service_error,
                                        <<"could not get old leader node address">>}}}}}.
      Rebalance Operation Id = 8eff1d74c5f02dcb8c99ab6ad942c5f7
      [ns_server:error,2021-01-28T10:38:35.952Z,ns_1@10.112.210.101:service_rebalancer-backup<0.2546.1>:service_rebalancer:run_rebalance_worker:125]Worker terminated abnormally: {'EXIT',<0.2561.1>,
                                     {rebalance_failed,
                                      {service_error,
                                       <<"could not confirm self removed: could not remove self: node status is not out">>}}}
      [user:error,2021-01-28T10:38:35.955Z,ns_1@10.112.210.101:<0.16734.0>:ns_orchestrator:log_rebalance_completion:1402]Rebalance exited with reason {service_rebalance_failed,backup,
                                    {worker_died,
                                     {'EXIT',<0.2561.1>,
                                      {rebalance_failed,
                                       {service_error,
                                        <<"could not confirm self removed: could not remove self: node status is not out">>}}}}}.
      Rebalance Operation Id = cef5852ca9c67688e2a736950464792b
      [ns_server:error,2021-01-28T10:39:55.905Z,ns_1@10.112.210.101:service_rebalancer-backup<0.6605.1>:service_rebalancer:run_rebalance_worker:125]Worker terminated abnormally: {'EXIT',<0.6618.1>,
                                     {rebalance_failed,
                                      {service_error,
                                       <<"could not confirm self removed: could not remove self: node status is not out">>}}}
      [user:error,2021-01-28T10:39:55.907Z,ns_1@10.112.210.101:<0.16734.0>:ns_orchestrator:log_rebalance_completion:1402]Rebalance exited with reason {service_rebalance_failed,backup,
                                    {worker_died,
                                     {'EXIT',<0.6618.1>,
                                      {rebalance_failed,
                                       {service_error,
                                        <<"could not confirm self removed: could not remove self: node status is not out">>}}}}}.
      

      What I expected to happen:

      I expected the rebalance to succeed.

      Logs:

      [^collectinfo-2021-01-28T104244-ns_1@10.112.210.101.zip]
      [^collectinfo-2021-01-28T104244-ns_1@10.112.210.102.zip]
      [^collectinfo-2021-01-28T104244-ns_1@10.112.210.103.zip]

      Attachments


        Activity

          carlos.gonzalez Carlos Gonzalez Betancort (Inactive) added a comment (edited):

          Steps to reproduce this issue:

          1. Start 3 Couchbase nodes.
          2. Add node0 with the Data Service and node1 with the Backup Service, then rebalance.
          3. Remove node1 and rebalance.
          4. Add node2 with the Backup Service and rebalance.

          This will give the error above. The issue is that when the last Backup Service node is removed, the metakv LeaderID entry is not cleaned up, so when a new node is added it tries to communicate with the previous leader, which is no longer part of the cluster (see the sketch after this comment).

          Fortunately this should be easy to fix. Note that this behaviour makes for a very good test.
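
          A minimal sketch of the cleanup described in the comment above: when the node being ejected is the last Backup Service node, the leader ID is removed from metakv as well. The package, types and helper names are illustrative, not the actual cbbs code:

          package cleanup

          // Illustrative type only; the real cbbs code differs.
          type NodeID string

          // metaKV is the slice of metakv operations the ejection path needs.
          type metaKV interface {
              RemoveNode(NodeID) error
              GetNodes() ([]NodeID, error)
              DeleteLeaderID() error
          }

          // removeSelf removes this node from the service's metakv state when
          // it is rebalanced out.
          func removeSelf(s metaKV, self NodeID) error {
              if err := s.RemoveNode(self); err != nil {
                  return err
              }

              nodes, err := s.GetNodes()
              if err != nil {
                  return err
              }

              // The missing step before the fix: if this was the last Backup
              // Service node, delete the leader ID as well, so it is not left
              // pointing at a node that is no longer in the cluster.
              if len(nodes) == 0 {
                  return s.DeleteLeaderID()
              }
              return nil
          }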

          asad.zaidi Asad Zaidi (Inactive) added a comment:

          ACK, sounds like a good test.

          build-team Couchbase Build Team added a comment:

          Build couchbase-server-7.0.0-4332 contains cbbs commit 052d601 with commit message:
          MB-43924 Remove leader ID if no nodes remain

          asad.zaidi Asad Zaidi (Inactive) added a comment:

          Closing, followed the steps to reproduce and the rebalance succeeded.

          People

            Assignee: carlos.gonzalez Carlos Gonzalez Betancort (Inactive)
            Reporter: asad.zaidi Asad Zaidi (Inactive)
            Votes: 0
            Watchers: 3
