Couchbase Server / MB-53559

[CBBS] Rebalance fails if previous leader was failed over


Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: 7.2.0
    • Affects Version/s: 7.1.0
    • Component/s: tools
    • Triage: Untriaged

    Description

      What's the problem?

      When there is only one backup service node in a cluster and it is rebalanced out, it deletes itself as the /cbbs/leader in metakv. However, if that node is failed over instead, the entry persists, which causes problems when backup service nodes are later added.
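      For illustration, here is a minimal Go sketch of that asymmetry. Everything in it (the metakvStore type, the function names) is hypothetical and only stands in for the real cbbs/metakv code:

      package main

      import "fmt"

      // metakvStore stands in for the real metakv client; a plain map is
      // enough to show the behaviour.
      type metakvStore map[string]string

      const leaderKey = "/cbbs/leader"

      // rebalanceOut is the graceful path: the departing leader deletes its
      // own /cbbs/leader entry before leaving the cluster.
      func rebalanceOut(kv metakvStore, nodeUUID string) {
          if kv[leaderKey] == nodeUUID {
              delete(kv, leaderKey)
          }
      }

      // failover is the ungraceful path: the node is already gone, so no
      // cleanup code runs and the stale leader entry persists.
      func failover(kv metakvStore, nodeUUID string) {}

      func main() {
          kv := metakvStore{leaderKey: "14bc1e6bfcd881d16b3c717d34b780da"}
          failover(kv, "14bc1e6bfcd881d16b3c717d34b780da")
          fmt.Println("leader after failover:", kv[leaderKey]) // stale UUID remains
      }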

      What happens?

      When a new backup service node is added, it tries to ask the old leader whether it has given up leadership. As that node was failed over, this will most likely fail, for example because the node is unreachable:

      ERROR (Rebalance) Could not confirm old leader stepped down {"err": "exhausted retry count after 5 attempts: could not confirm if old leader '14bc1e6bfcd881d16b3c717d34b780da' had stepped down: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp <ip>: connect: connection refused\""} 

      Any further attempt to rebalance will also fail: /cbbs/leader in metakv has now been set to this second backup service node, but because the first rebalance failed, that node is not in /cbbs/nodes, so we will see:

      ERROR (Rebalance) Could not confirm old leader stepped down {"err": "could not get old leader node address"} 
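      Both errors come out of the same step-down confirmation during rebalance. Below is a hedged Go sketch of that control flow, continuing the hypothetical metakvStore and leaderKey from the sketch above; the function names and the /cbbs/nodes lookup are inferred from the log messages, not taken from the actual cbbs source:

      import (
          "errors"
          "fmt"
      )

      // dialAndConfirm stands in for the gRPC "have you stepped down?" call;
      // against a failed-over node it is refused every time.
      func dialAndConfirm(addr string) error {
          return fmt.Errorf("dial tcp %s: connect: connection refused", addr)
      }

      // confirmOldLeaderSteppedDown sketches the rebalance step behind both
      // errors. nodes maps node UUID -> address, as /cbbs/nodes would.
      func confirmOldLeaderSteppedDown(kv metakvStore, nodes map[string]string) error {
          oldLeader, ok := kv[leaderKey]
          if !ok {
              return nil // no previous leader, nothing to confirm
          }

          // Second failure mode (repro step 9): the UUID is in /cbbs/leader
          // but the node never made it into /cbbs/nodes, so there is no
          // address to dial.
          addr, ok := nodes[oldLeader]
          if !ok {
              return errors.New("could not get old leader node address")
          }

          // First failure mode (repro step 6): the address exists, but the
          // node was failed over, so every attempt is refused until the
          // retries run out.
          var lastErr error
          for attempt := 1; attempt <= 5; attempt++ {
              if lastErr = dialAndConfirm(addr); lastErr == nil {
                  return nil
              }
          }
          return fmt.Errorf("exhausted retry count after 5 attempts: could not confirm if old leader '%s' had stepped down: %w", oldLeader, lastErr)
      }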

      What should happen?

      The design document states:

      If the previous leader node is still part of the cluster and it fails to confirm it has stepped down it will fail the rebalance.

      Actually checking whether the old leader node is still part of the cluster, and skipping the confirmation if it is not, should fix this. I believe it is also a sound change in its own right.
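      A hedged sketch of that fix, continuing the hypothetical names from the sketches above: consult the cluster manager's current node list before insisting on a step-down confirmation.

      // confirmOldLeaderSteppedDownFixed adds the proposed check: if the old
      // leader is no longer part of the cluster (e.g. it was failed over),
      // the stale /cbbs/leader entry is harmless and the step is skipped.
      func confirmOldLeaderSteppedDownFixed(kv metakvStore, nodes map[string]string,
          clusterNodes map[string]bool) error {

          oldLeader, ok := kv[leaderKey]
          if !ok {
              return nil
          }

          // New check: a failed-over node has no leadership left to hand
          // over, so there is nothing to confirm.
          if !clusterNodes[oldLeader] {
              return nil
          }

          addr, ok := nodes[oldLeader]
          if !ok {
              return errors.New("could not get old leader node address")
          }
          return dialAndConfirm(addr) // retry loop as in the previous sketch
      }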

      What's the workaround?

      Deleting /cbbs/leader from metakv and then triggering a rebalance should fix this:

      curl -i http://localhost:8091/diag/eval -d 'metakv:delete(<<"/cbbs/leader">>).' -u <username>:<password> 
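      (/diag/eval evaluates the supplied Erlang expression on the cluster manager, so metakv:delete removes the stale leader entry; the next rebalance then has no old leader to confirm against and can elect a new one.)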

      Full reproduction

      1. VAGRANT_NODES=4 vagrant up
      2. Set up node1 with data
      3. Add node2 with only the backup service & rebalance
      4. sudo systemctl stop couchbase-server on node2
      5. Failover node2
      6. Add node3 & rebalance
        1. This will fail because node2 cannot be reached (i.e. it does find the node's address in metakv)
      7. sudo systemctl stop couchbase-server on node3
      8. Failover node3, which will fail and offer to force a failover. Accept the forced failover
      9. Add node4 with only the backup service & rebalance
        1. This will fail because node3's address cannot be found in metakv


            People

              gilad.kalchheim Gilad Kalchheim
              Matt.Hall Matt Hall
              Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue
