  1. Couchbase Server
  2. MB-45110

[Chronicle] Cluster can potentially get stuck such that failed nodes cannot be removed from the cluster

Details

    • Untriaged
    • Centos 64-bit
    • 1
    • No

    Description

      Steps to Reproduce:
      1. Create a 5-node cluster: .137, .138, .139, .140, .142.
      2. Stop the server on .140 and, once the node becomes unresponsive, fail it over, but don't rebalance it out yet.
      3. Stop the server on .138 and .139.

      At this point it appears that the unresponsive nodes from steps 2 and 3 cannot be removed from the cluster.
      A quorum failover of .138 and .139 is not possible because another failed node, .140, is still in the cluster, so the attempt fails with:

      Unexpected server error: {error,
                                   {aborted,
                                       #{failed_peers =>
                                             ['ns_1@172.23.120.140',
                                              'ns_1@172.23.120.138']}}}
      * Connection #0 to host 172.23.120.137 left intact

      There should be a way to avoid the cluster getting permanently stuck in this state.
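
The arithmetic behind the stuck state can be sketched as follows. This is only an illustration, not Chronicle's actual logic: it assumes a simple strict-majority rule over all five nodes, and it assumes that .140, having been failed over but not rebalanced out, still counts as a cluster member. The function name `has_majority` is hypothetical.

```python
def has_majority(total_nodes: int, reachable_nodes: int) -> bool:
    """Return True if the reachable nodes form a strict majority.

    Assumption: a plain majority rule; the real quorum rules may differ.
    """
    return reachable_nodes > total_nodes // 2


# Node names taken from the reproduction steps.
members = [".137", ".138", ".139", ".140", ".142"]

# .140 was failed over but not rebalanced out (assumed to still be a member);
# .138 and .139 were stopped afterwards.
down = {".140", ".138", ".139"}

reachable = [node for node in members if node not in down]

print(reachable)
print(has_majority(len(members), len(reachable)))
```

Under these assumptions only two of the five members are reachable, which is below the three needed for a majority, so any operation that needs agreement from a majority cannot complete.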

            People

              sumedh.basarkod Sumedh Basarkod (Inactive)