Couchbase Server / MB-45110

[Chronicle] Cluster can get potentially stuck such that we may not be able to remove failed nodes out of the cluster

Details

    • Untriaged
    • Centos 64-bit
    • 1
    • No

    Description

      Steps to Reproduce:
      1. Create a 5 node cluster: .137, .138, .139, .140, .142
      2. Stop the server on .140; once the node becomes unresponsive, fail it over, but don't rebalance it out yet.
      3. Now stop the server on nodes .138 and .139.

      Now it appears that we can't get the unresponsive nodes from steps 2 and 3 out of the cluster.
      We can't quorum failover .138 and .139 because there is another failed node, .140, so attempts to quorum failover fail with:

      Unexpected server error: {error,
                                   {aborted,
                                       #{failed_peers =>
                                             ['ns_1@172.23.120.140',
                                              'ns_1@172.23.120.138']}}}
      * Connection #0 to host 172.23.120.137 left intact

      There should be a way to avoid the cluster getting permanently stuck in this state.
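The arithmetic behind the stuck state can be sketched as follows. This is a minimal model, not ns_server or Chronicle code: it assumes a simple majority rule over the five nodes from step 1 and assumes that .140, which is failed over but not yet rebalanced out, still counts as a voter. The names `majority`, `has_quorum`, `voters`, and `down` are hypothetical and chosen for illustration only.

```python
def majority(n: int) -> int:
    """Smallest number of voters that forms a strict majority of n."""
    return n // 2 + 1


def has_quorum(voters: set[str], unreachable: set[str]) -> bool:
    """True if enough voters are still reachable to agree on a change."""
    reachable = voters - unreachable
    return len(reachable) >= majority(len(voters))


# The five nodes from step 1. ASSUMPTION: .140 is failed over but not
# rebalanced out, so this sketch still counts it as a voter.
voters = {".137", ".138", ".139", ".140", ".142"}
down = {".138", ".139", ".140"}

# Only .137 and .142 are reachable, but a majority of five is three.
stuck = not has_quorum(voters, down)

# Hypothetical model of an unsafe failover: every unreachable node is
# dropped in one step, without asking a majority of the old voter set.
survivors = voters - down
recovered = has_quorum(survivors, down)
```

Under these assumptions, two reachable nodes can never satisfy a majority of five, which matches the aborted failover reported above. Once all three unreachable nodes are dropped together, the two survivors form a majority of the remaining set.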

          Activity

            build-team Couchbase Build Team added a comment:
            Build couchbase-server-7.0.0-4845 contains ns_server commit 8dcc01d with commit message:
            MB-45110: Allow unsafe failover of inactive nodes

            sumedh.basarkod Sumedh Basarkod added a comment:
            Hi Dave Finlay, Abhijeeth Nuthan,
            Do you think we should support this with the CLI as well (to be consistent with REST and the UI)? I tried this with the CLI and it failed with "ERROR: Can't failover a node that isn't in the cluster".

            dfinlay Dave Finlay added a comment (edited):
            Hi Sumedh, yes we should. I'll create a ticket for it: MB-45462

            sumedh.basarkod Sumedh Basarkod added a comment:
            Sure, thanks Dave

            sumedh.basarkod Sumedh Basarkod added a comment:
            Verified on 7.0.0-4860. Closing this.

            People

              sumedh.basarkod Sumedh Basarkod
              sumedh.basarkod Sumedh Basarkod
              Votes: 0
              Watchers: 9
