
[Chronicle] Cluster can get potentially stuck such that we may not be able to remove failed nodes out of the cluster


Details

    • Untriaged
    • Centos 64-bit
    • 1
    • No

    Description

      Steps to Reproduce:
      1. Create a 5-node cluster: .137, .138, .139, .140, .142
      2. Stop the server on .140. Once the node becomes unresponsive, fail it over, but don't rebalance it out yet.
      3. Stop the server on the .138 and .139 nodes.

      At this point we can't get the unresponsive nodes from steps 2 and 3 out of the cluster.
      We can't quorum failover .138 and .139 because there is another failed node, .140, so attempts to quorum failover fail with:

      Unexpected server error: {error,
                                   {aborted,
                                       #{failed_peers =>
                                             ['ns_1@172.23.120.140',
                                              'ns_1@172.23.120.138']}}}
      * Connection #0 to host 172.23.120.137 left intact

      There should be a way to avoid the cluster getting permanently stuck in this state.
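The arithmetic behind the stuck state can be sketched as follows. This is a simplified model, not ns_server or Chronicle code: it assumes a failed-over node that has not been rebalanced out still counts as a peer (consistent with .140 appearing in `failed_peers` above), and the helper names `has_majority` and `unreachable_outside_request` are hypothetical.

```python
def has_majority(peers, reachable):
    """True if the reachable peers are a strict majority of all peers."""
    peers = set(peers)
    return len(peers & set(reachable)) > len(peers) // 2


def unreachable_outside_request(peers, reachable, requested):
    """Peers that are down but not named in the failover request."""
    return set(peers) - set(reachable) - set(requested)


# Assumption: .140 stays in the peer set because it was never rebalanced out.
peers = {".137", ".138", ".139", ".140", ".142"}
after_step_2 = {".137", ".138", ".139", ".142"}  # .140 stopped and failed over
after_step_3 = {".137", ".142"}                  # .138 and .139 stopped as well

print(has_majority(peers, after_step_2))  # True: 4 of 5 peers reachable
print(has_majority(peers, after_step_3))  # False: 2 of 5 peers reachable

# A quorum failover request naming only .138 and .139 leaves .140 unaccounted for.
print(unreachable_outside_request(peers, after_step_3, {".138", ".139"}))  # {'.140'}
```

Under this model the cluster loses its majority at step 3, and the remaining unreachable peer outside the request is the one the ticket identifies as blocking the quorum failover.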


          Activity

            sumedh.basarkod Sumedh Basarkod (Inactive) created issue -
            sumedh.basarkod Sumedh Basarkod (Inactive) made changes -
            Field Original Value New Value
            Attachment Screenshot 2021-03-20 at 7.23.38 AM.png [ 131953 ]
            dfinlay Dave Finlay made changes -
            Assignee Dave Finlay [ dfinlay ] Abhijeeth Nuthan [ abhijeeth.nuthan ]
            Abhijeeth.Nuthan Abhijeeth Nuthan made changes -
            Link This issue blocks MB-45433 [ MB-45433 ]
            Abhijeeth.Nuthan Abhijeeth Nuthan made changes -
            Resolution Fixed [ 1 ]
            Status Open [ 1 ] Resolved [ 5 ]

            build-team Couchbase Build Team added a comment -

            Build couchbase-server-7.0.0-4845 contains ns_server commit 8dcc01d with commit message:
            MB-45110: Allow unsafe failover of inactive nodes
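For illustration, here is a minimal sketch of how a client might build an unsafe failover request that also names the inactive node. Everything specific here is an assumption and should be checked against the REST reference for the build in use: the `/controller/failOver` path, the `otpNode` and `allowUnsafe` parameter names, and port 8091 are not stated in this ticket. The helper `build_unsafe_failover_request` is hypothetical, and the request is only constructed, never sent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_unsafe_failover_request(host, otp_nodes, port=8091):
    """Build (but do not send) a POST request for an unsafe failover.

    Assumed, not taken from the ticket: endpoint path, parameter names, port.
    """
    params = [("otpNode", node) for node in otp_nodes]
    params.append(("allowUnsafe", "true"))
    return Request(
        f"http://{host}:{port}/controller/failOver",
        data=urlencode(params).encode("ascii"),
        method="POST",
    )


# Name the already failed-over node (.140) together with .138 and .139.
req = build_unsafe_failover_request(
    "172.23.120.137",
    ["ns_1@172.23.120.138", "ns_1@172.23.120.139", "ns_1@172.23.120.140"],
)
print(req.get_method(), req.full_url)
print(req.data.decode("ascii"))
```

Authentication is deliberately left out; a real call would need administrator credentials and would be sent to one of the surviving nodes.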

            sumedh.basarkod Sumedh Basarkod (Inactive) added a comment -

            Hi Dave Finlay, Abhijeeth Nuthan,
            Do you think we should support this with the CLI as well, to be consistent with REST and the UI? I tried this with the CLI and it failed with "ERROR: Can't failover a node that isn't in the cluster".
            dfinlay Dave Finlay added a comment - edited

            Hi Sumedh, yes we should. I'll create a ticket for it: MB-45462

            sumedh.basarkod Sumedh Basarkod (Inactive) added a comment -

            Sure, thanks Dave
            dfinlay Dave Finlay made changes -
            Link This issue relates to MB-45462 [ MB-45462 ]

            sumedh.basarkod Sumedh Basarkod (Inactive) added a comment -

            Verified on 7.0.0-4860. Closing this.
            sumedh.basarkod Sumedh Basarkod (Inactive) made changes -
            Assignee Abhijeeth Nuthan [ abhijeeth.nuthan ] Sumedh Basarkod [ sumedh.basarkod ]
            Status Resolved [ 5 ] Closed [ 6 ]
            sumedh.basarkod Sumedh Basarkod (Inactive) made changes -
            Labels affects-cc-testing ns_server affects-cc-testing functional-test ns_server
            lynn.straus Lynn Straus made changes -
            Fix Version/s 7.0.0 [ 17233 ]
            lynn.straus Lynn Straus made changes -
            Fix Version/s Cheshire-Cat [ 15915 ]

            People

              sumedh.basarkod Sumedh Basarkod (Inactive)
              sumedh.basarkod Sumedh Basarkod (Inactive)