Couchbase Server · MB-45110

[Chronicle] Cluster can get stuck such that failed nodes cannot be removed from the cluster


Details

    • Triage: Untriaged
    • Environment: Centos 64-bit

    Description

      Steps to Reproduce:
      1. Create a 5-node cluster: .137, .138, .139, .140, .142.
      2. Stop the server on .140 and, when the node becomes unresponsive, fail it over, but don't rebalance it out yet.

      3. Now stop the server on nodes .138 and .139.

      Now it appears that we can't get the unresponsive nodes from steps 2 and 3 out of the cluster.
      We can't quorum failover .138 and .139 because we have another failed-over node, .140. Attempts to quorum failover fail with:

      Unexpected server error: {error,
                                   {aborted,
                                       #{failed_peers =>
                                             ['ns_1@172.23.120.140',
      * Connection #0 to host 172.23.120.137 left intact
                                              'ns_1@172.23.120.138']}}}

      There should be a way to avoid the cluster getting permanently stuck in this situation.
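      The steps above can be sketched against the cluster's REST API. This is a minimal dry-run sketch, not the exact commands used in the test: the credentials are placeholders, and the `/controller/failOver` endpoint with the `otpNode` parameter and the `allowUnsafe=true` quorum-failover flag are assumptions about the ns_server REST interface on this build.

```shell
#!/bin/sh
# Hypothetical sketch of the reproduction via ns_server's REST API.
# CLUSTER and AUTH are placeholder values; the script only prints the
# curl commands it would issue (dry run) instead of executing them.
CLUSTER="172.23.120.137:8091"
AUTH="Administrator:password"   # assumed credentials

run() { echo "curl -u $AUTH -X POST $*"; }

# Step 2: after stopping the server on .140, hard-fail it over
# (but do not rebalance it out).
run "http://$CLUSTER/controller/failOver" -d "otpNode=ns_1@172.23.120.140"

# Step 3: after also stopping .138 and .139, attempt a quorum failover
# of both. In the report, this is the call that aborts with
# {error,{aborted,#{failed_peers => [...]}}}.
run "http://$CLUSTER/controller/failOver" \
    -d "otpNode=ns_1@172.23.120.138" \
    -d "otpNode=ns_1@172.23.120.139" \
    -d "allowUnsafe=true"
```

      The aborted quorum failover is consistent with the report: the already-failed-over .140 still counts against the peers needed to establish a quorum, so the second failover request is refused.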

      Attachments

        Screenshot 2021-03-20 at 7.23.38 AM.png [ 131953 ]

        Issue Links

          This issue blocks MB-45433
          This issue relates to MB-45462

          Activity

            sumedh.basarkod Sumedh Basarkod (Inactive) created issue -
            sumedh.basarkod Sumedh Basarkod (Inactive) made changes -
            Attachment: added Screenshot 2021-03-20 at 7.23.38 AM.png [ 131953 ]
            dfinlay Dave Finlay made changes -
            Assignee: Dave Finlay [ dfinlay ] → Abhijeeth Nuthan [ abhijeeth.nuthan ]
            Abhijeeth.Nuthan Abhijeeth Nuthan made changes -
            Link This issue blocks MB-45433 [ MB-45433 ]
            Abhijeeth.Nuthan Abhijeeth Nuthan made changes -
            Resolution: Fixed [ 1 ]
            Status: Open [ 1 ] → Resolved [ 5 ]
            Abhijeeth.Nuthan Abhijeeth Nuthan made changes -
            Description: formatting-only edit; description text unchanged
            dfinlay Dave Finlay made changes -
            Link This issue relates to MB-45462 [ MB-45462 ]
            sumedh.basarkod Sumedh Basarkod (Inactive) made changes -
            Assignee: Abhijeeth Nuthan [ abhijeeth.nuthan ] → Sumedh Basarkod [ sumedh.basarkod ]
            Status: Resolved [ 5 ] → Closed [ 6 ]
            sumedh.basarkod Sumedh Basarkod (Inactive) made changes -
            Labels: affects-cc-testing ns_server → affects-cc-testing functional-test ns_server
            lynn.straus Lynn Straus made changes -
            Fix Version/s: 7.0.0 [ 17233 ]
            lynn.straus Lynn Straus made changes -
            Fix Version/s: Cheshire-Cat [ 15915 ]

            People

              Assignee: Sumedh Basarkod (Inactive)
              Reporter: Sumedh Basarkod (Inactive)
              Votes: 0
              Watchers: 9


                Gerrit Reviews

                  There are no open Gerrit changes
