Couchbase Server / MB-43837

[Chronicle] Node doesn't get cleaned up after rebalance-out

Details

    Description

      Summary:
      Adding a node back after it was rebalanced out failed with the error "Prepare join failed. Node is already part of cluster." The likely cause is that the node was not cleaned up after it was rebalanced out.

      Steps to Reproduce and timeline
      The volume test includes a step that runs the following sequence against a single node:
      induce_firewall -> autofailover -> remove_firewall -> rebalance_out -> add_back_the_node_again

      1. Induce a firewall on .233

      2021-01-24 19:20:34,806 | test | INFO | MainThread | [Collections_autofailover:rebalance_after_autofailover:103] Inducing failure firewall on nodes: [ip:172.23.106.233 port:8091 ssh_username:root]
      

       

       

      2. .233 is failed over


      2021-01-24 19:21:40,184 | test | ERROR | pool-1-thread-30 | [rest_client:print_UI_logs:2595] {u'code': 0, u'module': u'failover', u'type': u'info', u'node': u'ns_1@172.23.105.175', u'tstamp': 1611544900094L, u'shortText': u'message', u'serverTime': u'2021-01-24T19:21:40.094Z', u'text': u"Starting failing over ['ns_1@172.23.106.233']"}
      

       

       

      3. Remove the firewall and rebalance out

      2021-01-24 19:27:28,275 | test | INFO | pool-1-thread-17 | [table_view:display:72] Rebalance Overview
      ------------------------------------
      Nodes           Services  Status
      ------------------------------------
      172.23.105.175  kv        Cluster node
      172.23.106.250  kv        Cluster node
      172.23.106.236  kv        Cluster node
      172.23.106.251  kv        Cluster node
      172.23.106.238  kv        Cluster node
      ------------------------------------

      2021-01-24 19:27:43,392 | test | INFO | pool-1-thread-17 | [task:check:322] Rebalance - status: none, progress: 100
      2021-01-24 19:28:13,433 | test | ERROR | pool-1-thread-17 | [task:check:374] Node 172.23.106.233:8091 was not cleaned after removing from cluster

      4. Add back the node

      2021-01-24 19:28:13,529 | test  | INFO    | pool-1-thread-1 | [rest_client:print_UI_logs:2593] Latest logs from UI on 172.23.105.175:
      2021-01-24 19:28:13,529 | test  | ERROR   | pool-1-thread-1 | [rest_client:print_UI_logs:2595] {u'code': 5, u'module': u'ns_cluster', u'type': u'info', u'node': u'ns_1@172.23.105.175', u'tstamp': 1611545293485L, u'shortText': u'message', u'serverTime': u'2021-01-24T19:28:13.485Z', u'text': u'Failed to add node 172.23.106.233:8091 to cluster. Prepare join failed. Node is already part of cluster.'}
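To line up the events in the steps above, the test log lines can be split into fields. This is a minimal sketch that assumes every test log line follows the `timestamp | logger | level | thread | [module:function:line] message` layout seen in this ticket. The names `LogEntry` and `parse_test_log_line` are hypothetical and not part of the test framework.

```python
import re
from datetime import datetime
from typing import NamedTuple, Optional


class LogEntry(NamedTuple):
    timestamp: datetime
    level: str
    thread: str
    source: str   # "module:function:line", as printed inside the brackets
    message: str


# Matches "[module:function:line] message" at the start of the last field.
_SOURCE_RE = re.compile(r"\[([^\]]+)\]\s*(.*)", re.DOTALL)


def parse_test_log_line(line: str) -> Optional[LogEntry]:
    """Parse one test log line; return None for lines in any other layout."""
    # Split on the first four pipes only, so pipes in the message survive.
    parts = [part.strip() for part in line.strip().split("|", 4)]
    if len(parts) != 5:
        return None
    stamp, _logger, level, thread, rest = parts
    try:
        timestamp = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S,%f")
    except ValueError:
        return None
    match = _SOURCE_RE.match(rest)
    if match is None:
        return None
    return LogEntry(timestamp, level, thread, match.group(1), match.group(2))


if __name__ == "__main__":
    done = parse_test_log_line(
        "2021-01-24 19:27:43,392 | test | INFO | pool-1-thread-17 | "
        "[task:check:322] Rebalance - status: none, progress: 100"
    )
    error = parse_test_log_line(
        "2021-01-24 19:28:13,433 | test | ERROR | pool-1-thread-17 | "
        "[task:check:374] Node 172.23.106.233:8091 was not cleaned after "
        "removing from cluster"
    )
    # Interval between rebalance completion and the cleanup error.
    print(error.timestamp - done.timestamp)
```

Applied to the two `task:check` lines from step 3, the difference between the timestamps is about 30 seconds.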

      Observations
      The UI on .175 shows that .233 is not part of the cluster.
      The UI on .233, however, shows .233 as already part of a cluster.
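The state that the UI on .233 reports could also be checked through the node's REST API. The sketch below rests on an assumption that does not come from this ticket: that `GET /pools` on port 8091 returns a JSON object whose `pools` list is empty on a node that belongs to no cluster. The helper names are hypothetical, and a deployment may require credentials that are not handled here.

```python
import json
from urllib.request import urlopen


def is_node_cleaned(pools_info: dict) -> bool:
    """Return True if the /pools payload looks like that of a node in no cluster.

    Assumption: such a node reports an empty "pools" list. A missing key is
    treated as "not cleaned" so that an unexpected payload is not a false pass.
    """
    return pools_info.get("pools") == []


def fetch_pools_info(host: str, port: int = 8091, timeout: float = 5.0) -> dict:
    """Fetch and decode /pools from a node (unauthenticated; an assumption)."""
    with urlopen(f"http://{host}:{port}/pools", timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    # Address of the node that was rebalanced out in this ticket.
    print(is_node_cleaned(fetch_pools_info("172.23.106.233")))
```

Keeping the decision in a pure function separate from the HTTP call means the check can be exercised against saved payloads without a live node.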

      From ns_server.debug.log on .175:

      [ns_server:warn,2021-01-24T19:28:13.720-08:00,ns_1@172.23.105.175:mb_master<0.3211.0>:mb_master:master:493]Master got candidate heartbeat from node 'ns_1@172.23.106.233' which is not in peers ['ns_1@172.23.105.175',
                                                                                            'ns_1@172.23.106.236',
                                                                                            'ns_1@172.23.106.238',
                                                                                            'ns_1@172.23.106.250',
                                                                                            'ns_1@172.23.106.251']
      

      [ns_server:debug,2021-01-24T19:28:13.721-08:00,ns_1@172.23.105.175:ns_server_monitor<0.719.0>:health_monitor:handle_cast:82]Ignoring heartbeat from an unknown node 'ns_1@172.23.106.233'
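To see how often a removed node keeps announcing itself, the two message shapes above can be counted per node. This is a rough sketch that matches only the exact wording of the two log lines quoted in this ticket; other ns_server versions may word these messages differently, and the function name is hypothetical.

```python
import re
from collections import Counter
from typing import Iterable

# The two message shapes quoted in this ticket, one pattern each.
_STRAY_HEARTBEAT_PATTERNS = (
    re.compile(r"candidate heartbeat from node '([^']+)' which is not in peers"),
    re.compile(r"Ignoring heartbeat from an unknown node '([^']+)'"),
)


def count_stray_heartbeats(lines: Iterable[str]) -> Counter:
    """Count heartbeat messages from nodes the receiver does not recognize."""
    counts: Counter = Counter()
    for line in lines:
        for pattern in _STRAY_HEARTBEAT_PATTERNS:
            match = pattern.search(line)
            if match:
                counts[match.group(1)] += 1
                break  # count each line at most once
    return counts


if __name__ == "__main__":
    with open("ns_server.debug.log", encoding="utf-8", errors="replace") as log:
        for node, hits in count_stray_heartbeats(log).most_common():
            print(node, hits)
```

A node that shows up here after a rebalance-out is still running as a cluster member from its own point of view, which matches the observation on .233.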
      

       

          Activity

            dfinlay Dave Finlay added a comment - This may be addressed when http://review.couchbase.org/c/ns_server/+/143449 and related patches get merged.

            artem Artem Stemkovski added a comment - Duplicate of MB-43899

            raju Raju Suravarjjala added a comment - Closing all invalid, duplicate and won't fix issues

            People

              artem Artem Stemkovski
              sumedh.basarkod Sumedh Basarkod
              Votes: 0
              Watchers: 6
