Couchbase Server
MB-45232

Failed to add node back into cluster after node_failover and add back case


Details

    Description

      Build: 7.0.0-4735

      Scenario:

      • 4 node cluster, 2 Couchbase buckets

        +---------------+----------+-------+------------+------------+------------------------+-------------------+
        | Node          | Services |  CPU  | Mem_total  | Mem_free   | Swap_mem_used          | Active / Replica  |
        +---------------+----------+-------+------------+------------+------------------------+-------------------+
        | 172.23.100.13 | kv       | 58.53 | 4201684992 | 3464732672 | 91488256 / 3758092288  | 3781 / 7529       |
        | 172.23.100.12 | kv       | 58.72 | 4201689088 | 3418853376 | 110215168 / 3758092288 | 3781 / 7800       |
        | 172.23.100.18 | kv       | 59.69 | 4201684992 | 3461459968 | 102236160 / 3758092288 | 3783 / 7491       |
        | 172.23.100.19 | kv       | 59.50 | 4201684992 | 3445219328 | 115343360 / 3758092288 | 3785 / 7438       |
        +---------------+----------+-------+------------+------------+------------------------+-------------------+

        +--------------+-----------+----------+------------+-----+-------+------------+-----------+-----------+
        | Bucket       | Type      | Replicas | Durability | TTL | Items | RAM Quota  | RAM Used  | Disk Used |
        +--------------+-----------+----------+------------+-----+-------+------------+-----------+-----------+
        | couchstore.0 | couchbase | 2        | none       | 0   | 7235  | 2986344448 | 98909336  | 424106110 |
        | couchstore.1 | couchbase | 2        | none       | 0   | 7895  | 2986344448 | 101506488 | 513393138 |
        +--------------+-----------+----------+------------+-----+-------+------------+-----------+-----------+
        

      • Load the initial 10K docs into both buckets through transactions
      • Failover node 172.23.100.12 (Id 02b2ab7412476b02141e1cff571d37ad, success)
      • Failover node 172.23.100.18 (Id 6e28ce7a0762769cd03679e696ebd2de, success)
      • Eject .12 and .18 using a rebalance-out operation (Id b97763b901d7c36e43737d51d7533a7f)
      • Try adding back the .12 node (see the REST sketch after this list)
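
      A minimal sketch of the REST calls this flow boils down to. The endpoints are the standard ns_server REST API; the credentials, hosts and helper names are illustrative assumptions, not taken from the test:

        # Sketch of the failover / rebalance-out / add-back flow via the
        # standard ns_server REST API. Credentials and hosts are placeholders.
        import requests

        AUTH = ("Administrator", "password")      # assumed credentials
        ORCH = "http://172.23.100.13:8091"        # any surviving cluster node

        def failover(otp_node):
            # Hard failover of a node, e.g. otp_node = "ns_1@172.23.100.12"
            requests.post(ORCH + "/controller/failOver",
                          data={"otpNode": otp_node},
                          auth=AUTH).raise_for_status()

        def rebalance_out(known_nodes, ejected_nodes):
            # Rebalance with the failed-over nodes listed for ejection
            requests.post(ORCH + "/controller/rebalance",
                          data={"knownNodes": ",".join(known_nodes),
                                "ejectedNodes": ",".join(ejected_nodes)},
                          auth=AUTH).raise_for_status()

        def add_back(hostname):
            # Re-add the ejected node; this is the call that failed here
            requests.post(ORCH + "/controller/addNode",
                          data={"hostname": hostname, "user": AUTH[0],
                                "password": AUTH[1], "services": "kv"},
                          auth=AUTH).raise_for_status()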

      Observation:

      Adding back node 172.23.100.12 failed with the following reason:

      Failed to connect to http://172.23.100.12:8091. Could not connect to "172.23.100.12" on port 8091. This could be due to an incorrect host/port combination or a firewall in place between the servers.

      172.23.100.13 - ns_server.info.log

      [user:info,2021-03-24T07:03:33.272-07:00,ns_1@172.23.100.13:<0.29517.0>:ns_cluster:add_node_to_group:89]Failed to add node 172.23.100.12:8091 to cluster. Failed to connect to http://172.23.100.12:8091. Could not connect to "172.23.100.12" on port 8091.  This could be due to an incorrect host/port combination or a firewall in place between the servers.

      172.23.100.12 - ns_server.info.log

      [ns_server:info,2021-03-24T07:03:33.022-07:00,ns_1@172.23.100.12:ns_couchdb_port<0.22809.1>:ns_port_server:log:224]ns_couchdb<0.22809.1>: 2021-03-24 07:03:32.818687
      ns_couchdb<0.22809.1>:     args: ['ns_1@172.23.100.12']
      ns_couchdb<0.22809.1>:     format: "** Connection attempt from disallowed node ~w ** ~n"
      ns_couchdb<0.22809.1>:     label: {error_logger,error_msg}
      [ns_server:warn,2021-03-24T07:03:33.023-07:00,ns_1@172.23.100.12:memcached_refresh<0.22736.1>:ns_memcached:connect:1207]Unable to connect: {error,{badmatch,[{inet,{error,econnrefused}}]}}.
      [ns_server:info,2021-03-24T07:03:33.228-07:00,ns_1@172.23.100.12:ns_couchdb_port<0.22809.1>:ns_port_server:log:224]ns_couchdb<0.22809.1>: 2021-03-24 07:03:33.027905
      ns_couchdb<0.22809.1>:     args: ['ns_1@172.23.100.12']
      ns_couchdb<0.22809.1>:     format: "** Connection attempt from disallowed node ~w ** ~n"
      ns_couchdb<0.22809.1>:     label: {error_logger,error_msg}
      

      TAF test:

      rebalance_new.swaprebalancetests.SwapRebalanceFailedTests:
          test_add_back_failed_node,doc_size=512,standard_buckets=2,transaction_timeout=150,swap-orchestrator=True,num-swap=2,atomicity=True,replicas=2,durability=PERSIST_TO_MAJORITY,nodes_init=4,num_items=10000

      Attachments

        Issue Links


          Activity

            meni.hillel Meni Hillel (Inactive) added a comment - This just happens to be one concrete use case related to MB-43381. So probably a DUP, but since it contains logs and info, we'll leave it open to be verified once the other ticket is addressed.
            meni.hillel Meni Hillel (Inactive) added a comment - DUP to MB-45232

            meni.hillel Meni Hillel (Inactive) added a comment - Correction: DUP to MB-45016 due to the similarities: we've actually removed the orchestrator node and Chronicle does not detect that until a bit later.

            Specific note from Dave on that ticket: "The asynchronous realization on the part of the node being removed that it has actually been removed is by design. It helps us avoid races where the node being removed may be responsible for removing itself from the chronicle config on the cluster from which it's departing. However, perhaps we can do better than 20s."
            dfinlay Dave Finlay added a comment -

            Thanks Meni - yes, this is a race. The test attempts to re-add the node before it's ready, due to the delay in the node recognizing that it's been removed from the cluster. The following is from an email exchange between Meni and me detailing what happened in this ticket:

            The orchestrator is leaving the cluster: .12 is removing itself along with .18.

            2021-03-24T07:03:07.646-07:00, ns_orchestrator:0:info:message(ns_1@172.23.100.12) - Starting rebalance, KeepNodes = ['ns_1@172.23.100.13','ns_1@172.23.100.19'], EjectNodes = [], Failed over and being ejected nodes = ['ns_1@172.23.100.12','ns_1@172.23.100.18']; no delta recovery nodes; Operation Id = b97763b901d7c36e43737d51d7533a7f
            ...
            2021-03-24T07:03:19.662-07:00, ns_orchestrator:0:info:message(ns_1@172.23.100.12) - Rebalance completed successfully.
            Rebalance Operation Id = b97763b901d7c36e43737d51d7533a7f
            

            However .12 doesn’t actually leave (i.e. clean itself up) until 13 seconds after the rebalance:

            2021-03-24T07:03:31.879-07:00, ns_cluster:1:info:message(ns_1@172.23.100.12) - Node 'ns_1@172.23.100.12' is leaving cluster.
            2021-03-24T07:03:31.993-07:00, ns_node_disco:5:warning:node down(ns_1@172.23.100.13) - Node 'ns_1@172.23.100.13' saw that node 'ns_1@172.23.100.12' went down. Details: [{nodedown_reason, connection_closed}]
            2021-03-24T07:03:31.995-07:00, ns_node_disco:5:warning:node down(ns_1@172.23.100.19) - Node 'ns_1@172.23.100.19' saw that node 'ns_1@172.23.100.12' went down. Details: [{nodedown_reason, connection_closed}]
            

            During these 13 seconds there's quite a lot of chaos in the ns_server logs as ns_config configuration starts to disappear.

            Then finally the wipe happens: the node removes itself, cleans itself up and starts the web server:

            [ns_server:info,2021-03-24T07:03:34.300-07:00,ns_1@172.23.100.12:<0.22943.1>:menelaus_web:maybe_start_http_server:115]Started web service with [{ip,"0.0.0.0"},{port,8091}]
            

            Meanwhile, the node addition was attempted 1 second before the web server came up:

            [user:info,2021-03-24T07:03:33.272-07:00,ns_1@172.23.100.13:<0.29517.0>:ns_cluster:add_node_to_group:89]Failed to add node 172.23.100.12:8091 to cluster. Failed to connect to http://172.23.100.12:8091. Could not connect to "172.23.100.12" on port 8091.  This could be due to an incorrect host/port combination or a firewall in place between the servers.
            

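            As a test-side mitigation, one could wait until the ejected node's web server is reachable again before retrying the add-back. A minimal sketch, assuming the standard /pools endpoint and an illustrative timeout (not taken from the ticket or the test):

            # Sketch: poll the ejected node's REST port (8091) until the web
            # server answers, then proceed with the add-back. The timeout value
            # and helper name are illustrative assumptions.
            import time
            import requests

            def wait_for_rest(host, port=8091, timeout_s=120):
                url = "http://%s:%d/pools" % (host, port)
                deadline = time.time() + timeout_s
                while time.time() < deadline:
                    try:
                        # Either 200 or 401 means the web server is up and listening
                        if requests.get(url, timeout=5).status_code in (200, 401):
                            return True
                    except requests.RequestException:
                        pass
                    time.sleep(2)
                return False

            # Usage before the add-back attempt:
            # assert wait_for_rest("172.23.100.12"), "node web server never came back"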

            ashwin.govindarajulu Ashwin Govindarajulu added a comment - Closing since this is a duplicate.

            People

              ashwin.govindarajulu Ashwin Govindarajulu