Couchbase Server
MB-58749

Orchestrator selection goes wrong during failover in a network restart scenario


Details

    • Untriaged
    • Linux x86_64
    • 0
    • Yes

    Description

      Steps:

      • 3 node KV cluster

        +----------------+---------+--------+-----------+----------+------------------------+
        | Nodes          | Zone    | CPU    | Mem_total | Mem_free | Swap_mem_used          |
        +----------------+---------+--------+-----------+----------+------------------------+
        | 172.23.97.199  | Group 1 | 0.4483 | 3.81 GiB  | 3.01 GiB | 13.25 MiB / 976.00 MiB |
        | 172.23.97.200  | Group 1 | 3.6319 | 3.81 GiB  | 2.96 GiB | 21.13 MiB / 976.00 MiB |
        | 172.23.121.117 | Group 1 | 0.4005 | 3.81 GiB  | 3.03 GiB | 13.50 MiB / 976.00 MiB |
        +----------------+---------+--------+-----------+----------+------------------------+
        

      • magma bucket with replicas=2

        +---------+-----------+---------+----------+------------+-------+----------+-----------+------------+
        | Bucket  | Type      | Storage | Replicas | Durability | Items | Vbuckets | RAM Quota | RAM Used   |
        +---------+-----------+---------+----------+------------+-------+----------+-----------+------------+
        | default | couchbase | magma   | 2        | none       | 37366 | 1024     | 8.84 GiB  | 449.49 MiB |
        +---------+-----------+---------+----------+------------+-------+----------+-----------+------------+
        

      • Set auto-failover timeout=5 (see the REST sketch after these steps)
      • Induce failure by running the 'restart_network' action on '172.23.97.199'

        Command used:
        service network stop && sleep {} && service network start
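
      For reference, the bucket and auto-failover settings above can be applied with plain REST calls roughly as below. The host, credentials, and per-node ramQuota are placeholders, not values taken from this run:

        # Placeholder host/credentials; adjust for the cluster under test.
        CB=http://172.23.97.200:8091
        AUTH='-u Administrator:password'

        # Create the magma bucket with 2 replicas (ramQuota is per node, in MB).
        curl -s $AUTH -X POST $CB/pools/default/buckets \
          -d name=default \
          -d bucketType=couchbase \
          -d storageBackend=magma \
          -d replicaNumber=2 \
          -d ramQuota=3000

        # Enable auto-failover with a 5 second timeout.
        curl -s $AUTH -X POST $CB/settings/autoFailover \
          -d enabled=true \
          -d timeout=5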

      Observation:

      Failover of node '172.23.97.199' was initiated by the current master node (172.23.97.200)

      Post failover, the orchestrator is reported as ".199", i.e. the very node that was just failed over.
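
      One way to double-check which node ns_server currently reports as the orchestrator (the endpoint and diag/eval expression below are assumptions based on recent 7.x builds, not taken from this run):

        # Orchestrator as reported over REST (recent builds include it in terseClusterInfo).
        curl -s -u Administrator:password http://172.23.97.200:8091/pools/default/terseClusterInfo

        # Or ask ns_server directly; mb_master:master_node/0 is assumed to exist on this build.
        curl -s -u Administrator:password http://172.23.97.200:8091/diag/eval \
          -d 'mb_master:master_node().'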

       

      From .200 node's logs:

      [ns_server:debug,2023-09-20T01:54:00.662-07:00,ns_1@172.23.97.200:<0.853.0>:auto_failover:log_down_nodes_reason:403]Node 'ns_1@172.23.97.199' is considered down. Reason:"The cluster manager did not respond. "
      ...
      [ns_server:debug,2023-09-20T01:54:01.664-07:00,ns_1@172.23.97.200:<0.853.0>:auto_failover:log_down_nodes_reason:403]Node 'ns_1@172.23.97.199' is considered down. Reason:"All monitors report node is unhealthy."
      ...
      [ns_server:debug,2023-09-20T01:54:05.670-07:00,ns_1@172.23.97.200:<0.851.0>:failover:start:44]Starting failover with Nodes = ['ns_1@172.23.97.199'],
          Options = #{allow_unsafe => false,
                      auto => true,
                      down_nodes => ['ns_1@172.23.97.199'],
                      failover_reasons =>
                      [{'ns_1@172.23.97.199', "All monitors report node is unhealthy."}]}
      ...
      [user:info,2023-09-20T01:54:08.232-07:00,ns_1@172.23.97.200:<0.22237.0>:failover:deactivate_nodes:225]Deactivating failed over nodes ['ns_1@172.23.97.199']
      [ns_server:debug,2023-09-20T01:54:08.233-07:00,ns_1@172.23.97.200:<0.22237.0>:chronicle_master:call:71]Calling chronicle_master with {deactivate_nodes,['ns_1@172.23.97.199']}
      [ns_server:debug,2023-09-20T01:54:08.259-07:00,ns_1@172.23.97.200:<0.770.0>:chronicle_master:handle_oper:342]Starting kv operation {deactivate_nodes,['ns_1@172.23.97.199']} with lock <<"210e2cf5f3a001080e5904f6ed8b9eb0">>
      

      TAF test:

      failover.AutoFailoverTests.AutoFailoverTests:
           test_autofailover,timeout=5,num_node_failures=1,nodes_init=4,failover_action=restart_network,nodes_init=3,replicas=2,durability=MAJORITY_AND_PERSIST_TO_ACTIVE,num_items=50000 
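
      A rough manual equivalent of the post-failover check (a hypothetical sketch, not the TAF test's actual assertion; the host, credentials, "count" field, and diag/eval expression are assumptions):

        #!/bin/sh
        CB=http://172.23.97.200:8091
        AUTH='-u Administrator:password'
        FAILED_NODE=ns_1@172.23.97.199

        # Wait for auto-failover to fire: "count" in /settings/autoFailover goes from 0 to a positive value.
        until curl -s $AUTH $CB/settings/autoFailover | grep -q '"count":[1-9]'; do
          sleep 1
        done

        # The reported orchestrator should not be the node that was just failed over.
        ORCH=$(curl -s $AUTH $CB/diag/eval -d 'mb_master:master_node().')
        echo "orchestrator reported as: $ORCH"
        case "$ORCH" in
          *"$FAILED_NODE"*) echo "BUG: failed-over node still reported as orchestrator" ;;
        esac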
