Details
- Type: Bug
- Resolution: Not a Bug
- Priority: Critical
- Affects Version: 7.6.0
- Build: enterprise :: 7.6.0-1525 linux_amd64
- Triage: Untriaged
- Operating System: Linux x86_64
- 0
- Is this a Regression?: Yes
Description
Steps:
- 3 node KV cluster
+----------------+---------+--------+-----------+----------+------------------------+
| Nodes | Zone | CPU | Mem_total | Mem_free | Swap_mem_used |
+----------------+---------+--------+-----------+----------+------------------------+
| 172.23.97.199 | Group 1 | 0.4483 | 3.81 GiB | 3.01 GiB | 13.25 MiB / 976.00 MiB |
| 172.23.97.200 | Group 1 | 3.6319 | 3.81 GiB | 2.96 GiB | 21.13 MiB / 976.00 MiB |
| 172.23.121.117 | Group 1 | 0.4005 | 3.81 GiB | 3.03 GiB | 13.50 MiB / 976.00 MiB |
+----------------+---------+--------+-----------+----------+------------------------+
- magma bucket with replicas=2
+---------+-----------+---------+----------+------------+-------+----------+-----------+------------+
| Bucket | Type | Storage | Replicas | Durability | Items | Vbuckets | RAM Quota | RAM Used |
+---------+-----------+---------+----------+------------+-------+----------+-----------+------------+
| default | couchbase | magma | 2 | none | 37366 | 1024 | 8.84 GiB | 449.49 MiB |
+---------+-----------+---------+----------+------------+-------+----------+-----------+------------+
- Set auto-failover timeout=5 (see the REST sketch after these steps)
- Induce a failure on '172.23.97.199' with the 'restart_network' action
Command used:
service network stop && sleep {} && service network start
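For reference, the bucket and auto-failover settings above can be reproduced through Couchbase Server's REST API (/pools/default/buckets and /settings/autoFailover). This is an illustrative sketch only; the host, credentials and per-node RAM quota are assumptions, not values taken from the test run:

import requests

BASE = "http://172.23.97.200:8091"    # any cluster node
AUTH = ("Administrator", "password")  # assumed credentials

# Magma bucket with 2 replicas, matching the bucket table above.
requests.post(f"{BASE}/pools/default/buckets", auth=AUTH, data={
    "name": "default",
    "bucketType": "couchbase",
    "storageBackend": "magma",
    "replicaNumber": 2,
    "ramQuotaMB": 3017,               # assumed per-node quota (~8.84 GiB / 3 nodes)
}).raise_for_status()

# Auto-failover with a 5 second timeout (step "Set auto-failover timeout=5").
requests.post(f"{BASE}/settings/autoFailover", auth=AUTH,
              data={"enabled": "true", "timeout": 5}).raise_for_status()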
Observation:
Failover of node '172.23.97.199' was initiated by the current master node (172.23.97.200)
Post failover, the orchestrator is reported as ".199" (the very node that was failed over)
From the .200 node's logs:
[ns_server:debug,2023-09-20T01:54:00.662-07:00,ns_1@172.23.97.200:<0.853.0>:auto_failover:log_down_nodes_reason:403]Node 'ns_1@172.23.97.199' is considered down. Reason:"The cluster manager did not respond. "
...
[ns_server:debug,2023-09-20T01:54:01.664-07:00,ns_1@172.23.97.200:<0.853.0>:auto_failover:log_down_nodes_reason:403]Node 'ns_1@172.23.97.199' is considered down. Reason:"All monitors report node is unhealthy."
...
[ns_server:debug,2023-09-20T01:54:05.670-07:00,ns_1@172.23.97.200:<0.851.0>:failover:start:44]Starting failover with Nodes = ['ns_1@172.23.97.199'],
Options = #{allow_unsafe => false,
            auto => true,
            down_nodes => ['ns_1@172.23.97.199'],
            failover_reasons =>
                [{'ns_1@172.23.97.199', "All monitors report node is unhealthy."}]}
...
[user:info,2023-09-20T01:54:08.232-07:00,ns_1@172.23.97.200:<0.22237.0>:failover:deactivate_nodes:225]Deactivating failed over nodes ['ns_1@172.23.97.199']
[ns_server:debug,2023-09-20T01:54:08.233-07:00,ns_1@172.23.97.200:<0.22237.0>:chronicle_master:call:71]Calling chronicle_master with {deactivate_nodes,['ns_1@172.23.97.199']}
[ns_server:debug,2023-09-20T01:54:08.259-07:00,ns_1@172.23.97.200:<0.770.0>:chronicle_master:handle_oper:342]Starting kv operation {deactivate_nodes,['ns_1@172.23.97.199']} with lock <<"210e2cf5f3a001080e5904f6ed8b9eb0">>
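Which node ns_server currently reports as the orchestrator can also be queried directly. A minimal sketch, assuming admin credentials and that the /diag/eval endpoint is available; the Erlang expression asking for the node of the registered ns_orchestrator process is a common diagnostic, not something taken from this ticket:

import requests

# Sketch only: ask ns_server which node currently hosts the orchestrator process.
resp = requests.post("http://172.23.97.200:8091/diag/eval",
                     auth=("Administrator", "password"),  # assumed credentials
                     data="node(global:whereis_name(ns_orchestrator)).")
print(resp.text)  # expected to name the surviving orchestrator, not the failed-over .199 node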
TAF test:
failover.AutoFailoverTests.AutoFailoverTests:
test_autofailover,timeout=5,num_node_failures=1,nodes_init=4,failover_action=restart_network,nodes_init=3,replicas=2,durability=MAJORITY_AND_PERSIST_TO_ACTIVE,num_items=50000