Details
- Bug
- Resolution: Duplicate
- Major
- None
- 7.6.2
- Enterprise Edition 7.6.2 build 3674
- Untriaged
- 0
- Unknown
Description
Steps
1. Create a 6-node cluster:
172.23.104.235, 172.23.104.241, 172.23.104.250, 172.23.136.103, 172.23.136.104, 172.23.96.197
2. Enable auto-failover with the following settings:
timeout: 90, max events: 5
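For reference, step 2 can be driven over the cluster REST API; a minimal sketch, assuming the standard /settings/autoFailover endpoint, with placeholder host and credentials (the curl itself is left commented out since it needs a live cluster):

```shell
# Hedged sketch of step 2: enable auto-failover via the REST API.
# HOST and CREDS are placeholders for the actual cluster under test.
HOST="http://localhost:8091"
CREDS="Administrator:password"
PAYLOAD="enabled=true&timeout=90&maxCount=5"   # 90 s timeout, up to 5 events
# Against a live cluster this would be:
#   curl -u "$CREDS" -X POST "$HOST/settings/autoFailover" -d "$PAYLOAD"
echo "POST $HOST/settings/autoFailover $PAYLOAD"
```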
3. Inject a delay into the auto-failover path on .197:
curl -k https://Administrator:password@localhost:18091/diag/eval -X POST -d 'testconditions:set(failover_end, {delay, 120000}).'
4. Bring down a data node to trigger auto-failover of .241:
[ns_server:info,2024-06-09T22:39:53.095-07:00,ns_1@172.23.96.197:<0.1536.0>:ns_orchestrator:handle_event:670]Skipping janitor in state rebalancing
[user:info,2024-06-09T22:39:56.544-07:00,ns_1@172.23.96.197:<0.1536.0>:ns_orchestrator:log_rebalance_completion:1661]Failover completed successfully.
Rebalance Operation Id = 80a2b06c3374f8c4b7593f1629673562
[ns_server:info,2024-06-09T22:39:56.602-07:00,ns_1@172.23.96.197:leader_registry<0.887.0>:leader_registry:handle_down:286]Process <0.17680.310> registered as 'ns_rebalance_observer' terminated.
[user:info,2024-06-09T22:39:56.603-07:00,ns_1@172.23.96.197:<0.1538.0>:auto_failover:log_failover_success:662]Node ('ns_1@172.23.104.241') was automatically failed over. Reason: All monitors report node is unhealthy.
[user:info,2024-06-09T22:39:56.969-07:00,ns_1@172.23.96.197:<0.25807.310>:failover:orchestrate:172]Starting failing over ['ns_1@172.23.136.103']
5. Bring down another node, 172.23.136.103, while the failover is being triggered, then quickly bring it back up within the 90-second timeout.
[chronicle:info,2024-06-09T22:38:01.488-07:00,ns_1@172.23.96.197:chronicle_proposer<0.10208.162>:chronicle_proposer:handle_down:1142]Observed agent {chronicle_agent,'ns_1@172.23.136.103'} on peer 'ns_1@172.23.136.103' go down with reason noconnection
[user:warn,2024-06-09T22:38:01.488-07:00,ns_1@172.23.96.197:ns_node_disco<0.598.0>:ns_node_disco:handle_info:169]Node 'ns_1@172.23.96.197' saw that node 'ns_1@172.23.136.103' went down. Details: [{nodedown_reason, connection_closed}]
[chronicle:info,2024-06-09T22:38:01.489-07:00,ns_1@172.23.96.197:chronicle_proposer<0.10208.162>:chronicle_proposer:handle_nodedown:1135]Peer 'ns_1@172.23.136.103' went down: [{nodedown_reason,connection_closed}]
[ns_server:info,2024-06-09T22:38:01.489-07:00,ns_1@172.23.96.197:ns_node_disco_events<0.596.0>:ns_node_disco_log:handle_event:40]ns_node_disco_log: nodes changed: ['ns_1@172.23.104.235', 'ns_1@172.23.104.250', 'ns_1@172.23.136.104','ns_1@172.23.96.197']
[ns_server:info,2024-06-09T22:38:03.073-07:00,ns_1@172.23.96.197:<0.1536.0>:ns_orchestrator:handle_event:670]Skipping janitor in state rebalancing
Logs showing the node came back up:
[user:info,2024-06-09T22:38:15.470-07:00,ns_1@172.23.96.197:ns_node_disco<0.598.0>:ns_node_disco:handle_info:163]Node 'ns_1@172.23.96.197' saw that node 'ns_1@172.23.136.103' came up. Tags: []
[chronicle:info,2024-06-09T22:38:15.470-07:00,ns_1@172.23.96.197:chronicle_proposer<0.10208.162>:chronicle_proposer:handle_nodeup:1093]Peer 'ns_1@172.23.136.103' came up
[ns_server:info,2024-06-09T22:38:15.471-07:00,ns_1@172.23.96.197:ns_node_disco_events<0.596.0>:ns_node_disco_log:handle_event:40]ns_node_disco_log: nodes changed: ['ns_1@172.23.104.235', 'ns_1@172.23.104.250', 'ns_1@172.23.136.103', 'ns_1@172.23.136.104','ns_1@172.23.96.197']
[ns_server:info,2024-06-09T22:38:15.471-07:00,ns_1@172.23.96.197:ns_config_rep<0.615.0>:ns_config_rep:handle_info:258]Replicating config to/from: ['ns_1@172.23.136.103']
However, 172.23.136.103 is eventually auto-failed over along with .241 anyway:
[user:info,2024-06-09T22:39:57.450-07:00,ns_1@172.23.96.197:<0.25807.310>:failover:orchestrate:184]Failed over ['ns_1@172.23.136.103']: ok
[ns_server:info,2024-06-09T22:39:57.451-07:00,ns_1@172.23.96.197:leader_quorum_nodes_manager<0.1464.0>:leader_quorum_nodes_manager:handle_set_quorum_nodes:121]Updating quorum nodes.
Old quorum nodes: ['ns_1@172.23.104.250','ns_1@172.23.136.103', 'ns_1@172.23.136.104','ns_1@172.23.104.235', 'ns_1@172.23.96.197']
New quorum nodes: ['ns_1@172.23.104.250','ns_1@172.23.136.104', 'ns_1@172.23.104.235','ns_1@172.23.96.197']
[user:info,2024-06-09T22:39:57.463-07:00,ns_1@172.23.96.197:<0.25807.310>:failover:deactivate_nodes:241]Deactivating failed over nodes ['ns_1@172.23.136.103']
[user:info,2024-06-09T22:39:57.601-07:00,ns_1@172.23.96.197:<0.1536.0>:ns_orchestrator:log_rebalance_completion:1661]Failover completed successfully.
Rebalance Operation Id = e4e1005a5888374306dd19fa147f8051
Similar behaviour is seen when the subsequent node belongs to a different service.