Couchbase Server / MB-62236

Subsequent node fails over even if it comes back up before timeout


Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: 7.6.2
    • Component/s: ns_server
    • Environment: Enterprise Edition 7.6.2 build 3674

    Description

      Steps:

      1. Create a 6-node cluster:

      172.23.104.235, 172.23.104.241, 172.23.104.250, 172.23.136.103, 172.23.136.104, 172.23.96.197
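
      For reference, a minimal couchbase-cli sketch for building such a cluster (credentials, service list, and RAM quota are placeholders, not taken from this report):

      couchbase-cli cluster-init -c 172.23.104.235 --cluster-username Administrator \
        --cluster-password password --services data --cluster-ramsize 1024
      couchbase-cli server-add -c 172.23.104.235 -u Administrator -p password \
        --server-add 172.23.104.241 --server-add-username Administrator \
        --server-add-password password --services data   # repeat for the remaining nodes
      couchbase-cli rebalance -c 172.23.104.235 -u Administrator -p password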

      2. Enable auto-failover with the following settings:
      timeout: 90 seconds, max events: 5
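
      One way to set these values through the REST API (host and credentials are assumptions, mirroring the curl used below; not part of the original report):

      curl -u Administrator:password http://localhost:8091/settings/autoFailover \
        -d 'enabled=true' -d 'timeout=90' -d 'maxCount=5'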

      3. Enable a delay in the auto-failover path on .197 (the node orchestrating the failover, per the logs below):

      curl -k https://Administrator:password@localhost:18091/diag/eval -X POST -d 'testconditions:set(failover_end, {delay, 120000}).'
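
      (The testconditions hook above appears to hold the failover near its end for 120,000 ms, which keeps the first auto-failover in flight long enough for the next step to overlap with it.)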

      4. Bring down the data node 172.23.104.241 to trigger auto-failover.
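
      One way to take the node down, assuming a systemd-managed install (not part of the original report):

      ssh root@172.23.104.241 'systemctl stop couchbase-server'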

      ns_server:info,2024-06-09T22:39:53.095-07:00,ns_1@172.23.96.197:<0.1536.0>:ns_orchestrator:handle_event:670]Skipping janitor in state rebalancing
      [user:info,2024-06-09T22:39:56.544-07:00,ns_1@172.23.96.197:<0.1536.0>:ns_orchestrator:log_rebalance_completion:1661]Failover completed successfully.
      Rebalance Operation Id = 80a2b06c3374f8c4b7593f1629673562
      [ns_server:info,2024-06-09T22:39:56.602-07:00,ns_1@172.23.96.197:leader_registry<0.887.0>:leader_registry:handle_down:286]Process <0.17680.310> registered as 'ns_rebalance_observer' terminated.
      [user:info,2024-06-09T22:39:56.603-07:00,ns_1@172.23.96.197:<0.1538.0>:auto_failover:log_failover_success:662]Node ('ns_1@172.23.104.241') was automatically failed over. Reason: All monitors report node is unhealthy.
      [user:info,2024-06-09T22:39:56.969-07:00,ns_1@172.23.96.197:<0.25807.310>:failover:orchestrate:172]Starting failing over ['ns_1@172.23.136.103']
      

      5. Bring down another node, 172.23.136.103, while the first failover is still in progress, then quickly bring it back up, well within the 90-second timeout.
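
      One way to produce the short outage, again assuming a systemd-managed install (the ~15 s gap matches the down/up timestamps in the logs below):

      ssh root@172.23.136.103 'systemctl stop couchbase-server; sleep 15; systemctl start couchbase-server'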

      [chronicle:info,2024-06-09T22:38:01.488-07:00,ns_1@172.23.96.197:chronicle_proposer<0.10208.162>:chronicle_proposer:handle_down:1142]Observed agent {chronicle_agent,'ns_1@172.23.136.103'} on peer 'ns_1@172.23.136.103' go down with reason noconnection
      [user:warn,2024-06-09T22:38:01.488-07:00,ns_1@172.23.96.197:ns_node_disco<0.598.0>:ns_node_disco:handle_info:169]Node 'ns_1@172.23.96.197' saw that node 'ns_1@172.23.136.103' went down. Details: [{nodedown_reason,
                                                                                          connection_closed}]
      [chronicle:info,2024-06-09T22:38:01.489-07:00,ns_1@172.23.96.197:chronicle_proposer<0.10208.162>:chronicle_proposer:handle_nodedown:1135]Peer 'ns_1@172.23.136.103' went down: [{nodedown_reason,connection_closed}]
      [ns_server:info,2024-06-09T22:38:01.489-07:00,ns_1@172.23.96.197:ns_node_disco_events<0.596.0>:ns_node_disco_log:handle_event:40]ns_node_disco_log: nodes changed: ['ns_1@172.23.104.235',
                                         'ns_1@172.23.104.250',
                                         'ns_1@172.23.136.104','ns_1@172.23.96.197']
      [ns_server:info,2024-06-09T22:38:03.073-07:00,ns_1@172.23.96.197:<0.1536.0>:ns_orchestrator:handle_event:670]Skipping janitor in state rebalancing
      

      Logs showing the node coming back up:

      [user:info,2024-06-09T22:38:15.470-07:00,ns_1@172.23.96.197:ns_node_disco<0.598.0>:ns_node_disco:handle_info:163]Node 'ns_1@172.23.96.197' saw that node 'ns_1@172.23.136.103' came up. Tags: []
      [chronicle:info,2024-06-09T22:38:15.470-07:00,ns_1@172.23.96.197:chronicle_proposer<0.10208.162>:chronicle_proposer:handle_nodeup:1093]Peer 'ns_1@172.23.136.103' came up
      [ns_server:info,2024-06-09T22:38:15.471-07:00,ns_1@172.23.96.197:ns_node_disco_events<0.596.0>:ns_node_disco_log:handle_event:40]ns_node_disco_log: nodes changed: ['ns_1@172.23.104.235',
                                         'ns_1@172.23.104.250',
                                         'ns_1@172.23.136.103',
                                         'ns_1@172.23.136.104','ns_1@172.23.96.197']
      [ns_server:info,2024-06-09T22:38:15.471-07:00,ns_1@172.23.96.197:ns_config_rep<0.615.0>:ns_config_rep:handle_info:258]Replicating config to/from:
      ['ns_1@172.23.136.103']
      

      But eventually 172.23.136.103 is auto-failed over anyway, right after the delayed failover of .241 completes, even though it came back up well before the 90-second timeout:

      [user:info,2024-06-09T22:39:57.450-07:00,ns_1@172.23.96.197:<0.25807.310>:failover:orchestrate:184]Failed over ['ns_1@172.23.136.103']: ok
      [ns_server:info,2024-06-09T22:39:57.451-07:00,ns_1@172.23.96.197:leader_quorum_nodes_manager<0.1464.0>:leader_quorum_nodes_manager:handle_set_quorum_nodes:121]Updating quorum nodes.
      Old quorum nodes: ['ns_1@172.23.104.250','ns_1@172.23.136.103',
                         'ns_1@172.23.136.104','ns_1@172.23.104.235',
                         'ns_1@172.23.96.197']
      New quorum nodes: ['ns_1@172.23.104.250','ns_1@172.23.136.104',
                         'ns_1@172.23.104.235','ns_1@172.23.96.197']
      [user:info,2024-06-09T22:39:57.463-07:00,ns_1@172.23.96.197:<0.25807.310>:failover:deactivate_nodes:241]Deactivating failed over nodes ['ns_1@172.23.136.103']
      [user:info,2024-06-09T22:39:57.601-07:00,ns_1@172.23.96.197:<0.1536.0>:ns_orchestrator:log_rebalance_completion:1661]Failover completed successfully.
      Rebalance Operation Id = e4e1005a5888374306dd19fa147f8051
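
      To confirm the outcome, cluster membership can be checked over the REST API (a generic sketch, not taken from this report); after the second failover 172.23.136.103 shows clusterMembership "inactiveFailed":

      curl -s -u Administrator:password http://localhost:8091/pools/default | \
        jq '.nodes[] | {hostname, clusterMembership, status}'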
      

      The same behaviour is seen when the subsequent node runs a different service.

       

    People

      Assignee: Pulkit Matta
      Reporter: Pulkit Matta
