Details
- Bug
- Resolution: Not a Bug
- Critical
- 7.6.0
- Enterprise Edition 7.6.0 build 1728
- Untriaged
- Linux x86_64
- 0
- No
Description
Script to Repro
./sequoia -client 172.23.104.27:2375 -provider file:debian_pine.yml -test tests/integration/7.6/test_7.6.yml -scope tests/integration/7.6/scope_7.6_magma.yml -scale 3 -repeat 0 -log_level 0 -version 7.6.0-1728 -skip_setup=false -skip_test=false -skip_teardown=true -skip_cleanup=false -continue=false -collect_on_error=false -stop_on_error=false -duration=1209600 -show_topology=true
This is one of our early runs with fast failover enabled (1 s auto-failover timeout) on our longevity test, as requested by Abhijeeth Nuthan.
[2023-10-31T07:19:58-07:00, sequoiatools/couchbase-cli:7.6:d60f0f] setting-autofailover -c 172.23.106.108:8091 -u Administrator -p password --enable-auto-failover=1 --auto-failover-timeout=1 --max-failovers=10
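For reference, the same settings can presumably also be applied through the REST API rather than couchbase-cli. A minimal sketch, assuming the documented `/settings/autoFailover` endpoint and its `enabled`/`timeout`/`maxCount` form parameters; the values mirror the CLI flags in the log line above:

```python
# Sketch: build the form body for POST /settings/autoFailover, mirroring
# the couchbase-cli flags used in this run. The endpoint and parameter
# names are assumed from the public REST documentation; no request is sent.
from urllib.parse import urlencode

def autofailover_payload(enabled: bool, timeout_secs: int, max_failovers: int) -> str:
    """Form body for POST /settings/autoFailover."""
    return urlencode({
        "enabled": "true" if enabled else "false",
        "timeout": timeout_secs,      # --auto-failover-timeout=1
        "maxCount": max_failovers,    # --max-failovers=10
    })

body = autofailover_payload(True, 1, 10)
print(body)  # enabled=true&timeout=1&maxCount=10
# To apply (not executed here):
# curl -u Administrator:password -X POST \
#   http://172.23.106.108:8091/settings/autoFailover -d "$body"
```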
The issue occurred during the following step of the longevity run, where we hard failover a node (172.23.106.109) and rebalance it out:
[2023-10-31T18:28:58-07:00, sequoiatools/couchbase-cli:7.6:2c7c64] failover -c 172.23.106.108:8091 --server-failover 172.23.106.109:8091 -u Administrator -p password --hard
[2023-10-31T18:29:08-07:00, sequoiatools/couchbase-cli:7.6:0aebf7] rebalance -c 172.23.106.108:8091 -u Administrator -p password
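The two CLI steps above map to REST calls as well. A hedged sketch, assuming the documented `/controller/failOver` (hard failover, keyed by `otpNode`) and `/controller/rebalance` (`knownNodes`/`ejectedNodes`) endpoints; only the form bodies are built here, nothing is sent:

```python
# Sketch: form bodies for the hard-failover and rebalance REST endpoints.
# Node names use the otpNode form seen in the logs (ns_1@<ip>). Endpoint
# and parameter names are assumed from the public REST documentation.
from urllib.parse import urlencode

def hard_failover_body(node_ip: str) -> str:
    """Form body for POST /controller/failOver (hard failover)."""
    return urlencode({"otpNode": f"ns_1@{node_ip}"})

def rebalance_body(known_ips: list, ejected_ips: list) -> str:
    """Form body for POST /controller/rebalance."""
    return urlencode({
        "knownNodes": ",".join(f"ns_1@{ip}" for ip in known_ips),
        "ejectedNodes": ",".join(f"ns_1@{ip}" for ip in ejected_ips),
    })

print(hard_failover_body("172.23.106.109"))
# otpNode=ns_1%40172.23.106.109
```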
ns_1@172.23.104.215 6:29:01 PM 31 Oct, 2023
Failed over ['ns_1@172.23.106.109']: ok

ns_1@172.23.104.215 6:29:02 PM 31 Oct, 2023
Failover completed successfully.

Rebalance Operation Id = d77efaca00c7b3dbc719aa54792ab014

ns_1@172.23.104.215 6:29:11 PM 31 Oct, 2023
Starting rebalance, KeepNodes = ['ns_1@172.23.104.213','ns_1@172.23.104.215',
'ns_1@172.23.104.227','ns_1@172.23.105.237',
'ns_1@172.23.105.238','ns_1@172.23.105.63',
'ns_1@172.23.106.108','ns_1@172.23.106.110',
'ns_1@172.23.106.121','ns_1@172.23.106.124',
'ns_1@172.23.106.164','ns_1@172.23.120.59',
'ns_1@172.23.121.72','ns_1@172.23.121.87',
'ns_1@172.23.121.94','ns_1@172.23.124.27',
'ns_1@172.23.96.170','ns_1@172.23.96.186',
'ns_1@172.23.96.203','ns_1@172.23.96.251',
'ns_1@172.23.96.252','ns_1@172.23.96.253',
'ns_1@172.23.97.189','ns_1@172.23.97.229',
'ns_1@172.23.97.242','ns_1@172.23.97.243',
'ns_1@172.23.97.244','ns_1@172.23.97.245'], EjectNodes = [], Failed over and being ejected nodes = ['ns_1@172.23.106.109']; no delta recovery nodes; Operation Id = caf337cdcce2c0887390af6a11e6356a
While the above rebalance was running, we saw that node 172.23.106.108 was repeatedly being considered for auto-failover:
ns_1@172.23.104.215 8:18:11 PM 31 Oct, 2023
Could not automatically fail over nodes (['ns_1@172.23.106.108']). Rebalance is running. (repeated 1 times, last seen 45.471187 secs ago)
This is interesting because the behavior of an ongoing rebalance blocking auto-failover was supposed to have been removed through MB-49059. We saw the above message possibly hundreds of times, yet auto-failover was never triggered because of the ongoing rebalance.
Interestingly, a few hours later the orchestrator tried to fail over another 7 nodes, which again failed because of the ongoing rebalance:
ns_1@172.23.104.215 10:35:20 PM 31 Oct, 2023
Could not automatically fail over nodes (['ns_1@172.23.106.108',
'ns_1@172.23.96.170',
'ns_1@172.23.96.186',
'ns_1@172.23.96.251',
'ns_1@172.23.97.189',
'ns_1@172.23.97.229',
'ns_1@172.23.97.244']). Rebalance is running.
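The repeated message above suggests a guard of roughly the following shape. This is purely an illustrative sketch of the observed behavior (ns_server itself is Erlang, and the function and message text here are only modeled on the log lines quoted in this report):

```python
# Illustrative sketch (not actual ns_server code) of the guard this report
# questions: auto-failover of down nodes appears to be deferred while a
# rebalance is in progress, producing the repeated log message above.
def try_autofailover(down_nodes: list, rebalance_running: bool) -> str:
    if not down_nodes:
        return "nothing to do"
    if rebalance_running:
        # Observed behavior; per MB-49059 the expectation was that
        # auto-failover would instead be able to interrupt the rebalance.
        return f"Could not automatically fail over nodes ({down_nodes}). Rebalance is running."
    return f"Failed over {down_nodes}"

print(try_autofailover(["ns_1@172.23.106.108"], rebalance_running=True))
```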
I believe there are two important questions to be answered here:
- Why are multiple nodes being considered for auto-failover? (Possibly because reducing the auto-failover timeout to 1 s makes us more prone to auto-failovers from flakiness in the system, which would suggest that a setup like ours might not be a good use case for fast failover.)
- Why is auto-failover being prevented by an ongoing rebalance, when that behavior was removed in 7.5 through MB-49059?
cbcollect_info attached.