Couchbase Server / MB-59397

[System Test][Fast failover]: Nodes trying to get auto failed over are prevented by a running rebalance.


Details

    • Type: Bug
    • Resolution: Not a Bug
    • Priority: Critical
    • Affects Version: 7.6.0
    • Fix Version: 7.6.0
    • Component: ns_server

    • Environment: Enterprise Edition 7.6.0 build 1728
    • Triage: Untriaged
    • Operating System: Linux x86_64

    Description

      Script to Repro

      ./sequoia -client 172.23.104.27:2375 -provider file:debian_pine.yml -test tests/integration/7.6/test_7.6.yml -scope tests/integration/7.6/scope_7.6_magma.yml -scale 3 -repeat 0 -log_level 0 -version 7.6.0-1728 -skip_setup=false -skip_test=false -skip_teardown=true -skip_cleanup=false -continue=false -collect_on_error=false -stop_on_error=false -duration=1209600 -show_topology=true
      

      This is one of our early runs with fast failover enabled (1s AF timeout), as requested by Abhijeeth Nuthan, on our longevity test.

      [2023-10-31T07:19:58-07:00, sequoiatools/couchbase-cli:7.6:d60f0f] setting-autofailover -c 172.23.106.108:8091 -u Administrator -p password --enable-auto-failover=1 --auto-failover-timeout=1 --max-failovers=10
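      For reference, the couchbase-cli command above corresponds to the /settings/autoFailover REST endpoint. Below is a minimal sketch (Python, no live cluster; the JSON is a hard-coded illustrative sample of the endpoint's response, not output captured from this run) of how one might sanity-check that the aggressive 1s timeout actually took effect:

```python
import json

# Illustrative sample of what GET /settings/autoFailover returns; field
# names follow the documented REST API, values mirror the settings
# applied above (enable=1, timeout=1, max-failovers=10).
SAMPLE_RESPONSE = json.dumps({
    "enabled": True,
    "timeout": 1,
    "count": 0,
    "maxCount": 10,
})

def check_af_settings(body: str) -> list:
    """Return a list of mismatches between actual and expected AF settings."""
    settings = json.loads(body)
    problems = []
    if not settings.get("enabled"):
        problems.append("auto-failover is disabled")
    if settings.get("timeout") != 1:
        problems.append("timeout is %ss, expected 1s" % settings.get("timeout"))
    if settings.get("maxCount") != 10:
        problems.append("maxCount is %s, expected 10" % settings.get("maxCount"))
    return problems

print(check_af_settings(SAMPLE_RESPONSE))  # [] when the settings took effect
```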
      

      The issue showed up during the following step of the longevity run, where we hard failover a node (172.23.106.109) and rebalance it out:

      [2023-10-31T18:28:58-07:00, sequoiatools/couchbase-cli:7.6:2c7c64] failover -c 172.23.106.108:8091 --server-failover 172.23.106.109:8091 -u Administrator -p password --hard
      [2023-10-31T18:29:08-07:00, sequoiatools/couchbase-cli:7.6:0aebf7] rebalance -c 172.23.106.108:8091 -u Administrator -p password
      

      ns_1@172.23.104.215 6:29:01 PM 31 Oct, 2023

      Failed over ['ns_1@172.23.106.109']: ok
      

      ns_1@172.23.104.215 6:29:02 PM 31 Oct, 2023

      Failover completed successfully.
      Rebalance Operation Id = d77efaca00c7b3dbc719aa54792ab014
      

      ns_1@172.23.104.215 6:29:11 PM 31 Oct, 2023

      Starting rebalance, KeepNodes = ['ns_1@172.23.104.213','ns_1@172.23.104.215',
      'ns_1@172.23.104.227','ns_1@172.23.105.237',
      'ns_1@172.23.105.238','ns_1@172.23.105.63',
      'ns_1@172.23.106.108','ns_1@172.23.106.110',
      'ns_1@172.23.106.121','ns_1@172.23.106.124',
      'ns_1@172.23.106.164','ns_1@172.23.120.59',
      'ns_1@172.23.121.72','ns_1@172.23.121.87',
      'ns_1@172.23.121.94','ns_1@172.23.124.27',
      'ns_1@172.23.96.170','ns_1@172.23.96.186',
      'ns_1@172.23.96.203','ns_1@172.23.96.251',
      'ns_1@172.23.96.252','ns_1@172.23.96.253',
      'ns_1@172.23.97.189','ns_1@172.23.97.229',
      'ns_1@172.23.97.242','ns_1@172.23.97.243',
      'ns_1@172.23.97.244','ns_1@172.23.97.245'], EjectNodes = [], Failed over and being ejected nodes = ['ns_1@172.23.106.109']; no delta recovery nodes; Operation Id = caf337cdcce2c0887390af6a11e6356a
      

      While the above rebalance was running, we saw that node 172.23.106.108 was trying to get auto failed over.

      ns_1@172.23.104.215 8:18:11 PM 31 Oct, 2023

      Could not automatically fail over nodes (['ns_1@172.23.106.108']). Rebalance is running. (repeated 1 times, last seen 45.471187 secs ago)
      

      This is interesting because the behaviour that prevented AF from aborting a running rebalance was removed through MB-49059. We saw the above message possibly hundreds of times, but AF was never triggered because of the ongoing rebalance.
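      To make the discrepancy concrete, here is a simplified, hypothetical sketch (not ns_server's actual Erlang code) of the two policies: before MB-49059 the orchestrator skipped AF while a rebalance ran; after MB-49059 (7.5+) AF is expected to interrupt the rebalance and proceed.

```python
def af_decision(rebalance_running: bool, interrupt_rebalance: bool) -> list:
    """Hypothetical simplification of the orchestrator's auto-failover choice.

    interrupt_rebalance=False models the pre-MB-49059 behaviour;
    True models the 7.5+ behaviour this test expected to see.
    """
    if not rebalance_running:
        return ["failover"]
    if interrupt_rebalance:
        # 7.5+ per MB-49059: stop the rebalance, then fail the node over.
        return ["stop_rebalance", "failover"]
    # Pre-MB-49059 branch; matches the "Could not automatically fail over
    # nodes ... Rebalance is running." message observed in this run.
    return ["log_could_not_failover"]

# What 7.6.0-1728 should do per MB-49059:
print(af_decision(rebalance_running=True, interrupt_rebalance=True))
# What the logs above actually show:
print(af_decision(rebalance_running=True, interrupt_rebalance=False))
```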

      Interestingly, a few hours later the orchestrator tried to auto fail over another 7 nodes, which again failed because of the ongoing rebalance.
      ns_1@172.23.104.215 10:35:20 PM 31 Oct, 2023

      Could not automatically fail over nodes (['ns_1@172.23.106.108',
      'ns_1@172.23.96.170',
      'ns_1@172.23.96.186',
      'ns_1@172.23.96.251',
      'ns_1@172.23.97.189',
      'ns_1@172.23.97.229',
      'ns_1@172.23.97.244']). Rebalance is running.
      

      I believe there are two important questions to be answered here.

      1. Why are multiple nodes trying to get auto failed over? (Possibly because, with the AF timeout reduced to 1s, we are now more prone to AFs from ordinary flakiness in the system, suggesting that a setup like ours might not be a good use case for fast failover.)
      2. Why is AF being prevented by the ongoing rebalance, given that this restriction was removed in 7.5 through MB-49059?
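      When correlating these AF attempts with the rebalance window, one way to tell whether a rebalance was in flight is the /pools/default/tasks endpoint. A minimal sketch (the task list here is a hard-coded illustrative sample trimmed to the fields used, not output from this cluster):

```python
import json

# Illustrative sample of GET /pools/default/tasks output; the real
# endpoint returns richer task objects, this keeps only type/status/progress.
SAMPLE_TASKS = json.dumps([
    {"type": "rebalance", "status": "running", "progress": 42.5},
])

def running_rebalance(body: str):
    """Return the running rebalance task from a tasks response, if any."""
    for task in json.loads(body):
        if task.get("type") == "rebalance" and task.get("status") == "running":
            return task
    return None

task = running_rebalance(SAMPLE_TASKS)
if task is not None:
    print("rebalance running, progress %.1f%%" % task["progress"])
```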

      cbcollect_info attached.

      Attachments


        Activity

          People

            Assignee: Balakumaran Gopal
            Reporter: Balakumaran Gopal
            Votes: 0
            Watchers: 3


              Gerrit Reviews

                There are no open Gerrit changes
