Couchbase Server / MB-61881

Ticks for a subsequent node down do not seem to happen during an ongoing failover


Details

    Description

      Steps
      1. Create a 6-node cluster (a CLI sketch follows the node list):
      172.23.136.104 - data
      172.23.136.106 - data
      172.23.136.109 - data
      172.23.136.110 - query
      172.23.136.114 - index
      172.23.136.115 - data
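      For reference, one way to assemble such a cluster might be with couchbase-cli; a rough sketch (hostnames and credentials taken from the steps above, exact flags can vary by release):

       # run against an already-initialised first node; repeat server-add per node/service
       couchbase-cli server-add -c 172.23.136.110:8091 -u Administrator -p password \
           --server-add 172.23.136.104 --server-add-username Administrator \
           --server-add-password password --services data
       couchbase-cli rebalance -c 172.23.136.110:8091 -u Administrator -p password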

      2. Hit the API that delays autofailover by 1 minute:

       curl -k https://Administrator:password@localhost:18091/diag/eval -X POST -d 'testconditions:set(failover_start, {delay,60000 })'

      3. Set the auto-failover timeout to 60 seconds and the maximum count to 2 nodes (see the sketch below).
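      Assuming the standard auto-failover settings endpoint, this step might look like:

       curl -k -u Administrator:password -X POST https://localhost:18091/settings/autoFailover \
           -d 'enabled=true&timeout=60&maxCount=2'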

      4. Bring down .104; autofailover starts for .104 (one way to take the node down is sketched after the log excerpt):

      [user:info,2024-05-13T22:48:31.455-07:00,ns_1@172.23.136.110:<0.17382.6>:failover:orchestrate:172]Starting failing over ['ns_1@172.23.136.104']
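      One way to take a node down for this step (and for .109 in step 5) might be stopping the service on it, assuming a systemd-managed install; the nodedown_reason of shutdown seen later is consistent with a clean stop:

       ssh root@172.23.136.104 'systemctl stop couchbase-server'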
      

      5. While autofailover is delayed, bring down a second node, .109, roughly in the middle of the ongoing failover, after ~30 seconds have passed:

      [user:warn,2024-05-13T22:49:02.053-07:00,ns_1@172.23.136.106:ns_node_disco<0.7214.0>:ns_node_disco:handle_info:169]Node 'ns_1@172.23.136.106' saw that node 'ns_1@172.23.136.109' went down. Details: [{nodedown_reason, shutdown}]
      

      6. Autofailover for .104 fails, likely because it cannot activate replicas on .109 (now down):

      [user:error,2024-05-13T22:49:31.524-07:00,ns_1@172.23.136.110:<0.8883.0>:ns_orchestrator:log_rebalance_completion:1661]Failover exited with reason {failover_failed,"gamesim-sample",
                                      "Failed to get failover info for bucket \"gamesim-sample\": ['ns_1@172.23.136.109']"}.
      Rebalance Operation Id = 14aa0cd61ddc898532fcb445e44e14fc
      

      Now the next failover, of both .104 and .109, was expected within roughly another 30 seconds: the auto-failover timeout is 60 seconds, .109 had already been down for about 30 seconds when the first failover exited, and its down ticks should have kept accumulating while the AFO was delayed. Instead, the next failover took roughly 60 more seconds; the timestamp arithmetic is sketched after the log excerpt below.

      [user:info,2024-05-13T22:50:33.094-07:00,ns_1@172.23.136.110:<0.25124.6>:failover:orchestrate:184]Failed over ['ns_1@172.23.136.104','ns_1@172.23.136.109']: ok
      [ns_server:info,2024-05-13T22:50:33.095-07:00,ns_1@172.23.136.110:leader_quorum_nodes_manager<0.8852.0>:leader_quorum_nodes_manager:handle_set_quorum_nodes:121]Updating quorum nodes.
      Old quorum nodes: ['ns_1@172.23.136.110','ns_1@172.23.136.104',
                         'ns_1@172.23.136.114','ns_1@172.23.136.115',
                         'ns_1@172.23.136.106','ns_1@172.23.136.109']
      New quorum nodes: ['ns_1@172.23.136.110','ns_1@172.23.136.114',
                         'ns_1@172.23.136.115','ns_1@172.23.136.106']
      [ns_server:error,2024-05-13T22:50:33.105-07:00,ns_1@172.23.136.110:leader_quorum_nodes_manager<0.8852.0>:ns_config_rep:synchronize_remote:356]Failed to synchronize config to some nodes: 
      [{'ns_1@172.23.136.109',
           {exit,
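      The expected-versus-observed gap can be checked from the timestamps above; a quick sketch (GNU date assumed, sub-second parts dropped):

       down=$(date -d '2024-05-13T22:49:02' +%s)        # .109 seen down
       first_fail=$(date -d '2024-05-13T22:49:31' +%s)  # failover of .104 exits with failover_failed
       actual=$(date -d '2024-05-13T22:50:33' +%s)      # .104 and .109 actually failed over
       echo "expected wait after the failed failover: $(( down + 60 - first_fail )) s"  # ~31 s if ticks kept accumulating
       echo "observed wait after the failed failover: $(( actual - first_fail )) s"     # ~62 s, ticks appear to restart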

       
