Details

Type: Bug
Resolution: Duplicate
Priority: Major
Fix Version/s: 7.6.0
Affects Version/s: 7.6.0
Component/s: ns_server
Labels:
- fast_failover
Environment:
7.6.0-1507
Centos 7 64bit

Triage:
Untriaged
Operating System:
Centos 64-bit
Link to Log File, atop/blg, CBCollectInfo, Core dump:

https://cb-jira.s3.us-east-2.amazonaws.com/logs/fast_fo_time_issue/collectinfo-2023-09-15T044446-ns_1%40172.23.110.64.zip
https://cb-jira.s3.us-east-2.amazonaws.com/logs/fast_fo_time_issue/collectinfo-2023-09-15T044446-ns_1%40172.23.110.65.zip
https://cb-jira.s3.us-east-2.amazonaws.com/logs/fast_fo_time_issue/collectinfo-2023-09-15T044446-ns_1%40172.23.110.66.zip
https://cb-jira.s3.us-east-2.amazonaws.com/logs/fast_fo_time_issue/collectinfo-2023-09-15T044446-ns_1%40172.23.110.67.zip
https://cb-jira.s3.us-east-2.amazonaws.com/logs/fast_fo_time_issue/collectinfo-2023-09-15T044446-ns_1%40172.23.110.68.zip

Story Points:
0
Is this a Regression?:
No

Description

Steps:

Create a 5-node cluster with n2n encryption level=all
Set auto-failover timeout=1 (second)
Induce a stop_memcached failure on node '172.23.110.65' (kill -STOP)
Wait for auto-failover to happen
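
For reference, the auto-failover setting in the steps above could be applied through the cluster REST endpoint `/settings/autoFailover`. The sketch below only builds the request and does not send it. The host, port 8091, and the credentials are assumptions for illustration; the `timeout=1` and `maxCount=1` values come from the test parameters in this ticket.

```python
import base64
import urllib.parse
import urllib.request


def build_auto_failover_request(host, user, password, timeout_s=1, max_count=1):
    """Build (but do not send) a POST to /settings/autoFailover."""
    body = urllib.parse.urlencode(
        {"enabled": "true", "timeout": timeout_s, "maxCount": max_count}
    ).encode("ascii")
    req = urllib.request.Request(
        f"http://{host}:8091/settings/autoFailover",  # port is an assumption
        data=body,
        method="POST",
    )
    # Basic auth header; the credentials passed in are placeholders.
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req.add_header("Authorization", f"Basic {token}")
    return req


# Hypothetical orchestrator node and credentials.
req = build_auto_failover_request("172.23.110.64", "Administrator", "password")
print(req.get_method(), req.full_url, req.data.decode())
```

Passing the request to `urllib.request.urlopen(req)` would apply the setting on a reachable cluster; on a cluster that only accepts TLS, the scheme and port would need to change.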

Observations:

Per the test log, stop_memcached was triggered at '21:40:22,424', and ns_server reported the node as down about 0.5 seconds later:

[ns_server:debug,2023-09-14T21:40:22.944-07:00,ns_1@172.23.110.64:<0.15702.0>:auto_failover:log_down_nodes_reason:403]Node 'ns_1@172.23.110.65' is considered down. Reason:"The data service did not respond. Either none of the buckets have warmed up or there is an issue with the data service. "

However, the failover itself was initiated only ~5 seconds after the node was reported down, even though the auto-failover timeout was set to 1 second:

[ns_server:debug,2023-09-14T21:40:27.946-07:00,ns_1@172.23.110.64:<0.15700.0>:failover:start:44]Starting failover with Nodes = ['ns_1@172.23.110.65'], Options = #{allow_unsafe => ...
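
The gap between the two ns_server log entries above can be checked mechanically. This is a minimal sketch that assumes the `[ns_server:<level>,<ISO-8601 timestamp>,...]` prefix seen in those entries; the log lines in the code are abbreviated copies, and the helper name `log_time` is hypothetical.

```python
import re
from datetime import datetime

# Matches the timestamp in a "[ns_server:<level>,<timestamp>,..." log prefix.
TS_RE = re.compile(
    r"\[ns_server:\w+,(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3}[+-]\d{2}:\d{2}),"
)


def log_time(line):
    """Return the timezone-aware timestamp of an ns_server log line."""
    match = TS_RE.search(line)
    if match is None:
        raise ValueError("no ns_server timestamp found")
    return datetime.fromisoformat(match.group(1))


# Abbreviated copies of the two log entries quoted in this ticket.
down_line = (
    "[ns_server:debug,2023-09-14T21:40:22.944-07:00,ns_1@172.23.110.64:"
    "<0.15702.0>:auto_failover:log_down_nodes_reason:403]"
    "Node 'ns_1@172.23.110.65' is considered down."
)
start_line = (
    "[ns_server:debug,2023-09-14T21:40:27.946-07:00,ns_1@172.23.110.64:"
    "<0.15700.0>:failover:start:44]Starting failover with Nodes = "
    "['ns_1@172.23.110.65']"
)

delay = (log_time(start_line) - log_time(down_line)).total_seconds()
print(f"detection-to-failover delay: {delay:.3f} s")
```

For these two entries the delay comes out at roughly 5 seconds, against a configured timeout of 1 second.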

TAF test:

failover.concurrent_failovers.ConcurrentFailoverTests:

    test_concurrent_failover,nodes_init=5,services_init=kv-kv-kv-kv-kv,replicas=3,maxCount=1,timeout=1,failover_order=kv,failover_method=stop_memcached,bucket_spec=single_bucket.default

Attachments


test.log
1.14 MB
14/Sep/23 10:27 PM

Issue Links

duplicates

MB-58636 Janitor cannot be cancelled if memcached is unresponsive in query_vbuckets_loop

Closed

People

Assignee: Ben Huddleston

Reporter: Ashwin Govindarajulu

Votes: 0

Watchers: 4

Dates

Created: 14/Sep/23 10:00 PM

Updated: 19/Sep/23 9:15 PM

Resolved: 18/Sep/23 1:39 AM

Gerrit Reviews

There are no open Gerrit changes

Auto failover timeout not honoured for stop_memcached scenario with timeout=1
