Details

Type: Bug
Resolution: Duplicate
Priority: Major
Fix Version/s: 7.6.0
Affects Version/s: 7.6.0
Component/s: ns_server
Labels:
- fast_failover
Environment:
7.6.0-1507
Centos 7 64bit

Triage:
Untriaged
Operating System:
Centos 64-bit
Link to Log File, atop/blg, CBCollectInfo, Core dump:

https://cb-jira.s3.us-east-2.amazonaws.com/logs/fast_fo_time_issue/collectinfo-2023-09-15T044446-ns_1%40172.23.110.64.zip
https://cb-jira.s3.us-east-2.amazonaws.com/logs/fast_fo_time_issue/collectinfo-2023-09-15T044446-ns_1%40172.23.110.65.zip
https://cb-jira.s3.us-east-2.amazonaws.com/logs/fast_fo_time_issue/collectinfo-2023-09-15T044446-ns_1%40172.23.110.66.zip
https://cb-jira.s3.us-east-2.amazonaws.com/logs/fast_fo_time_issue/collectinfo-2023-09-15T044446-ns_1%40172.23.110.67.zip
https://cb-jira.s3.us-east-2.amazonaws.com/logs/fast_fo_time_issue/collectinfo-2023-09-15T044446-ns_1%40172.23.110.68.zip

Story Points:
0
Is this a Regression?:
No

Description

Steps:

5 node KV cluster with n2n encryption level=all
1 magma bucket with replica=3
Set auto-failover timeout=1
Induce failure (stop_memcached using SIGSTOP) on '172.23.110.65'
Wait for auto-failover to happen
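
The failure-injection step above can be illustrated with a minimal sketch. This is not the TAF implementation of `stop_memcached`; the helper names (`freeze`, `thaw`, `demo`) are hypothetical, and a short-lived child Python process stands in for memcached so the sketch can run on any POSIX host. It shows why SIGSTOP is a useful failure mode: the process stays alive but stops responding.

```python
import os
import signal
import subprocess
import sys


def freeze(pid: int) -> None:
    """Suspend a process with SIGSTOP; it remains alive but unresponsive."""
    os.kill(pid, signal.SIGSTOP)


def thaw(pid: int) -> None:
    """Resume a previously suspended process with SIGCONT."""
    os.kill(pid, signal.SIGCONT)


def demo() -> bool:
    """Freeze a stand-in child process and report whether it was stopped."""
    # Stand-in for memcached (assumption): a child that just sleeps.
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(30)"])
    try:
        freeze(child.pid)
        # WUNTRACED makes waitpid return once the child has been stopped.
        _, status = os.waitpid(child.pid, os.WUNTRACED)
        stopped = os.WIFSTOPPED(status)
        thaw(child.pid)
        return stopped
    finally:
        child.kill()
        child.wait()


if __name__ == "__main__":
    print("child stopped:", demo())
```

In the real test the signal would target the memcached process on '172.23.110.65', and the process would be left frozen until auto-failover has been observed.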

Observations:

From the test's point of view, the failure is induced at "21:40:22.944", and ns_server detects that the node is down immediately:

[ns_server:debug,2023-09-14T21:40:22.944-07:00,ns_1@172.23.110.64:<0.15702.0>:auto_failover:log_down_nodes_reason:403]Node 'ns_1@172.23.110.65' is considered down. Reason:"The data service did not respond. Either none of the buckets have warmed up or there is an issue with the data service. "

However, ns_server takes about 5 seconds to trigger the auto-failover, even though the auto-failover timeout is set to 1:

[ns_server:debug,2023-09-14T21:40:27.946-07:00,ns_1@172.23.110.64:<0.15700.0>:failover:start:44]Starting failover with Nodes = ['ns_1@172.23.110.65'], Options = #{allow_unsafe =>...
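
The gap between the two log entries above can be checked with a small script. This is only a sketch: the function names are hypothetical, the log messages are shortened copies of the lines quoted above, and the configured timeout of 1 is assumed to be in seconds.

```python
import re
from datetime import datetime

# Matches the "[component:level,timestamp," prefix of an ns_server log line.
LOG_TS = re.compile(r"^\[[^,\]]+,([^,\]]+),")


def log_time(line: str) -> datetime:
    """Extract the timezone-aware timestamp from an ns_server log line."""
    match = LOG_TS.match(line)
    if match is None:
        raise ValueError("no timestamp found in log line")
    return datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S.%f%z")


def failover_delay_seconds(down_line: str, start_line: str) -> float:
    """Seconds between the node-down entry and the failover-start entry."""
    return (log_time(start_line) - log_time(down_line)).total_seconds()


# Shortened copies of the two log lines from this ticket.
DOWN_LINE = (
    "[ns_server:debug,2023-09-14T21:40:22.944-07:00,"
    "ns_1@172.23.110.64:<0.15702.0>:auto_failover:log_down_nodes_reason:403]"
    "Node 'ns_1@172.23.110.65' is considered down."
)
START_LINE = (
    "[ns_server:debug,2023-09-14T21:40:27.946-07:00,"
    "ns_1@172.23.110.64:<0.15700.0>:failover:start:44]"
    "Starting failover with Nodes = ['ns_1@172.23.110.65']"
)

CONFIGURED_TIMEOUT_S = 1  # assumption: the timeout unit is seconds

delay = failover_delay_seconds(DOWN_LINE, START_LINE)
print(f"observed delay: {delay:.3f} s, configured timeout: {CONFIGURED_TIMEOUT_S} s")
```

For these two entries the computed delay is 5.002 seconds, roughly five times the configured timeout.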

TAF test:

failover.concurrent_failovers.ConcurrentFailoverTests:

    test_concurrent_failover,nodes_init=5,services_init=kv-kv-kv-kv-kv,replicas=3,maxCount=1,timeout=1,failover_order=kv,failover_method=stop_memcached,bucket_spec=single_bucket.default

Attachments

Issue Links

duplicates

MB-58636 Janitor cannot be cancelled if memcached is unresponsive in query_vbuckets_loop

Closed

Activity

People

Assignee: Ashwin Govindarajulu

Reporter: Ashwin Govindarajulu

Votes: 0

Watchers: 5

Dates

Created: 14/Sep/23 10:00 PM

Updated: 19/Sep/23 9:15 PM

Resolved: 15/Sep/23 9:22 AM

Auto failover timeout not honoured for stop_memcached scenario