Couchbase Server
MB-58749

Orchestrator selection goes wrong during failover in a network restart scenario


Details

    • Untriaged
    • Linux x86_64
    • 0
    • Yes

    Description

      Steps:

      • 3 node KV cluster

        +----------------+---------+--------+-----------+----------+------------------------+
        | Nodes          | Zone    | CPU    | Mem_total | Mem_free | Swap_mem_used          |
        +----------------+---------+--------+-----------+----------+------------------------+
        | 172.23.97.199  | Group 1 | 0.4483 | 3.81 GiB  | 3.01 GiB | 13.25 MiB / 976.00 MiB |
        | 172.23.97.200  | Group 1 | 3.6319 | 3.81 GiB  | 2.96 GiB | 21.13 MiB / 976.00 MiB |
        | 172.23.121.117 | Group 1 | 0.4005 | 3.81 GiB  | 3.03 GiB | 13.50 MiB / 976.00 MiB |
        +----------------+---------+--------+-----------+----------+------------------------+
        

      • magma bucket with replicas=2

        +---------+-----------+---------+----------+------------+-------+----------+-----------+------------+
        | Bucket  | Type      | Storage | Replicas | Durability | Items | Vbuckets | RAM Quota | RAM Used   |
        +---------+-----------+---------+----------+------------+-------+----------+-----------+------------+
        | default | couchbase | magma   | 2        | none       | 37366 | 1024     | 8.84 GiB  | 449.49 MiB |
        +---------+-----------+---------+----------+------------+-------+----------+-----------+------------+
        

      • Set auto-failover timeout=5 (see the REST sketch after these steps)
      • Induce failure by running the 'restart_network' action on '172.23.97.199'

        Command used:
        service network stop && sleep {} && service network start
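
      For reference, the bucket and auto-failover settings above can be applied with plain REST calls roughly as below. The host, credentials, and per-node ramQuota are placeholders, not values taken from this run:

        # Placeholder host/credentials; adjust for the cluster under test.
        CB=http://172.23.97.200:8091
        AUTH='-u Administrator:password'

        # Create the magma bucket with 2 replicas (ramQuota is per node, in MB).
        curl -s $AUTH -X POST $CB/pools/default/buckets \
          -d name=default \
          -d bucketType=couchbase \
          -d storageBackend=magma \
          -d replicaNumber=2 \
          -d ramQuota=3000

        # Enable auto-failover with a 5 second timeout.
        curl -s $AUTH -X POST $CB/settings/autoFailover \
          -d enabled=true \
          -d timeout=5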

      Observation:

      Failover of node '172.23.97.199' was initiated by the current master node (172.23.97.200)

      Post failover, the orchestrator is reported as ".199", i.e. the very node that was just failed over.
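
      One way to double-check which node ns_server currently reports as the orchestrator (the endpoint and diag/eval expression below are assumptions based on recent 7.x builds, not taken from this run):

        # Orchestrator as reported over REST (recent builds include it in terseClusterInfo).
        curl -s -u Administrator:password http://172.23.97.200:8091/pools/default/terseClusterInfo

        # Or ask ns_server directly; mb_master:master_node/0 is assumed to exist on this build.
        curl -s -u Administrator:password http://172.23.97.200:8091/diag/eval \
          -d 'mb_master:master_node().'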

       

      From .200 node's logs:

      [ns_server:debug,2023-09-20T01:54:00.662-07:00,ns_1@172.23.97.200:<0.853.0>:auto_failover:log_down_nodes_reason:403]Node 'ns_1@172.23.97.199' is considered down. Reason:"The cluster manager did not respond. "
      ...
      [ns_server:debug,2023-09-20T01:54:01.664-07:00,ns_1@172.23.97.200:<0.853.0>:auto_failover:log_down_nodes_reason:403]Node 'ns_1@172.23.97.199' is considered down. Reason:"All monitors report node is unhealthy."
      ...
      [ns_server:debug,2023-09-20T01:54:05.670-07:00,ns_1@172.23.97.200:<0.851.0>:failover:start:44]Starting failover with Nodes = ['ns_1@172.23.97.199'],
          Options = #{allow_unsafe => false,
                      auto => true,
                      down_nodes => ['ns_1@172.23.97.199'],
                      failover_reasons =>
                      [{'ns_1@172.23.97.199', "All monitors report node is unhealthy."}]}
      ...
      [user:info,2023-09-20T01:54:08.232-07:00,ns_1@172.23.97.200:<0.22237.0>:failover:deactivate_nodes:225]Deactivating failed over nodes ['ns_1@172.23.97.199']
      [ns_server:debug,2023-09-20T01:54:08.233-07:00,ns_1@172.23.97.200:<0.22237.0>:chronicle_master:call:71]Calling chronicle_master with {deactivate_nodes,['ns_1@172.23.97.199']}
      [ns_server:debug,2023-09-20T01:54:08.259-07:00,ns_1@172.23.97.200:<0.770.0>:chronicle_master:handle_oper:342]Starting kv operation {deactivate_nodes,['ns_1@172.23.97.199']} with lock <<"210e2cf5f3a001080e5904f6ed8b9eb0">>
      

      TAF test:

      failover.AutoFailoverTests.AutoFailoverTests:
           test_autofailover,timeout=5,num_node_failures=1,nodes_init=4,failover_action=restart_network,nodes_init=3,replicas=2,durability=MAJORITY_AND_PERSIST_TO_ACTIVE,num_items=50000 
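
      A rough manual equivalent of the post-failover check (a hypothetical sketch, not the TAF test's actual assertion; the host, credentials, "count" field, and diag/eval expression are assumptions):

        #!/bin/sh
        CB=http://172.23.97.200:8091
        AUTH='-u Administrator:password'
        FAILED_NODE=ns_1@172.23.97.199

        # Wait for auto-failover to fire: "count" in /settings/autoFailover goes from 0 to a positive value.
        until curl -s $AUTH $CB/settings/autoFailover | grep -q '"count":[1-9]'; do
          sleep 1
        done

        # The reported orchestrator should not be the node that was just failed over.
        ORCH=$(curl -s $AUTH $CB/diag/eval -d 'mb_master:master_node().')
        echo "orchestrator reported as: $ORCH"
        case "$ORCH" in
          *"$FAILED_NODE"*) echo "BUG: failed-over node still reported as orchestrator" ;;
        esac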
