Details
- Bug
- Resolution: Fixed
- Critical
- 7.0.0
- Untriaged
- 1
- No
Description
Build: 7.1.0-1623
Test: -test tests/2i/neo/test_neo_idx_clusterops_recovery.yml -scope tests/2i/neo/scope_neo_plasma_idx_dgm.yml
Scale: 2
Iteration: 1st
The N1QL/GSI component system test now includes a step that triggers auto-failover for an indexer node.
Auto-failover is configured with a 30s timeout. The Couchbase service was stopped on 172.23.97.217 at 2021-11-03T01:56:05. The orchestrator node is 172.23.97.215, and its debug logs show the following:
=========================NOTICE REPORT=========================
{net_kernel,{net_kernel,1157,nodedown,'ns_1@172.23.97.217'}}
[ns_server:debug,2021-11-03T01:56:40.657-07:00,ns_1@172.23.97.215:<0.8743.0>:auto_failover_logic:log_master_activity:145]Incremented down state:
{node_state,{'ns_1@172.23.97.217',<<"7d16aca9f67533ebdb0ec4c65e5d0b08">>},
1,nearly_down,false}
->{node_state,{'ns_1@172.23.97.217',<<"7d16aca9f67533ebdb0ec4c65e5d0b08">>},
1,failover,false}
[ns_server:debug,2021-11-03T01:56:40.657-07:00,ns_1@172.23.97.215:<0.8743.0>:auto_failover_logic:process_frame:324]Decided on following actions: [{failover,
[{'ns_1@172.23.97.217',
<<"7d16aca9f67533ebdb0ec4c65e5d0b08">>}]}]
[user:info,2021-11-03T01:56:40.673-07:00,ns_1@172.23.97.215:<0.8743.0>:auto_failover:log_unsafe_node:633]Could not automatically fail over node ('ns_1@172.23.97.217') due to operation being unsafe for service index. Failing over nodes 172.23.97.217:9102(7d16aca9f67533ebdb0ec4c65e5d0b08) would lose the following indexes/partitions: bucket1.scope_3.coll_11.idx6_i15Ef 5
[error_logger:info,2021-11-03T01:56:40.724-07:00,ns_1@172.23.97.215:net_kernel<0.8371.0>:ale_error_logger_handler:do_log:101]
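As a quick sanity check on the timing, the gap between the service stop and the failover decision can be computed from the two timestamps quoted above. This is only a sketch: the stop time in the report carries no UTC offset, so the code assumes it is in the same zone as the log lines (-07:00).

```python
from datetime import datetime

# Timestamps copied from the report. The -07:00 offset on the stop time is
# an assumption; the report gives it without a zone.
stopped = datetime.fromisoformat("2021-11-03T01:56:05-07:00")
decided = datetime.fromisoformat("2021-11-03T01:56:40.657-07:00")

# Seconds between stopping the service and the orchestrator deciding to fail over.
elapsed = (decided - stopped).total_seconds()
print(f"failover decided {elapsed:.3f}s after the service was stopped")
```

Under that assumption the decision comes roughly 35.7s after the stop, which is consistent with the 30s auto-failover timeout.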
On 172.23.107.3 (indexer node), the following can be seen around the same time:
2021-11-03T01:56:39.444-07:00 [Info] AutofailoverServiceManager::HealthCheck: Called
2021-11-03T01:56:39.444-07:00 [Info] AutofailoverServiceManager::HealthCheck: Returning healthInfo: {DiskFailures:0}
2021-11-03T01:56:40.658-07:00 [Info] AutofailoverServiceManager::IsSafe: Called with nodeUUIDs [7d16aca9f67533ebdb0ec4c65e5d0b08]
2021-11-03T01:56:40.660-07:00 [Info] requestHandlerContext::getCachedIndexTopology: Returning 735 IndexStatuses
2021-11-03T01:56:40.672-07:00 [Info] AutofailoverServiceManager::IsSafe: Returning user message: Failing over nodes 172.23.97.217:9102(7d16aca9f67533ebdb0ec4c65e5d0b08) would lose the following indexes/partitions: bucket1.scope_3.coll_11.idx6_i15Ef 5
The index in question, bucket1.scope_3.coll_11.idx6_i15Ef, has a replica, so failing over 172.23.97.217 should not lose it. The "unsafe" decision returned by AutofailoverServiceManager::IsSafe therefore does not appear to be correct.
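To make the expectation concrete, here is a minimal sketch of the kind of check an "is it safe to fail over" decision implies. It is not the indexer's actual IsSafe implementation. The function name `lost_on_failover`, the placement map, the second node UUID, and the reading of the trailing `5` as a partition id are all assumptions made for illustration.

```python
from typing import Dict, Iterable, List, Tuple

# (index name, partition id); treating the trailing "5" in the log message
# as a partition id is an assumption.
IndexPartition = Tuple[str, int]


def lost_on_failover(
    placement: Dict[IndexPartition, List[str]],
    failing: Iterable[str],
) -> List[IndexPartition]:
    """Return the index partitions with no copy left outside the failing nodes."""
    failing_set = set(failing)
    lost = []
    for key, hosts in sorted(placement.items()):
        # A partition survives only if at least one copy lives on a node
        # that is not being failed over.
        if not (set(hosts) - failing_set):
            lost.append(key)
    return lost


FAILING = "7d16aca9f67533ebdb0ec4c65e5d0b08"  # node UUID from the logs
OTHER = "hypothetical-other-node-uuid"        # made up for this example
IDX = ("bucket1.scope_3.coll_11.idx6_i15Ef", 5)

# Replica on a different node: failing over the node loses nothing.
well_placed = {IDX: [FAILING, OTHER]}
# Both copies on the failing node: failover would lose the partition.
same_node = {IDX: [FAILING, FAILING]}

print(lost_on_failover(well_placed, [FAILING]))
print(lost_on_failover(same_node, [FAILING]))
```

With a replica on a separate node the check reports nothing lost, which is the outcome expected here. The second case, where both copies sit on the node being failed over, is the placement described in the title of the linked MB-49356, and it is the only case in this sketch where an "unsafe" answer would be justified.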
Issue Links
- is a backport of: MB-49356 [System Test] Multiple index replicas incorrectly placed on same node (fka Autofailover deemed unsafe) (Closed)
For Gerrit Dashboard: MB-50655

# | Subject | Branch | Project | Status | CR | V |
---|---|---|---|---|---|---|
173330,5 | MB-50655: Multiple index replicas incorrectly placed on same node (fka Autofailover deemed unsafe) | cheshire-cat | indexing | MERGED | +2 | +1 |