  1. Couchbase Server
  2. MB-49356

[System Test] Multiple index replicas incorrectly placed on same node (fka Autofailover deemed unsafe)


Details

    Description

      Build : 7.1.0-1623
      Test : -test tests/2i/neo/test_neo_idx_clusterops_recovery.yml -scope tests/2i/neo/scope_neo_plasma_idx_dgm.yml
      Scale : 2
      Iteration : 1st

      The N1QL/GSI component system test now has a step to trigger auto failover for an indexer node.

      Auto-failover is configured with a 30s timeout. The Couchbase service was stopped on 172.23.97.217 at 2021-11-03T01:56:05. The orchestrator node is 172.23.97.215; its debug log shows the following:

      =========================NOTICE REPORT=========================
      {net_kernel,{net_kernel,1157,nodedown,'ns_1@172.23.97.217'}}
      [ns_server:debug,2021-11-03T01:56:40.657-07:00,ns_1@172.23.97.215:<0.8743.0>:auto_failover_logic:log_master_activity:145]Incremented down state:
      {node_state,{'ns_1@172.23.97.217',<<"7d16aca9f67533ebdb0ec4c65e5d0b08">>},
                  1,nearly_down,false}
      ->{node_state,{'ns_1@172.23.97.217',<<"7d16aca9f67533ebdb0ec4c65e5d0b08">>},
                    1,failover,false}
      [ns_server:debug,2021-11-03T01:56:40.657-07:00,ns_1@172.23.97.215:<0.8743.0>:auto_failover_logic:process_frame:324]Decided on following actions: [{failover,
                                      [{'ns_1@172.23.97.217',
                                        <<"7d16aca9f67533ebdb0ec4c65e5d0b08">>}]}]
      [user:info,2021-11-03T01:56:40.673-07:00,ns_1@172.23.97.215:<0.8743.0>:auto_failover:log_unsafe_node:633]Could not automatically fail over node ('ns_1@172.23.97.217') due to operation being unsafe for service index. Failing over nodes 172.23.97.217:9102(7d16aca9f67533ebdb0ec4c65e5d0b08) would lose the following indexes/partitions: bucket1.scope_3.coll_11.idx6_i15Ef 5
      [error_logger:info,2021-11-03T01:56:40.724-07:00,ns_1@172.23.97.215:net_kernel<0.8371.0>:ale_error_logger_handler:do_log:101]
      

      On 172.23.107.3 (an indexer node), the following appears around the same time:

      2021-11-03T01:56:39.444-07:00 [Info] AutofailoverServiceManager::HealthCheck: Called
      2021-11-03T01:56:39.444-07:00 [Info] AutofailoverServiceManager::HealthCheck: Returning healthInfo: {DiskFailures:0}
      2021-11-03T01:56:40.658-07:00 [Info] AutofailoverServiceManager::IsSafe: Called with nodeUUIDs [7d16aca9f67533ebdb0ec4c65e5d0b08]
      2021-11-03T01:56:40.660-07:00 [Info] requestHandlerContext::getCachedIndexTopology: Returning 735 IndexStatuses
      2021-11-03T01:56:40.672-07:00 [Info] AutofailoverServiceManager::IsSafe: Returning user message: Failing over nodes 172.23.97.217:9102(7d16aca9f67533ebdb0ec4c65e5d0b08) would lose the following indexes/partitions: bucket1.scope_3.coll_11.idx6_i15Ef 5
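
When triaging logs like the ones above, it can help to pull the affected node and index out of the "would lose the following indexes/partitions" message. The sketch below is a hypothetical helper, not part of ns_server or the indexer; the regular expressions are inferred only from the single message shown in this report, and the meaning of the trailing token (`5`, presumably a partition ID) is an assumption, so the tokens are returned uninterpreted.

```python
import re

# Patterns inferred from the one message in this report (assumption:
# other builds or multi-node messages may be formatted differently).
MSG_RE = re.compile(
    r"Failing over nodes (?P<nodes>.+?) would lose the following "
    r"indexes/partitions: (?P<lost>.+)$"
)
NODE_RE = re.compile(r"(?P<host>[^\s(]+)\((?P<uuid>[0-9a-f]+)\)")


def parse_unsafe_message(line):
    """Return the nodes and raw lost-index tokens, or None if no match."""
    m = MSG_RE.search(line)
    if m is None:
        return None
    nodes = [(n.group("host"), n.group("uuid"))
             for n in NODE_RE.finditer(m.group("nodes"))]
    # Tokens are left uninterpreted: the exact index/partition layout of
    # this field is not documented in the report.
    return {"nodes": nodes, "lost": m.group("lost").split()}


line = ("2021-11-03T01:56:40.672-07:00 [Info] AutofailoverServiceManager::IsSafe: "
        "Returning user message: Failing over nodes "
        "172.23.97.217:9102(7d16aca9f67533ebdb0ec4c65e5d0b08) would lose the "
        "following indexes/partitions: bucket1.scope_3.coll_11.idx6_i15Ef 5")
result = parse_unsafe_message(line)
```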
      

      The index in question, bucket1.scope_3.coll_11.idx6_i15Ef, has a replica, so the decision taken by AutofailoverServiceManager does not seem to be right.
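
To make the expectation concrete: an index partition should only be reported as lost if every node hosting a copy of it is in the set being failed over. The following is a minimal sketch of that rule, not the indexer's actual IsSafe implementation; the function name, the tuple layout, and the node names are all hypothetical. It also shows how the situation in the issue title (replicas placed on the same node) would make a replicated index look unsafe.

```python
from collections import defaultdict


def lost_on_failover(placement, failing_nodes):
    """Return (index, partition) pairs with no copy outside failing_nodes.

    placement: iterable of (index_name, partition_id, replica_id, node)
    tuples. This layout is an assumption made for illustration only.
    """
    failing = set(failing_nodes)
    hosts = defaultdict(set)
    for name, partition, _replica, node in placement:
        hosts[(name, partition)].add(node)
    # A partition is lost only if all nodes holding it are failing.
    return sorted(key for key, nodes in hosts.items() if nodes <= failing)


# Replicas on different nodes: failing over node_a loses nothing.
spread = [("idx", 5, 0, "node_a"), ("idx", 5, 1, "node_b")]
safe_result = lost_on_failover(spread, ["node_a"])

# Both replicas on the same node: the same failover loses partition 5.
colocated = [("idx", 5, 0, "node_a"), ("idx", 5, 1, "node_a")]
unsafe_result = lost_on_failover(colocated, ["node_a"])
```

Under this rule, a replicated index is reported as lost only when all of its copies sit on the failing node, which is consistent with the replica-placement problem named in the title.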


            People

              kevin.cherkauer Kevin Cherkauer (Inactive)
              mihir.kamdar Mihir Kamdar (Inactive)