Couchbase Server / MB-50655

[BP MB-49356 to 7.0.x][System Test] Multiple index replicas incorrectly placed on same node (fka Autofailover deemed unsafe)


Details

    Description

      Build : 7.1.0-1623
      Test : -test tests/2i/neo/test_neo_idx_clusterops_recovery.yml -scope tests/2i/neo/scope_neo_plasma_idx_dgm.yml
      Scale : 2
      Iteration : 1st

      The N1QL/GSI component system test now has a step that triggers auto-failover for an indexer node.

      Auto-failover is configured with a 30s timeout. The Couchbase service was stopped on 172.23.97.217 at 2021-11-03T01:56:05. The orchestrator node is 172.23.97.215, and its debug logs show the following:

      =========================NOTICE REPORT=========================
      {net_kernel,{net_kernel,1157,nodedown,'ns_1@172.23.97.217'}}
      [ns_server:debug,2021-11-03T01:56:40.657-07:00,ns_1@172.23.97.215:<0.8743.0>:auto_failover_logic:log_master_activity:145]Incremented down state:
      {node_state,{'ns_1@172.23.97.217',<<"7d16aca9f67533ebdb0ec4c65e5d0b08">>},
                  1,nearly_down,false}
      ->{node_state,{'ns_1@172.23.97.217',<<"7d16aca9f67533ebdb0ec4c65e5d0b08">>},
                    1,failover,false}
      [ns_server:debug,2021-11-03T01:56:40.657-07:00,ns_1@172.23.97.215:<0.8743.0>:auto_failover_logic:process_frame:324]Decided on following actions: [{failover,
                                      [{'ns_1@172.23.97.217',
                                        <<"7d16aca9f67533ebdb0ec4c65e5d0b08">>}]}]
      [user:info,2021-11-03T01:56:40.673-07:00,ns_1@172.23.97.215:<0.8743.0>:auto_failover:log_unsafe_node:633]Could not automatically fail over node ('ns_1@172.23.97.217') due to operation being unsafe for service index. Failing over nodes 172.23.97.217:9102(7d16aca9f67533ebdb0ec4c65e5d0b08) would lose the following indexes/partitions: bucket1.scope_3.coll_11.idx6_i15Ef 5
      [error_logger:info,2021-11-03T01:56:40.724-07:00,ns_1@172.23.97.215:net_kernel<0.8371.0>:ale_error_logger_handler:do_log:101]
      

      On 172.23.107.3 (indexer node), the following can be seen around the same time:

      2021-11-03T01:56:39.444-07:00 [Info] AutofailoverServiceManager::HealthCheck: Called
      2021-11-03T01:56:39.444-07:00 [Info] AutofailoverServiceManager::HealthCheck: Returning healthInfo: {DiskFailures:0}
      2021-11-03T01:56:40.658-07:00 [Info] AutofailoverServiceManager::IsSafe: Called with nodeUUIDs [7d16aca9f67533ebdb0ec4c65e5d0b08]
      2021-11-03T01:56:40.660-07:00 [Info] requestHandlerContext::getCachedIndexTopology: Returning 735 IndexStatuses
      2021-11-03T01:56:40.672-07:00 [Info] AutofailoverServiceManager::IsSafe: Returning user message: Failing over nodes 172.23.97.217:9102(7d16aca9f67533ebdb0ec4c65e5d0b08) would lose the following indexes/partitions: bucket1.scope_3.coll_11.idx6_i15Ef 5
      

      The index in question, bucket1.scope_3.coll_11.idx6_i15Ef, has a replica, so the decision taken by AutofailoverServiceManager does not seem to be right.
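
One reading that is consistent with both observations (the index has a replica, yet failing over a single node is reported as losing it) is the one named in the issue title: every copy of that index ended up on the same node. The following is a minimal Python sketch of such a safety check, not the indexer's actual IsSafe implementation. The function name, the placement layout, the second index name, and the second node UUID are all hypothetical; only the first UUID and the first index name come from the logs above.

```python
from typing import Dict, Iterable, List


def lost_on_failover(placement: Dict[str, List[str]],
                     failing_nodes: Iterable[str]) -> List[str]:
    """Return index/partition names whose every copy lives on a failing node."""
    failing = set(failing_nodes)
    lost = []
    for name, hosts in placement.items():
        # A copy survives only if at least one host is outside the failing set.
        if hosts and all(h in failing for h in hosts):
            lost.append(name)
    return sorted(lost)


NODE_A = "7d16aca9f67533ebdb0ec4c65e5d0b08"  # node UUID taken from the logs
NODE_B = "hypothetical-second-node-uuid"     # placeholder, not from the logs

# Hypothetical layout: one list entry per copy (original or replica).
placement = {
    "bucket1.scope_3.coll_11.idx6_i15Ef": [NODE_A, NODE_A],  # copies co-located
    "bucket1.scope_3.coll_11.other_idx": [NODE_A, NODE_B],   # hypothetical index, spread out
}

unsafe = lost_on_failover(placement, [NODE_A])
print(unsafe)
```

Under this assumed layout, the replicated index is still reported as lost because its replica gives no protection when it shares a node with the original.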


          Activity

            kevin.cherkauer Kevin Cherkauer (Inactive) added a comment (edited):

            MB-49356 Part 6 (https://review.couchbase.org/c/indexing/+/169637) needs to be backported to 7.0.x; otherwise Planner can violate HA constraints and put multiple replicas onto the same Index node. The problem was introduced in 7.0.0 by MB-42220 and partially fixed by MB-44311, but that fix missed a couple of cases that will first be fixed in 7.1.0 by Part 6 of MB-49356.

            FYI Jeelan Poola, Deepkaran Salooja
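
The HA constraint described in this comment (no two replicas of an index on the same Index node) can be checked against a proposed layout. The sketch below is a hedged illustration in Python, not Planner code: the function name, the layout format, and the index and node names are all hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple


def find_colocated_replicas(layout: Dict[str, List[str]]) -> List[Tuple[str, str, int]]:
    """Return (index, node, copies) for every node holding more than one copy of an index."""
    violations = []
    for index_name, nodes in layout.items():
        # Count how many copies of this index each node holds.
        for node, copies in Counter(nodes).items():
            if copies > 1:
                violations.append((index_name, node, copies))
    return sorted(violations)


# Hypothetical layout: one list entry per copy of the index.
layout = {
    "idx_a": ["node1", "node1", "node2"],  # violates the constraint on node1
    "idx_b": ["node1", "node2"],           # satisfies the constraint
}

violations = find_colocated_replicas(layout)
print(violations)
```

A check of this kind could flag the bad placement directly, without relying on an Autofailover safety check to surface it.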

            yogendra.acharya Yogendra Acharya (Inactive) added a comment:

            Merged code to the cheshire-cat branch; the next build should have the fix.

            build-team Couchbase Build Team added a comment:

            Build couchbase-server-7.0.4-7241 contains indexing commit 5db98b0 with commit message:
            MB-50655: Multiple index replicas incorrectly placed on same node (fka Autofailover deemed unsafe)

            mihir.kamdar Mihir Kamdar (Inactive) added a comment:

            Amit Kulkarni, Yogendra Acharya, Kevin Cherkauer: how should this be verified in 7.0.4 in the absence of Autofailover? Any pointers?

            kevin.cherkauer Kevin Cherkauer (Inactive) added a comment:

            Mihir Kamdar: The test for which the original version of this bug, MB-49356, was opened used to reproduce it quite often, but the problem is detected in Autofailover code, so pre-Autofailover versions of the code won't detect it. Thus I think the backport needs to be verified as accurate by code inspection. It is a very trivial fix in the code.

            kevin.cherkauer Kevin Cherkauer (Inactive) added a comment:

            Per Mihir Kamdar's request for quick action, I have verified the backport is accurate using Gerrit diffs. I also verified there were no other uses of indexer.NodeUUID in genTransferToken() in the earlier codebase. Hence closing this issue.

            mihir.kamdar Mihir Kamdar (Inactive) added a comment:

            Thanks, Kevin. Appreciate the quick turnaround!

            People

              yogendra.acharya Yogendra Acharya (Inactive)
              kevin.cherkauer Kevin Cherkauer (Inactive)