Couchbase Server / MB-49079

CBSE: Do not fail Rebalance for merge replica counter failure

Details

    Description

      [Split out of MB-47873 per the 2021-10-20 GSI scrum. Originally opened by Varun Velamuri from CBSE-10499.
      MB-47874 was also created for the same problem from the same CBSE, aiming to find and fix the root cause, while the current MB aims to make Rebalance not fail even if this bug is hit.]

      If merging the replica counter fails, the indexer can skip replica repair for the index definition on which the merge failed, rather than failing the rebalance.

      From CBSE-10499 description, which is the source of the original error:

      Background and Analysis:

       Analytics nodes were added on `2021-07-27`, and Rebalance has been failing since then:

      2021-07-27T10:08:18.443000+0100 10.114.141.4 added node 10.114.141.16
       
      2021-07-27T10:14:59.570000+0100 10.114.141.4 added node 10.114.141.17
       

       Rebalance is failing with the error below:

      [ns_server:error,2021-07-27T13:19:53.907+01:00,ns_1@10.114.141.10:service_status_keeper-index<0.725.0>:service_status_keeper:handle_cast:119]Service service_index returned incorrect status
      [ns_server:error,2021-07-27T13:22:33.398+01:00,ns_1@10.114.141.10:service_agent-index<0.669.0>:service_agent:handle_info:287]Rebalancer <13523.17251.1> died unexpectedly: {worker_died, {'EXIT',<13523.17280.1>, {rebalance_failed, {service_error, <<"Unable to read index layout from cluster 127.0.0.1:8091. err = Cannot merge counter with different base values">>}}}} ..
       

          Activity

            kevin.cherkauer Kevin Cherkauer added a comment - - edited

            This is a Planner problem, not a Rebalance problem. The error "Unable to read index layout from cluster" is only logged by planner/executor.go, in 6 different places that cannot be told apart from the message alone:

            1. ExecuteRebalanceInternal()
            2. ExecutePlan()
            3. FindIndexReplicaNodes()
            4. ExecuteReplicaRepair()
            5. ExecuteReplicaDrop()
            6. ExecuteRetrieve()

            (Also logged by cmd/cbindexplan/main.go main() but that is probably not relevant here.)
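One way to make the six call sites distinguishable is to include the calling function's name when the error is wrapped. The following is a minimal sketch under that assumption only; `wrapLayoutErr` and `errMerge` are hypothetical names that do not come from planner/executor.go, and this is not necessarily how the committed change improves the message.

```go
package main

import (
	"errors"
	"fmt"
)

// errMerge reuses the message text quoted in this ticket (hypothetical variable).
var errMerge = errors.New("Cannot merge counter with different base values")

// wrapLayoutErr is a hypothetical helper: it prefixes the existing message
// with the name of the caller so that log lines identify their origin.
func wrapLayoutErr(caller, clusterURL string, err error) error {
	if err == nil {
		return nil
	}
	return fmt.Errorf("%s: Unable to read index layout from cluster %s. err = %w",
		caller, clusterURL, err)
}
```

Calling `wrapLayoutErr("ExecuteRebalanceInternal", "127.0.0.1:8091", errMerge)` yields a message that names the originating function, and `errors.Is` still matches the underlying error because `%w` preserves it.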

            The relevant code path for Rebalance is

            rebalancer.go – NewRebalancer()
            rebalancer.go – initRebalAsync()
            executor.go – ExecuteRebalance()
            executor.go – ExecuteRebalanceInternal() – wraps original error in message starting "Unable to read index layout from cluster"
            proxy.go – RetrievePlanFromCluster() – reports error "Cannot merge counter with different base values"
            proxy.go – getIndexNumReplica()
            counter.go – MergeWith() – generates "Cannot merge..." error message
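The way the two messages combine along this path can be illustrated with a small sketch. The type `replicaCounter`, its fields, the "keep the larger value" merge rule, and the function `readLayout` are all assumptions made for illustration; only the two message strings are taken from the log above. The real counter.go and executor.go may differ.

```go
package main

import (
	"errors"
	"fmt"
)

// errBaseMismatch carries the message quoted in the ticket.
var errBaseMismatch = errors.New("Cannot merge counter with different base values")

// replicaCounter is a hypothetical stand-in for the indexer's counter type.
type replicaCounter struct {
	Base uint32 // replica count the counter started from
	Incr uint32 // increments recorded since Base
}

// mergeWith refuses to merge counters whose base values differ.
// The "keep the larger Incr" rule is an assumption for this sketch.
func (c *replicaCounter) mergeWith(other replicaCounter) error {
	if c.Base != other.Base {
		return errBaseMismatch
	}
	if other.Incr > c.Incr {
		c.Incr = other.Incr
	}
	return nil
}

// readLayout merges the counters reported by each node and wraps any
// failure in the outer message, as ExecuteRebalanceInternal is described doing.
func readLayout(clusterURL string, perNode []replicaCounter) (replicaCounter, error) {
	if len(perNode) == 0 {
		return replicaCounter{}, nil
	}
	merged := perNode[0]
	for _, c := range perNode[1:] {
		if err := merged.mergeWith(c); err != nil {
			return replicaCounter{}, fmt.Errorf(
				"Unable to read index layout from cluster %s. err = %w", clusterURL, err)
		}
	}
	return merged, nil
}
```

With two nodes reporting different base values, `readLayout` returns the combined message seen in the ns_server log, which is why a single inconsistent index is enough to fail the whole planning step.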

            kevin.cherkauer Kevin Cherkauer added a comment - - edited

            This MB is to make Rebalance not fail if this bug (which is really MB-47874) is hit during Rebalance, since the root cause of MB-47874 has not been found. It also aims to improve the error message logged when the bug is hit.

            The underlying problem is that different nodes disagree on the number of replicas an index has because that information is inconsistent in metakv. The inconsistent information causes Planner to error out, which means Rebalance also errors out since the planning step failed.

            Since we don't yet have a fix for the generation of the inconsistency, in my discussions with Deepkaran Salooja we decided to remove the index whose metadata is inconsistent from the initial index layout used as Planner input. Then Planner should run. However, this will make the index invisible to Planner so the plan it generates may overload some nodes (especially if that index is large and/or very busy). This is deemed better than failing the Rebalance.
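The workaround described above can be sketched as a filtering step applied before the layout is handed to Planner. This is an illustration under stated assumptions, not the committed change: `indexReport`, its fields, and `splitConsistent` are hypothetical names, and the consistency check here compares only a single per-node replica base value.

```go
package main

import "sort"

// indexReport is a hypothetical record of what one node says about one index.
type indexReport struct {
	Node        string
	Index       string
	ReplicaBase uint32 // replica counter base value as seen by this node
}

// splitConsistent returns the indexes whose reports agree on ReplicaBase
// (keep) and those whose reports disagree (omit). Only "keep" would be
// passed on as Planner input; "omit" would be logged and skipped.
func splitConsistent(reports []indexReport) (keep, omit []string) {
	bases := map[string]uint32{}
	bad := map[string]bool{}
	for _, r := range reports {
		b, seen := bases[r.Index]
		if !seen {
			bases[r.Index] = r.ReplicaBase
		} else if b != r.ReplicaBase {
			bad[r.Index] = true
		}
	}
	for name := range bases {
		if bad[name] {
			omit = append(omit, name)
		} else {
			keep = append(keep, name)
		}
	}
	// Sort for deterministic output, since map iteration order is random.
	sort.Strings(keep)
	sort.Strings(omit)
	return keep, omit
}
```

The trade-off noted in the comment applies directly to this sketch: anything in `omit` contributes no load to the plan, so a large or busy omitted index can leave its host nodes heavier than Planner believes.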


            build-team Couchbase Build Team added a comment

            Build couchbase-server-7.1.0-1905 contains indexing commit 12a78d6 with commit message:
            MB-49079 (7.1.0 1861) Rebalance: Omit inconsistent indexes from Planner

            mihir.kamdar Mihir Kamdar (Inactive) added a comment

            Kevin Cherkauer how should this be tested functionally?

            kevin.cherkauer Kevin Cherkauer added a comment

            Mihir Kamdar I do not see a way to test it functionally as it requires a rare bug to manifest first (which might have already been fixed).

            Dev verifying as requested based on code inspection.

            People

              kevin.cherkauer Kevin Cherkauer
              kevin.cherkauer Kevin Cherkauer