  1. Couchbase Server
  2. MB-59881

Planner - Rebalance fail : incorrect violation check


Details

    • Untriaged
    • 0
    • Unknown

    Description

      This ticket covers edge cases from the cloned ticket, namely:

      1. Multiple nodes are removed in the same rebalance
      2. An outgoing index also has equivalent indexes present on a node that is being deleted

      Background:

      The incorrect check happens when an indexer node is moving out, i.e. we have non-zero OutIndexes. The rebalance fails when, for any one of the outIndexes, the number of equivalent indexes is <= numLiveNode and there is at least one ServerGroup (SG) with no replica of this index, BUT every node in that SG already hosts an equivalent index.

      Why?

      1. As numEquivIndex <= numLiveNode, the suppressEquivIndex check is never set to true
      2. When the index is placed on an SG that already has a replica while at least one SG has no replica, the planner raises ServerGroupViolation or ReplicaViolation. The planner believes the SG with no replica should be the recipient of this index, to maintain HA across ServerGroups
      3. When placement is tried on a node in the SG with no replica, and that node has an equivalent index, the planner raises EquivIndexViolation because suppressEquivIndex is not set to true [as mentioned in pt (1)]

       


      Steps to reproduce:

      Create a cluster with 6 nodes, configured as:
      n0: data + n1ql, n1: index, n2: index, n3: index, n4: index, n5: index

      and define the ServerGroup configuration as:
      Grp1: [n0, n1],  Grp2: [n2],  Grp3: [n3, n4, n5]

      Create index idx1 with 1 replica on nodes n1 & n5, and an equivalent index "idx2" with 0 replicas on node n2.

      Then, on removal of node n1, the planner cannot find any "valid" node in the cluster on which to place idx1, and the rebalance fails with the following error:

      "Rebalance exited with reason {service_rebalance_failed,index,\n                              {worker_died,\n                               {'EXIT',<0.27043.1>,\n                                {task_failed,rebalance,\n                                 {service_error,\n                                  <<\"\\nMemoryQuota: 1048576000\\nCpuQuota: 10\\n--- Violations for index <airport_city_replica_01_05 0, travel-sample, inventory, airport> (mem 185.982K, cpu 0) at node 127.0.0.1:9001 \\n\\tCannot move to 127.0.0.1:9002: EquivIndexViolation (free mem 839.1M, free cpu 9.999316252979229)\\n\\tCannot move to 127.0.0.1:9005: ReplicaViolation (free mem 680.248M, free cpu 9.998483808123252)\\n\\tCannot move to 127.0.0.1:9003: ServerGroupViolation (free mem 848.644M, free cpu 9.99933331111037)\\n\\tCannot move to 127.0.0.1:9004: ServerGroupViolation (free mem 822.427M, free cpu 9.999266642221407)\\n\">>}}}}}. 

      i.e. it is not able to place idx1 on:
      a) Node n3: ServerGroupViolation (as an idx1 replica exists on n5 in the same SG)
      b) Node n4: ServerGroupViolation (as an idx1 replica exists on n5 in the same SG)
      c) Node n5: ReplicaViolation
      d) Node n2: EquivIndexViolation

      The ServerGroupViolation arose because the planner saw that no replica is on Grp2 and hence considered it better to place idx1 on Grp2 for HA. But because the equivalent index check was not suppressed (CountOfEquivIndex < LiveNumNodes in the planner calculation), a violation was raised at Grp2 as well.
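
      The four rejections can be reproduced with a small toy model. This is only a sketch of the reasoning in this ticket, not the planner's code: the function `classify`, its parameters, and the order in which the checks run are assumptions inferred from the error message above. The suppress flag is passed in directly rather than derived from the equivalent index count.

```python
from typing import Dict, Optional, Set


def classify(node: str,
             group_of: Dict[str, str],
             replica_nodes: Set[str],
             equiv_nodes: Set[str],
             suppress_equiv_check: bool) -> Optional[str]:
    """Return the violation raised for placing idx1 on `node`, or None.

    Hypothetical model: the check order is an assumption, not planner code.
    """
    live_groups = set(group_of.values())
    groups_with_replica = {group_of[n] for n in replica_nodes}

    # The node already hosts a replica of the same index.
    if node in replica_nodes:
        return "ReplicaViolation"
    # The node's SG already has a replica while some other SG has none.
    if group_of[node] in groups_with_replica and live_groups - groups_with_replica:
        return "ServerGroupViolation"
    # The node hosts an equivalent index and the check is not suppressed.
    if node in equiv_nodes and not suppress_equiv_check:
        return "EquivIndexViolation"
    return None


# Index nodes left after n1 is removed, with their server groups.
group_of = {"n2": "Grp2", "n3": "Grp3", "n4": "Grp3", "n5": "Grp3"}
replica_nodes = {"n5"}  # surviving replica of idx1
equiv_nodes = {"n2"}    # idx2, equivalent to idx1


def place(suppress_equiv_check: bool) -> Dict[str, Optional[str]]:
    """Classify every remaining index node as a target for idx1."""
    return {
        n: classify(n, group_of, replica_nodes, equiv_nodes, suppress_equiv_check)
        for n in sorted(group_of)
    }


blocked = place(suppress_equiv_check=False)
relaxed = place(suppress_equiv_check=True)
print(blocked)
print(relaxed)
```

      With the check not suppressed, every node in this model is rejected, matching a) to d). With it suppressed, n2 in Grp2 becomes the only acceptable target in the model.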

       

       

            People

              amit.kulkarni Amit Kulkarni
              shivansh.rustagi Shivansh Rustagi