  1. Couchbase Server
  2. MB-55628

Planner - Rebalance fails: incorrect violation check when equivalent and replica indexes exist across multiple Server Groups


Details

    • Untriaged
    • 0
    • Unknown

    Description

      Background:

      The incorrect check happens when an Indexer node is moving out, i.e. we have non-zero OutIndexes. The rebalance failure is hit when, for any one of the out-indexes, the number of equivalent indexes is <= numLiveNode and there is at least one Server Group (SG) with no replica of this index, BUT that Server Group has an equivalent index on all of its nodes.
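
The condition above can be sketched as a small predicate. This is a toy model, not the planner's code: the function name, the dictionary keys (`group`, `has_replica`, `has_equiv`) and the way `num_equiv_index` is counted are all assumptions made for illustration. The ticket does not say how the planner counts equivalent indexes, so the count is passed in explicitly.

```python
from collections import defaultdict


def hits_failure_condition(live_nodes, num_equiv_index):
    """Return True if the layout matches the failure condition described above.

    live_nodes: list of dicts with hypothetical keys
        'group'       - Server Group name
        'has_replica' - node already hosts a replica of the out-index
        'has_equiv'   - node hosts an index equivalent to the out-index
    num_equiv_index: number of equivalent indexes (counting rule assumed).
    """
    # Assumption: the equivalence check is only suppressed when the count
    # exceeds the number of live nodes, so the bug needs count <= live nodes.
    if num_equiv_index > len(live_nodes):
        return False

    groups = defaultdict(list)
    for node in live_nodes:
        groups[node["group"]].append(node)

    # Look for a Server Group with no replica where every node has an
    # equivalent index: the planner prefers it for HA, yet rejects every
    # node in it because of the equivalent index.
    for members in groups.values():
        no_replica = not any(n["has_replica"] for n in members)
        all_equiv = all(n["has_equiv"] for n in members)
        if no_replica and all_equiv:
            return True
    return False


# Illustrative layout of the live indexer nodes once one node is moving out.
live = [
    {"name": "n2", "group": "Grp2", "has_replica": False, "has_equiv": True},
    {"name": "n3", "group": "Grp3", "has_replica": False, "has_equiv": False},
    {"name": "n4", "group": "Grp3", "has_replica": False, "has_equiv": False},
    {"name": "n5", "group": "Grp3", "has_replica": True, "has_equiv": False},
]
print(hits_failure_condition(live, num_equiv_index=1))
```

Passing the count as a parameter keeps the sketch independent of the exact counting rule, which would need to be confirmed against the planner source.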

      Why?

      1. Because numEquivIndex <= numLiveNode, the suppressEquivIndex check is never set to true.
      2. When the index is being placed on an SG that already holds a replica, and at least one SG with no replica is available, the planner raises ServerGroupViolation or ReplicaViolation. The planner believes the SG with no replica should be the recipient of this index, to maintain HA across Server Groups.
      3. When placement is tried on a node in the SG with no replica, and that node has an equivalent index, the planner raises EquivIndexViolation because suppressEquivIndex is not set to true [as mentioned in point (1)].

       


      Steps to reproduce:

      Create a cluster with 6 nodes, configured as:
      n0: data + n1ql, n1: index, n2: index, n3: index, n4: index, n5: index

      and define the Server Group configuration as:
      Grp1: [n0, n1],  Grp2: [n2],  Grp3: [n3, n4, n5]

      Create index idx1 with 1 replica on nodes n1 and n5, and an equivalent index "idx2" with 0 replicas on node n2.

      Then, on removal of node n1, the planner cannot find any "valid" node in the cluster on which to place idx1, and the rebalance fails with the following error:

      "Rebalance exited with reason {service_rebalance_failed,index,\n                              {worker_died,\n                               {'EXIT',<0.27043.1>,\n                                {task_failed,rebalance,\n                                 {service_error,\n                                  <<\"\\nMemoryQuota: 1048576000\\nCpuQuota: 10\\n--- Violations for index <airport_city_replica_01_05 0, travel-sample, inventory, airport> (mem 185.982K, cpu 0) at node 127.0.0.1:9001 \\n\\tCannot move to 127.0.0.1:9002: EquivIndexViolation (free mem 839.1M, free cpu 9.999316252979229)\\n\\tCannot move to 127.0.0.1:9005: ReplicaViolation (free mem 680.248M, free cpu 9.998483808123252)\\n\\tCannot move to 127.0.0.1:9003: ServerGroupViolation (free mem 848.644M, free cpu 9.99933331111037)\\n\\tCannot move to 127.0.0.1:9004: ServerGroupViolation (free mem 822.427M, free cpu 9.999266642221407)\\n\">>}}}}}. 

      i.e. it is not able to place the index on:
      a) Node n3: ServerGroupViolation (a replica of idx1 exists on n5 in the same SG)
      b) Node n4: ServerGroupViolation (a replica of idx1 exists on n5 in the same SG)
      c) Node n5: ReplicaViolation
      d) Node n2: EquivIndexViolation

      The ServerGroupViolation arose because the planner saw that Grp2 had no replica and therefore considered it better to place idx1 on Grp2 for HA. But because the equivalent-index check was not suppressed (CountOfEquivIndex < LiveNumNodes in the planner calculation), a violation was raised at Grp2 as well.
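
The four rejections listed above can be reproduced with a small simulation of the reproduction topology. This is a simplified sketch under stated assumptions, not the planner's implementation: the `violation` function, the order of the checks, and the `suppress_equiv` parameter (standing in for suppressEquivIndex) are hypothetical.

```python
def violation(candidate, nodes, suppress_equiv):
    """Return the violation for placing idx1 on `candidate`, or None if valid.

    The check order here is an assumption made for illustration.
    """
    # A node that already hosts a replica of idx1 cannot take another one.
    if candidate["has_replica"]:
        return "ReplicaViolation"
    # A node hosting an equivalent index is rejected unless suppressed.
    if candidate["has_equiv"] and not suppress_equiv:
        return "EquivIndexViolation"
    # Placing into a group that already has a replica is rejected while
    # some other group still has no replica at all.
    replica_groups = {n["group"] for n in nodes if n["has_replica"]}
    all_groups = {n["group"] for n in nodes}
    if candidate["group"] in replica_groups and all_groups - replica_groups:
        return "ServerGroupViolation"
    return None


# Live indexer nodes after n1 is removed (n0 runs data + n1ql only).
nodes = [
    {"name": "n2", "group": "Grp2", "has_replica": False, "has_equiv": True},
    {"name": "n3", "group": "Grp3", "has_replica": False, "has_equiv": False},
    {"name": "n4", "group": "Grp3", "has_replica": False, "has_equiv": False},
    {"name": "n5", "group": "Grp3", "has_replica": True, "has_equiv": False},
]

# Current behaviour: the equivalence check is not suppressed.
results = {n["name"]: violation(n, nodes, suppress_equiv=False) for n in nodes}
# Same layout with the equivalence check suppressed.
results_suppressed = {
    n["name"]: violation(n, nodes, suppress_equiv=True) for n in nodes
}

for name in results:
    print(name, results[name], results_suppressed[name])
```

In this model every candidate is rejected while the equivalence check is active, and n2 becomes a valid target once it is suppressed, which is consistent with the analysis in this ticket.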

       

            People

              yash.dodderi Yash Dodderi
              shivansh.rustagi Shivansh Rustagi