  1. Couchbase Server
  2. MB-61043

Partition layout skew after failover(s) + rebalance; must not skip following rebalance ops in case of a skew

Details

    • Untriaged
    • 0
    • Yes

    Description

      Because of https://review.couchbase.org/c/cbgt/+/196920, a skewed cluster is not corrected: rebalance exits early when there is no topology change and no actives or replicas are missing. Worse, a skew can even be introduced in certain scenarios.

      Proposal:

      We should update the rebalance early-exit code to also check for partition count skew at the index level, so that a skewed layout does not take the early exit. Separately, we should iron out the other full-rebalance paths where an imbalanced outcome can arise, or persist from an earlier state, as cited in the later comments on the CBSE.
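
A minimal sketch of what such an index-level check could look like, written in Python purely for illustration. It is not cbgt code: the function names, the layout shape (index name mapped to a list of hosting node IDs, one entry per partition) and the "spread greater than 1" definition of skew are all assumptions made for this example.

```python
from collections import Counter


def index_is_skewed(partition_nodes, nodes):
    """Return True if one index's partitions are unevenly spread.

    partition_nodes: hypothetical list of node IDs, one per partition.
    nodes: all nodes eligible to host partitions of the index.
    Assumption: a spread of more than 1 between the busiest and the
    least busy node counts as skew.
    """
    counts = Counter({n: 0 for n in nodes})  # include nodes hosting nothing
    counts.update(partition_nodes)
    if not counts:
        return False
    return max(counts.values()) - min(counts.values()) > 1


def can_exit_early(topology_changed, missing_copies, layout, nodes):
    """Early exit only if nothing changed, nothing is missing,
    and no index is skewed."""
    if topology_changed or missing_copies:
        return False
    return not any(index_is_skewed(p, nodes) for p in layout.values())
```

With three nodes, an index whose four partitions sit on `n1, n1, n1, n2` would block the early exit in this model, while `n1, n2, n3, n1` would allow it.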


      [AD]: The goal here is to ensure that every index is evenly distributed in the cluster, so we shouldn't really need to obtain the full picture at the start of the operation on how the partition distribution should look at the end of the topology change.

      While we focus on obtaining a thorough understanding of how the skew shows up after an in-place upgrade to 7.2.3, let's also investigate whether https://review.couchbase.org/c/cbgt/+/185288 is playing a role here as well. If I'm remembering correctly, that change was added specifically to handle the situation where we have more nodes than the indexes being introduced need, in which case we try to place the partitions of those indexes on nodes where the counts are lower. So let's not simply revert it, but work out how to accommodate everything that needs accommodating.
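
The "place partitions on nodes where the counts are lower" idea can be sketched as a greedy lowest-count-first placement. This is a toy model in Python, not the logic of the referenced change: `place_partitions`, the `node_counts` mapping and the tie-break by node name are all hypothetical.

```python
import heapq


def place_partitions(num_partitions, node_counts):
    """Greedily assign each new partition to the node with the lowest count.

    node_counts: hypothetical mapping of node ID -> partitions already hosted.
    Returns the chosen node for each new partition, in placement order.
    Ties are broken by node ID (an arbitrary choice for this sketch).
    """
    if num_partitions > 0 and not node_counts:
        raise ValueError("no nodes available for placement")
    heap = [(count, node) for node, count in node_counts.items()]
    heapq.heapify(heap)
    placement = []
    for _ in range(num_partitions):
        count, node = heapq.heappop(heap)  # least-loaded node so far
        placement.append(node)
        heapq.heappush(heap, (count + 1, node))
    return placement
```

In this toy model, placing three partitions with existing counts `{"n1": 4, "n2": 4, "n3": 0, "n4": 1}` puts two of them on `n3` and one on `n4`. Balancing on total counts alone can therefore concentrate a single index on a few nodes, which is one reason a per-index check is worth keeping alongside it.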

            People

              sarthak.dua Sarthak Dua
              mohd.shaadkhan Shaad Khan