Details
- Bug
- Resolution: Fixed
- Critical
- 7.6.0, 7.2.3
- Untriaged
- 0
- Yes
Description
Because of https://review.couchbase.org/c/cbgt/+/196920 , a skewed cluster won't be corrected: we exit rebalance early when there is no topology change and there are no missing actives or replicas. Worse, in certain scenarios rebalance can even introduce a skew.
Proposal:
We should update the rebalance early-exit code to check for partition count skew at the index level. Separately, we should iron out the other full-rebalance paths where an imbalanced outcome can occur, or persist from before, as cited in the later comments on the CBSE.
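As a rough illustration of the proposed check, here is a minimal sketch in Go. It is not cbgt's code: the `indexPartitionCounts` type, the function names, and the "max minus min greater than one" definition of skew are all assumptions made for this example. It also treats a single partition state at a time (for example actives only); a real check would presumably repeat it per state.

```go
package main

import "sort"

// indexPartitionCounts maps index name -> node ID -> number of partitions of
// that index assigned to the node. Hypothetical shape for illustration only;
// cbgt derives this information from its own plan structures.
type indexPartitionCounts map[string]map[string]int

// skewedIndexes returns the (sorted) names of indexes whose partitions are
// unevenly spread over nodes. Assumption: an index counts as skewed when the
// most and least loaded nodes differ by more than one partition. Nodes that
// hold no partition of an index count as zero.
func skewedIndexes(counts indexPartitionCounts, nodes []string) []string {
	var skewed []string
	if len(nodes) == 0 {
		return skewed
	}
	for index, perNode := range counts {
		lo, hi := perNode[nodes[0]], perNode[nodes[0]]
		for _, n := range nodes[1:] {
			c := perNode[n]
			if c < lo {
				lo = c
			}
			if c > hi {
				hi = c
			}
		}
		if hi-lo > 1 {
			skewed = append(skewed, index)
		}
	}
	sort.Strings(skewed)
	return skewed
}

// canExitRebalanceEarly mirrors the existing early-exit conditions described
// in this ticket (no topology change, no missing actives or replicas) and
// adds the proposed per-index skew check on top.
func canExitRebalanceEarly(topologyChanged, missingActivesOrReplicas bool,
	counts indexPartitionCounts, nodes []string) bool {
	if topologyChanged || missingActivesOrReplicas {
		return false
	}
	return len(skewedIndexes(counts, nodes)) == 0
}
```

With three nodes and an index placed as 4/2/0, `canExitRebalanceEarly(false, false, ...)` returns false, so rebalance would proceed and get a chance to correct the skew.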
[AD]: The goal here is to ensure that every index is evenly distributed in the cluster, so we shouldn't really need to obtain, at the start of the operation, the full picture of how the partition distribution should look at the end of the topology change.
While we focus on obtaining a thorough understanding of how the skew shows up after an in-place upgrade to 7.2.3, let's also investigate whether https://review.couchbase.org/c/cbgt/+/185288 is playing a role here as well. If I'm remembering correctly, this change was added specifically to accommodate the situation where we have more nodes than needed for the indexes being introduced, in which case we try to place partitions for these indexes on nodes where the counts are lower. So let's not simply revert this, but build on it so that we accommodate everything that needs accommodating.
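To make the "nodes where the counts are lower" behaviour concrete, the following Go sketch orders candidate nodes by their current partition count and picks the least loaded ones. This is only an illustration of the idea as described in the comment above, not the logic of the referenced change; `pickLeastLoadedNodes` and its parameters are hypothetical names.

```go
package main

import "sort"

// pickLeastLoadedNodes returns up to k nodes ordered by ascending partition
// count, breaking ties by node name so the result is deterministic.
// totalCounts maps node ID -> partitions currently hosted; nodes missing from
// the map count as zero. Hypothetical helper for illustration only.
func pickLeastLoadedNodes(totalCounts map[string]int, nodes []string, k int) []string {
	// Copy so the caller's slice is not reordered.
	candidates := append([]string(nil), nodes...)
	sort.Slice(candidates, func(i, j int) bool {
		ci, cj := totalCounts[candidates[i]], totalCounts[candidates[j]]
		if ci != cj {
			return ci < cj
		}
		return candidates[i] < candidates[j]
	})
	if k < 0 {
		k = 0
	}
	if k > len(candidates) {
		k = len(candidates)
	}
	return candidates[:k]
}
```

A placement like this only looks at total counts per node, which is one way a per-index skew could go unnoticed; that is consistent with the proposal to check balance at the index level.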
Issue Links
- is duplicated by
  - MB-61281 The source partitions are not evenly distributed among nodes after rebalance (Closed)
- relates to
  - MB-58450 Failure to rebuild partitions for search indexes after node failover (Closed)
  - MB-59054 Don't skip rebalance if there is a container change for any fts node. (Closed)
  - MB-61185 [trinity] Incorrect computation of nodes to add and remove list (Closed)
  - MB-61981 Ensure even partition-node assignment after failover-recovery (Closed)
  - MB-62075 [Backport] Incorrect computation of nodes to add and remove list (Closed)
  - MB-60727 FTS re-ingest data on dropping replicas from 1 -> 0 (Closed)
- links to