Details
- Type: Bug
- Resolution: Fixed
- Priority: Critical
- Affects Version: 7.6.0
- Triage: Untriaged
- 0
- Yes
Description
While writing new tests, I see that dropped replicas are not rebuilt on the existing indexer nodes in the cluster when running a rebalance (this applies to both DCP and shard-based rebalance). Logs attached.
Steps to recreate:
- create a cluster with nodes n0 (kv + query), n1 (index), n2 (index)
- create "x" (where x > 1) partitioned indices with 1 replica each
- drop at least 1 replica from each index via an alter index command
- run a rebalance
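The index-creation and replica-drop steps above can be sketched as the N1QL statements the repro issues. This is a minimal sketch; `reproStatements`, the `idx_N` index names, and the `fieldN` columns are illustrative placeholders, not names from the actual test.

```go
package main

import "fmt"

// reproStatements builds the N1QL for the repro: a partitioned index with
// one replica, followed by an alter index that drops that replica.
// Index and field names here are hypothetical placeholders.
func reproStatements(bucket string, numIndices int) []string {
	stmts := []string{}
	for i := 0; i < numIndices; i++ {
		name := fmt.Sprintf("idx_%d", i)
		// step 2: partitioned index with 1 replica
		stmts = append(stmts, fmt.Sprintf(
			"create index %s on `%s`(field%d) partition by hash(meta().id) "+
				"with {\"num_partition\":5, \"num_replica\":1}", name, bucket, i))
		// step 3: drop one replica via alter index
		stmts = append(stmts, fmt.Sprintf(
			"alter index %s on `%s` with {\"action\":\"drop_replica\", \"replicaId\":1}",
			name, bucket))
	}
	return stmts
}

func main() {
	for _, s := range reproStatements("default", 2) {
		fmt.Println(s)
	}
}
```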
The expectation is that the rebalance will recreate the dropped replicas, but that does not happen. Instead I see planner errors like the ones below, and the final layout does not contain the dropped replicas:
// node :9001 is in the cluster, node :9002 coming in, node :9003 going out

2024-01-01T05:40:58.105+00:00 [Info] Planner::Fail to create plan satisfying constraint. Re-planning. Num of Try=5. Elapsed Time=352us, err:
MemoryQuota: 1572864000
CpuQuota: 6
--- Violations for index <TestReplicaRepairInMixedModeRebalance_5PTN_1RP__id_balance 5 (replica 1), default, _default, _default> (mem 1.95297M, cpu 0) at node 127.0.0.1:9003
    Cannot move to 127.0.0.1:9002: ReplicaViolation (free mem 463.377M, free cpu 6)
    Cannot move to 127.0.0.1:9001: ExcludeNodeViolation (free mem 778.731M, free cpu 6)
--- Violations for index <TestReplicaRepairInMixedModeRebalance_5PTN_1RP__id_balance 2 (replica 1), default, _default, _default> (mem 1.69818M, cpu 0) at node 127.0.0.1:9003
    Cannot move to 127.0.0.1:9002: ReplicaViolation (free mem 463.377M, free cpu 6)
    Cannot move to 127.0.0.1:9001: ExcludeNodeViolation (free mem 778.731M, free cpu 6)
--- Violations for index <TestReplicaRepairInMixedModeRebalance_5PTN_1RP__id_balance 1 (replica 1), default, _default, _default> (mem 2.0672M, cpu 0) at node 127.0.0.1:9003
    Cannot move to 127.0.0.1:9002: ReplicaViolation (free mem 463.377M, free cpu 6)
    Cannot move to 127.0.0.1:9001: ExcludeNodeViolation (free mem 778.731M, free cpu 6)
--- Violations for index <TestReplicaRepairInMixedModeRebalance_5PTN_1RP_docid_picture 5, default, _default, _default> (mem 2.44296M, cpu 0) at node 127.0.0.1:9003
    Cannot move to 127.0.0.1:9002: ReplicaViolation (free mem 463.377M, free cpu 6)
    Cannot move to 127.0.0.1:9001: ExcludeNodeViolation (free mem 778.731M, free cpu 6)
--- Violations for index <TestReplicaRepairInMixedModeRebalance_5PTN_1RP_docid_picture 1, default, _default, _default> (mem 2.57815M, cpu 0) at node 127.0.0.1:9003
    Cannot move to 127.0.0.1:9002: ReplicaViolation (free mem 463.377M, free cpu 6)
    Cannot move to 127.0.0.1:9001: ExcludeNodeViolation (free mem 778.731M, free cpu 6)
--- Violations for index <TestReplicaRepairInMixedModeRebalance_5PTN_1RP_docid_picture 3, default, _default, _default> (mem 2.46969M, cpu 0) at node 127.0.0.1:9003
    Cannot move to 127.0.0.1:9002: ReplicaViolation (free mem 463.377M, free cpu 6)
    Cannot move to 127.0.0.1:9001: ExcludeNodeViolation (free mem 778.731M, free cpu 6)
--- Violations for index <TestReplicaRepairInMixedModeRebalance_5PTN_1RP_guid_age 5, default, _default, _default> (mem 2.02009M, cpu 0) at node 127.0.0.1:9003
    Cannot move to 127.0.0.1:9002: ReplicaViolation (free mem 463.377M, free cpu 6)
    Cannot move to 127.0.0.1:9001: ExcludeNodeViolation (free mem 778.731M, free cpu 6)
--- Violations for index <TestReplicaRepairInMixedModeRebalance_5PTN_1RP_guid_age 3, default, _default, _default> (mem 2.03975M, cpu 0) at node 127.0.0.1:9003
    Cannot move to 127.0.0.1:9002: ReplicaViolation (free mem 463.377M, free cpu 6)
    Cannot move to 127.0.0.1:9001: ExcludeNodeViolation (free mem 778.731M, free cpu 6)
--- Violations for index <TestReplicaRepairInMixedModeRebalance_5PTN_1RP_balance_name 2 (replica 1), default, _default, _default> (mem 1.74794M, cpu 0) at node 127.0.0.1:9003
    Cannot move to 127.0.0.1:9002: ReplicaViolation (free mem 463.377M, free cpu 6)
    Cannot move to 127.0.0.1:9001: ExcludeNodeViolation (free mem 778.731M, free cpu 6)
--- Violations for index <TestReplicaRepairInMixedModeRebalance_5PTN_1RP_balance_name 1 (replica 1), default, _default, _default> (mem 1.79035M, cpu 0) at node 127.0.0.1:9003
    Cannot move to 127.0.0.1:9002: ReplicaViolation (free mem 463.377M, free cpu 6)
    Cannot move to 127.0.0.1:9001: ExcludeNodeViolation (free mem 778.731M, free cpu 6)
--- Violations for index <TestReplicaRepairInMixedModeRebalance_5PTN_1RP_picture_gender 5, default, _default, _default> (mem 1.8401M, cpu 0) at node 127.0.0.1:9003
    Cannot move to 127.0.0.1:9002: ReplicaViolation (free mem 463.377M, free cpu 6)
    Cannot move to 127.0.0.1:9001: ExcludeNodeViolation (free mem 778.731M, free cpu 6)
2024-01-01T05:40:58.105+00:00 [Info] Cannot rebuild lost replica due to resource constraint in cluster. Will not rebuild lost replica.
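The log shows why the planner gives up: node :9003 is being ejected, node :9001 is rejected with ExcludeNodeViolation, and node :9002 is rejected with ReplicaViolation because it already hosts the other replica of the same index, leaving no valid target. The following is a toy model of those two checks, not the actual planner code; `candidateTargets` is a hypothetical name.

```go
package main

import "fmt"

// candidateTargets models the two checks visible in the log: an excluded
// node is not a valid target (ExcludeNodeViolation), and neither is a node
// that already hosts a replica of the same index (ReplicaViolation).
func candidateTargets(nodes, excluded, holdingReplica []string) []string {
	blocked := map[string]bool{}
	for _, n := range excluded {
		blocked[n] = true // ExcludeNodeViolation
	}
	for _, n := range holdingReplica {
		blocked[n] = true // ReplicaViolation
	}
	out := []string{}
	for _, n := range nodes {
		if !blocked[n] {
			out = append(out, n)
		}
	}
	return out
}

func main() {
	// :9003 is going out, so only :9001 and :9002 remain as candidates.
	nodes := []string{"127.0.0.1:9001", "127.0.0.1:9002"}
	// :9001 is excluded; :9002 already holds the surviving replica.
	targets := candidateTargets(nodes,
		[]string{"127.0.0.1:9001"}, []string{"127.0.0.1:9002"})
	// An empty result corresponds to "Will not rebuild lost replica".
	fmt.Printf("valid targets for the lost replica: %v\n", targets)
}
```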
New test:

func TestReplicaRepairInMixedModeRebalance(t *testing.T) {
	// t.Skipf("Unstable test")
	skipShardAffinityTests(t)

	resetCluster(t)
	addNodeAndRebalance(clusterconfig.Nodes[3], "index", t)
	clusterutility.SetDataAndIndexQuota(kvaddress, clusterconfig.Username, clusterconfig.Password, "1024", "1024")
	// clusterutility.SetDataAndIndexQuota(kvaddress, clusterconfig.Username, clusterconfig.Password, "1024", SHARD_AFFINITY_INDEXER_QUOTA)

	status := getClusterStatus()
	if len(status) != 3 || !isNodeIndex(status, clusterconfig.Nodes[1]) ||
		!isNodeIndex(status, clusterconfig.Nodes[3]) {
		t.Fatalf("%v Unexpected cluster configuration: %v", t.Name(), status)
	}

	// config - [0: kv n1ql] [1: index] [3: index]
	printClusterConfig(t.Name(), "entry")

	log.Println("*********Setup cluster*********")
	err := secondaryindex.DropAllNonSystemIndexes(clusterconfig.Nodes[1])
	tc.HandleError(err, "Failed to drop all non-system indices")

	log.Printf("********Updating `indexer.settings.enable_shard_affinity`=true with node 3 in simulated mixed mode**********")
	configChanges := map[string]interface{}{
		// "indexer.settings.enable_shard_affinity": true,
		// "indexer.planner.honourNodesInDefn": true,
		// "indexer.thisNodeOnly.ignoreAlternateShardIds": true,
		"indexer.settings.rebalance.redistribute_indexes": true,
	}
	err = secondaryindex.ChangeMultipleIndexerSettings(configChanges, clusterconfig.Username, clusterconfig.Password, clusterconfig.Nodes[3])
	tc.HandleError(err, fmt.Sprintf("Failed to change config %v", configChanges))

	defer func() {
		configChanges := map[string]interface{}{
			"indexer.settings.enable_shard_affinity":          false,
			"indexer.planner.honourNodesInDefn":               false,
			"indexer.settings.rebalance.redistribute_indexes": false,
		}
		err := secondaryindex.ChangeMultipleIndexerSettings(configChanges, clusterconfig.Username, clusterconfig.Password, clusterconfig.Nodes[1])
		tc.HandleError(err, fmt.Sprintf("Failed to change config %v", configChanges))
	}()

	log.Printf("********Create indices**********")
	indices := []string{}
	// create non-deferred partitioned indices
	for field1 := 0; field1 < 6; field1++ {
		fieldName1 := fieldNames[field1%len(fieldNames)]
		fieldName2 := fieldNames[(field1+4)%len(fieldNames)]
		indexName := t.Name() + "_5PTN_1RP_" + fieldName1 + "_" + fieldName2
		n1qlStmt := fmt.Sprintf(
			"create index %v on `%v`(%v, %v) partition by hash(Meta().id) with {\"num_partition\":5, \"num_replica\":1}",
			indexName, BUCKET, fieldName1, fieldName2)
		executeN1qlStmt(n1qlStmt, BUCKET, t.Name(), t)
		indices = append(indices, indexName)
	}
	log.Printf("%v %v indices are now active.", t.Name(), indices)

	performClusterStateValidation(t, true)

	dropIndicesMap := make(map[string]int)

	node1meta, err := getLocalMetaWithRetry(clusterconfig.Nodes[1])
	tc.HandleError(err, "Failed to getLocalMetadata from node 1")

	for _, defn := range node1meta.IndexTopologies[0].Definitions {
		if len(dropIndicesMap) == 3 {
			break
		}
		if _, exists := dropIndicesMap[defn.Name]; !exists {
			// pick the replica ID of the first instance
			dropIndicesMap[defn.Name] = int(defn.Instances[0].ReplicaId)
		}
	}

	node3meta, err := getLocalMetaWithRetry(clusterconfig.Nodes[3])
	tc.HandleError(err, "Failed to getLocalMetadata from node 3")

	log.Printf("********Drop replicas on node 1 and 3**********")
	for _, defn := range node3meta.IndexTopologies[0].Definitions {
		if len(dropIndicesMap) == 6 {
			break
		}
		if _, exists := dropIndicesMap[defn.Name]; !exists {
			// pick the replica ID of the first instance
			dropIndicesMap[defn.Name] = int(defn.Instances[0].ReplicaId)
		}
	}

	for idxName, replicaId := range dropIndicesMap {
		stmt := fmt.Sprintf("alter index %v on %v with {\"action\": \"drop_replica\", \"replicaId\": %v}",
			idxName, BUCKET, replicaId)
		executeN1qlStmt(stmt, BUCKET, t.Name(), t)
		if waitForReplicaDrop(idxName, fmt.Sprintf("%v:%v:%v", BUCKET, "_default", "_default"), replicaId) ||
			waitForReplicaDrop(idxName, BUCKET, replicaId) {
			t.Fatalf("%v couldn't drop index %v replica %v", t.Name(), idxName, replicaId)
		}
	}

	log.Printf("%v dropped the following index:replica %v", t.Name(), dropIndicesMap)

	performClusterStateValidation(t, true)

	log.Printf("********Swap Rebalance node 3 <=> 2**********")
	swapRebalance(t, 2, 3)

	indexStatus, err := getIndexStatusFromIndexer()
	tc.HandleError(err, "Failed to get index status from indexer")
	for _, i := range indexStatus.Status {
		log.Printf("index %v - alternate shard IDs %v", i.Name, i.AlternateShardIds)
	}

	performClusterStateValidation(t, false)
}
Logs - [^norepair.tar]
Attachments
Issue Links
- is a backport of MB-60517 "Dropped replicas not rebuilt in swap rebalance" (Resolved)