Details
- Type: Bug
- Resolution: Fixed
- Priority: Critical
- Affects Version: 7.6.0
- Fix Version: 7.6.0-1980
- Triage: Untriaged
- Story Points: 0
- Is this a Regression?: Unknown
Description
The test does the following:
1. Create a 6-node cluster with 1 KV node and 5 GSI/Query nodes.
2. Create buckets/scopes/collections.
3. Disable the shard affinity flag (`indexer.settings.enable_shard_affinity` set to False).
4. Create indexes on default and non-default collections.
5. Trigger a rebalance (remove 2 nodes and rebalance them out).
6. Enable the shard affinity flag.
7. Add back the 2 nodes that were removed and trigger another rebalance.
After the second rebalance, it looks like the partitioned indexes have all ended up with 9 partitions, even though they were created with the default of 8 partitions.
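A quick way to confirm the extra partitions is to total up each index's partition map. This is a hedged sketch: the payload shape (`status` / `name` / `partitionMap` keys) is an assumption modelled loosely on the indexer's index-status output, not a confirmed API contract, and the sample data below is fabricated to mimic the failure.

```python
# Sketch: count partitions per index from an index-status-style payload and
# flag any index whose total differs from the expected partition count.
# Field names ("status", "name", "partitionMap") are assumptions.

def find_partition_mismatches(status_payload, expected=8):
    """Return {index_name: actual_partition_count} for mismatched indexes."""
    mismatches = {}
    for idx in status_payload.get("status", []):
        # partitionMap: {node -> [partition ids hosted on that node]}
        total = sum(len(parts) for parts in idx.get("partitionMap", {}).values())
        if total != expected:
            mismatches[idx["name"]] = total
    return mismatches

# Fabricated payload mimicking the bug: 9 partitions instead of 8.
sample = {
    "status": [
        {"name": "hotel_partitioned_index",
         "partitionMap": {"10.113.223.104:8091": [1, 2, 3, 4, 5],
                          "10.113.223.105:8091": [6, 7, 8, 9]}},
        {"name": "healthy_index",
         "partitionMap": {"10.113.223.104:8091": [1, 2, 3, 4],
                          "10.113.223.105:8091": [5, 6, 7, 8]}},
    ]
}
print(find_partition_mismatches(sample))  # {'hotel_partitioned_index': 9}
```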
From the logs, some important timestamps -
Shard affinity flag was reset at
2024-01-11 14:12:34 | INFO | MainProcess | test_thread | [on_prem_rest_client.set_index_settings] {'indexer.settings.enable_shard_affinity': False} set |
Data load happened around this time
2024-01-11 14:13:39 |
Indexes were created around -
2024-01-11 14:13:57
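For context, partitioned indexes like these are created with N1QL's `PARTITION BY HASH(...)` clause, and when no explicit `num_partition` is given the indexer default of 8 applies. A minimal sketch of the kind of statement the test presumably issues (the keyspace and field names here are hypothetical; the `PARTITION BY HASH` and `num_partition` syntax is standard N1QL):

```python
# Sketch: build a N1QL CREATE INDEX statement for a partitioned index.
# Keyspace/field names are hypothetical examples.

def build_partitioned_index_stmt(name, keyspace, field, num_partition=8):
    return (
        f"CREATE INDEX `{name}` ON {keyspace}({field}) "
        f"PARTITION BY HASH(META().id) "
        f'WITH {{"num_partition": {num_partition}}}'
    )

stmt = build_partitioned_index_stmt(
    "hotel_partitioned_index", "`travel-sample`.inventory.hotel", "price")
print(stmt)
```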
First rebalance (rebalancing out 2 nodes) was triggered around this time -
2024-01-11 14:19:01 | INFO | MainProcess | test_thread | [gsi_file_based_rebalance.rebalance_and_validate] Rebalance task triggered. Wait in loop until the rebalance starts |
2024-01-11 14:19:01 | INFO | MainProcess | Cluster_Thread | [on_prem_rest_client.rebalance] rebalance params : {'knownNodes': 'ns_1@10.113.223.101,ns_1@10.113.223.102,ns_1@10.113.223.103,ns_1@10.113.223.104,ns_1@10.113.223.105,ns_1@10.113.223.106', 'ejectedNodes': 'ns_1@10.113.223.102,ns_1@10.113.223.103', 'user': 'Administrator', 'password': 'password'} |
This was successfully completed around -
2024-01-11 14:21:37 | INFO | MainProcess | test_thread | [on_prem_rest_client.rebalance_reached] rebalance reached >100% in 143.30752682685852 seconds |
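The rebalance call above takes comma-separated otpNode lists, as visible in the logged params. A small helper reconstructing those form fields from plain IPs (mirroring the `ns_1@` prefix seen in the log line; this is an illustration, not the test's actual helper):

```python
def build_rebalance_params(known_ips, ejected_ips, prefix="ns_1@"):
    """Build knownNodes/ejectedNodes form fields for POST /controller/rebalance."""
    return {
        "knownNodes": ",".join(prefix + ip for ip in known_ips),
        "ejectedNodes": ",".join(prefix + ip for ip in ejected_ips),
    }

params = build_rebalance_params(
    ["10.113.223.%d" % n for n in range(101, 107)],
    ["10.113.223.102", "10.113.223.103"])
print(params["ejectedNodes"])  # ns_1@10.113.223.102,ns_1@10.113.223.103
```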
Shard affinity flag was enabled at this time -
2024-01-11 14:24:33 | INFO | MainProcess | test_thread | [on_prem_rest_client.set_index_settings] {'indexer.settings.enable_shard_affinity': True} set |
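The flag flip above is a JSON settings update; a sketch of how the payload the test's `set_index_settings` helper sends might be built (the exact endpoint and transport are assumptions, so only the payload construction is shown here):

```python
import json

def build_shard_affinity_payload(enabled):
    """JSON body toggling the indexer's shard affinity setting."""
    return json.dumps({"indexer.settings.enable_shard_affinity": enabled})

payload = build_shard_affinity_payload(True)
print(payload)
# The test's REST client would POST a body like this to the index
# settings endpoint to apply it cluster-wide.
```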
New nodes were added and rebalance was triggered around this time -
2024-01-11 14:26:19 | INFO | MainProcess | test_thread | [gsi_file_based_rebalance.rebalance_and_validate] Rebalance task triggered. Wait in loop until the rebalance starts |
2024-01-11 14:26:19 | INFO | MainProcess | Cluster_Thread | [task.add_nodes] adding node 10.113.223.102:8091 to cluster |
2024-01-11 14:26:19 | INFO | MainProcess | Cluster_Thread | [on_prem_rest_client.add_node] adding remote node @10.113.223.102:18091 to this cluster @10.113.223.101:8091 |
2024-01-11 14:26:29 | INFO | MainProcess | Cluster_Thread | [on_prem_rest_client.monitorRebalance] rebalance progress took 10.05 seconds |
2024-01-11 14:26:29 | INFO | MainProcess | Cluster_Thread | [on_prem_rest_client.monitorRebalance] sleep for 10 seconds after rebalance... |
2024-01-11 14:26:49 | INFO | MainProcess | Cluster_Thread | [task.add_nodes] adding node 10.113.223.103:8091 to cluster |
2024-01-11 14:26:49 | INFO | MainProcess | Cluster_Thread | [on_prem_rest_client.add_node] adding remote node @10.113.223.103:18091 to this cluster @10.113.223.101:8091 |
2024-01-11 14:26:59 | INFO | MainProcess | Cluster_Thread | [on_prem_rest_client.monitorRebalance] rebalance progress took 10.07 seconds |
Rebalance was completed at -
2024-01-11 14:28:12 | INFO | MainProcess | test_thread | [on_prem_rest_client.rebalance_reached] rebalance reached >100% in 45.82607698440552 seconds |
2024-01-11 14:28:26 | INFO | MainProcess | Cluster_Thread | [task.check] Rebalance - status: none, progress: 100.00% |
The item count validation failed at this time -
2024-01-11 14:24:25 | INFO | MainProcess | test_thread | [tuq_helper._find_differences] Diffs {'values_changed': {"root['hotelfb83682d0fcc454b87a964c6a73f845cpartitioned_index']": {'new_value': 2643212, 'old_value': 2640947}, "root['hotelfb83682d0fcc454b87a964c6a73f845cpartitioned_index (replica 1)']": {'new_value': 2637412, 'old_value': 2626385}, "root['hotel937eafe8103f4438a73a1531d2084e06partitioned_index']": {'new_value': 2670010, 'old_value': 2678227}, "root['hotel5239c4ba9eef43f4acd94a30b1889be8partitioned_index']": {'new_value': 2590289, 'old_value': 2583192}, "root['hotel5239c4ba9eef43f4acd94a30b1889be8partitioned_index (replica 2)']": {'new_value': 2607539, 'old_value': 2582524}, "root['hotel012d83107e3a476aa5c43456831dfafdpartitioned_index']": {'new_value': 2625341, 'old_value': 2616588}, "root['hotel937eafe8103f4438a73a1531d2084e06partitioned_index (replica 1)']": {'new_value': 2704019, 'old_value': 2707582}, "root['hotel5239c4ba9eef43f4acd94a30b1889be8partitioned_index (replica 1)']": {'new_value': 2614534, 'old_value': 2582989}, "root['hotel012d83107e3a476aa5c43456831dfafdpartitioned_index (replica 1)']": {'new_value': 2627764, 'old_value': 2619037}}} |
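The diff above is deepdiff-style `values_changed` output. A small helper summarising the per-index item-count drift from such a diff (the `root['<index name>']` key format is assumed from the log excerpt; the sample entry below reuses one index's numbers from it):

```python
import re

def summarize_count_diffs(diff):
    """Map index name -> (new - old) item count delta from a deepdiff-style diff."""
    deltas = {}
    for key, change in diff.get("values_changed", {}).items():
        m = re.match(r"root\['(.+)'\]", key)
        name = m.group(1) if m else key
        deltas[name] = change["new_value"] - change["old_value"]
    return deltas

diff = {"values_changed": {
    "root['hotelfb83682d0fcc454b87a964c6a73f845cpartitioned_index']":
        {"new_value": 2643212, "old_value": 2640947}}}
print(summarize_count_diffs(diff))
# {'hotelfb83682d0fcc454b87a964c6a73f845cpartitioned_index': 2265}
```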
The partitioned indexes all seem to have 9 partitions, and this could be the reason for the item count mismatch. The partitioned indexes in question:
- hotel937eafe8103f4438a73a1531d2084e06partitioned_index
- hotelfb83682d0fcc454b87a964c6a73f845cpartitioned_index
- hotelfb83682d0fcc454b87a964c6a73f845cpartitioned_index (replica 1)
- hotel937eafe8103f4438a73a1531d2084e06partitioned_index (replica 1)
cbcollect ->
s3://cb-customers-secure/extrapartition/2024-01-11/archive.zip
Issue Links
- is duplicated by: MB-60504 [System Test] Rebalance failed while upgrading indexer node (Closed)