Details
Bug
Resolution: Done
Blocker
7.6.0
7.2.3-6705 --> 7.6.0-1767
Untriaged
Linux x86_64
0
No
Description
Steps to Repro
1. Run the following longevity test on 7.2.3 for 4-5 days.
./sequoia -client 172.23.104.254:2375 -provider file:centos_third_cluster.yml -test tests/integration/7.2/test_7.2.yml -scope tests/integration/7.2/scope_7.2_magma.yml -scale 3 -repeat 0 -log_level 0 -version 7.2.3-6705 -skip_setup=false -skip_test=false -skip_teardown=true -skip_cleanup=false -continue=false -collect_on_error=false -stop_on_error=false -duration=604800 -show_topology=true
2. Upgraded all KV nodes using the graceful failover/recovery strategy.
3. Created 5 nodes on 7.6 with the provisioned profile (172.23.108.144, 172.23.97.179, 172.23.104.176, 172.23.97.183, 172.23.121.118), added them to the cluster, removed the 4 index nodes that were part of the 7.2.3 cluster, and started a rebalance. This simulates the upgrade path we would use on Cloud, so that file-based rebalance is used even during the upgrade.
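Step 3 is a swap rebalance: new nodes are added and old index nodes are ejected in a single rebalance. The sketch below shows how the form body for such a rebalance might be built if it were scripted rather than driven from the UI. It assumes the Couchbase REST endpoint POST /controller/rebalance with its knownNodes (every node currently in the cluster, including the ones being removed) and ejectedNodes parameters, and that the new nodes have already been added to the cluster. The helper name rebalance_form and the overlap check are hypothetical.

```python
from urllib.parse import urlencode


def rebalance_form(keep_nodes, eject_nodes):
    """Build a form-encoded body for POST /controller/rebalance (hypothetical helper).

    Assumption: knownNodes lists every node currently in the cluster (nodes
    being kept plus nodes being ejected), and ejectedNodes lists only the
    nodes to remove. Node names use the otpNode form, e.g. 'ns_1@172.23.108.61'.
    """
    # Local sanity check, not part of the REST API: a node cannot be both kept and ejected.
    overlap = set(keep_nodes) & set(eject_nodes)
    if overlap:
        raise ValueError(f"nodes listed as both kept and ejected: {sorted(overlap)}")
    known_nodes = list(keep_nodes) + list(eject_nodes)
    # urlencode percent-encodes '@' and ',' (as %40 and %2C); the server decodes them.
    return urlencode({
        "knownNodes": ",".join(known_nodes),
        "ejectedNodes": ",".join(eject_nodes),
    })


# Usage with a few node names from this report; the body would be POSTed as
# application/x-www-form-urlencoded to the cluster's REST port (8091 by default).
body = rebalance_form(
    keep_nodes=["ns_1@172.23.108.144", "ns_1@172.23.104.176"],
    eject_nodes=["ns_1@172.23.108.61", "ns_1@172.23.106.54"],
)
```

Listing the ejected nodes in knownNodes as well is what lets one request both add-side and remove-side nodes in a single rebalance, which is what the rebalance log in this report shows (KeepNodes plus EjectNodes).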
172.23.108.144 7:32:56 AM 9 Nov, 2023
Starting rebalance, KeepNodes = ['ns_1@172.23.104.176','ns_1@172.23.104.216',
    'ns_1@172.23.104.249','ns_1@172.23.105.134',
    'ns_1@172.23.105.210','ns_1@172.23.105.38',
    'ns_1@172.23.105.39','ns_1@172.23.105.91',
    'ns_1@172.23.106.37','ns_1@172.23.107.142',
    'ns_1@172.23.107.236','ns_1@172.23.107.25',
    'ns_1@172.23.108.129','ns_1@172.23.108.134',
    'ns_1@172.23.108.136','ns_1@172.23.108.138',
    'ns_1@172.23.108.139','ns_1@172.23.108.140',
    'ns_1@172.23.108.141','ns_1@172.23.108.143',
    'ns_1@172.23.108.144','ns_1@172.23.108.145',
    'ns_1@172.23.108.146','ns_1@172.23.108.148',
    'ns_1@172.23.121.118','ns_1@172.23.97.179',
    'ns_1@172.23.97.183'], EjectNodes = ['ns_1@172.23.108.61',
    'ns_1@172.23.108.34',
    'ns_1@172.23.108.132',
    'ns_1@172.23.106.54'], Failed over and being ejected nodes = []; no delta recovery nodes; Operation Id = 9c7f04f312337cd95f93ff81cd2f539b
172.23.108.144 7:33:11 AM 9 Nov, 2023
Rebalance exited with reason {service_rebalance_failed,index,
    {worker_died,
        {'EXIT',<0.18034.2>,
            {{badmatch,
                 {error,
                     {unknown_error,
                         <<"Post \"http://172.23.104.176:9102/registerRebalanceToken\": EOF">>}}},
             [{service_manager,rebalance_op,5,
                  [{file,"src/service_manager.erl"},{line,341}]},
              {service_manager,do_run_op,1,
                  [{file,"src/service_manager.erl"},{line,257}]},
              {proc_lib,init_p,3,
                  [{file,"proc_lib.erl"},{line,225}]}]}}}}.
Rebalance Operation Id = 9c7f04f312337cd95f93ff81cd2f539b
Rebalance continues to fail on repeated retries. I am going to try a few more times; we would have to mark this a blocker if it doesn't progress. cbcollect_info is attached.
Any workarounds would be highly appreciated.
Attachments
Issue Links
- relates to: MB-60435 Add retry mechanism for EOF errors during HTTP calls in Rebalance (Open)
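MB-60435 proposes retrying HTTP calls that fail with EOF during rebalance, which matches the failure in this report: the POST to registerRebalanceToken on 172.23.104.176:9102 returned EOF, which typically means the server closed the connection before sending a response. The following is a minimal, hypothetical Python sketch of that idea (retry with exponential backoff on connection-closed errors). It is not the actual fix, which would live in the Couchbase services themselves; the names post_with_retry and TRANSIENT_ERRORS and the backoff values are assumptions. It also assumes the request is safe to repeat, which would need to be confirmed for registerRebalanceToken before retrying it.

```python
import time

# Assumed set of "peer closed the connection" errors, the Python analogue of the
# EOF in the log. http.client.RemoteDisconnected subclasses ConnectionResetError,
# so it is covered as well.
TRANSIENT_ERRORS = (ConnectionResetError, EOFError)


def post_with_retry(send, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call send() until it succeeds or the attempts run out (hypothetical helper).

    send is any zero-argument callable that performs the HTTP POST and returns
    its response. The delay doubles after each failure: base_delay,
    2 * base_delay, and so on. sleep is injectable so the backoff can be tested.
    """
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    for attempt in range(attempts):
        try:
            return send()
        except TRANSIENT_ERRORS:
            if attempt == attempts - 1:
                raise  # out of retries: surface the original error to the caller
            sleep(base_delay * (2 ** attempt))


# Usage: a fake sender that drops the connection twice, then succeeds.
attempts_seen = []


def flaky_send():
    attempts_seen.append(len(attempts_seen) + 1)
    if len(attempts_seen) < 3:
        raise ConnectionResetError("peer closed connection (EOF)")
    return "ok"


recorded_delays = []
result = post_with_retry(flaky_send, sleep=recorded_delays.append)
```

Only connection-level errors are retried here; an HTTP error status from the server would still surface immediately, since retrying a request the server explicitly rejected is unlikely to help.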