Details
Description
The rebalance operation hangs after a failed-over node is added back. The test fails every time.
http://qa.hq.northscale.net/job/centos_x64--31_01--uniXDCR-P1/1/consoleFull
[Test Case]
./testrunner -i centos_x6431_01-uniXDCR-P1.ini -p GROUP=CHAIN,get-cbcollect-info=True,get-logs=False,stop-on-failure=False -t xdcr.uniXDCR.unidirectional.load_with_failover_then_add_back,items=100000,rdirection=unidirection,ctopology=chain,doc-ops=update-delete,failover=destination,GROUP=CHAIN;P1
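The `-t` argument packs the test path and its parameters (item count, topology, which side is failed over) into one comma-separated string. The sketch below splits it into readable parts. It assumes the `<module.Class.method>,key=value,...` shape seen in this report. `parse_test_spec` is a hypothetical helper, not part of testrunner.

```python
def parse_test_spec(spec: str) -> tuple[str, dict[str, str]]:
    """Split a testrunner-style '-t' value into the test path and its key=value params.

    Hypothetical helper. It assumes the format used in this report:
    '<module.Class.method>,key=value,key=value,...'
    """
    test_name, *pairs = spec.split(",")
    params = {}
    for pair in pairs:
        # Split on the first '=' only, so values such as 'update-delete' stay intact.
        key, _, value = pair.partition("=")
        params[key] = value
    return test_name, params


spec = ("xdcr.uniXDCR.unidirectional.load_with_failover_then_add_back,"
        "items=100000,rdirection=unidirection,ctopology=chain,"
        "doc-ops=update-delete,failover=destination,GROUP=CHAIN;P1")
name, params = parse_test_spec(spec)
print(name)                                 # → xdcr.uniXDCR.unidirectional.load_with_failover_then_add_back
print(params["items"], params["failover"])  # → 100000 destination
```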
[2014-05-09 14:13:27,247] - [uniXDCR:189] INFO - Failing over Destination Non-Master Node 10.3.3.210:8091
[2014-05-09 14:13:28,544] - [task:2229] INFO - Failing over 10.3.3.210:8091
[2014-05-09 14:13:28,971] - [rest_client:1029] INFO - fail_over node ns_1@10.3.3.210 successful
[2014-05-09 14:13:28,973] - [task:2209] INFO - 20 seconds sleep after failover, for nodes to go pending....
[2014-05-09 14:13:48,994] - [uniXDCR:192] INFO - Add back Destination Non-Master Node 10.3.3.210:8091
[2014-05-09 14:13:49,386] - [rest_client:1062] INFO - add_back_node ns_1@10.3.3.210 successful
[2014-05-09 14:13:50,563] - [rest_client:1076] INFO - rebalance params : password=password&ejectedNodes=&user=Administrator&knownNodes=ns_1%4010.3.121.65%2Cns_1%4010.3.3.210%2Cns_1%4010.3.3.209%2Cns_1%4010.3.3.207
[2014-05-09 14:13:50,691] - [rest_client:1080] INFO - rebalance operation started
[2014-05-09 14:13:50,969] - [rest_client:1181] INFO - rebalance percentage : 0 %
[2014-05-09 14:14:01,213] - [rest_client:1181] INFO - rebalance percentage : 14.4747759205 %
[2014-05-09 14:14:11,365] - [rest_client:1181] INFO - rebalance percentage : 14.4747759205 %
[2014-05-09 14:14:21,751] - [rest_client:1181] INFO - rebalance percentage : 14.4747759205 %
[2014-05-09 14:14:32,462] - [rest_client:1181] INFO - rebalance percentage : 14.4747759205 %
[2014-05-09 14:14:42,777] - [rest_client:1181] INFO - rebalance percentage : 14.4747759205 %
[2014-05-09 14:14:53,214] - [rest_client:1181] INFO - rebalance percentage : 14.4747759205 %
[2014-05-09 14:15:04,233] - [rest_client:1181] INFO - rebalance percentage : 14.4747759205 %
[2014-05-09 14:15:15,050] - [rest_client:1181] INFO - rebalance percentage : 14.4747759205 %
[2014-05-09 14:15:25,177] - [rest_client:1181] INFO - rebalance percentage : 14.4747759205 %
[2014-05-09 14:15:35,811] - [rest_client:1181] INFO - rebalance percentage : 14.4747759205 %
[2014-05-09 14:15:46,200] - [rest_client:1181] INFO - rebalance percentage : 14.4747759205 %
[2014-05-09 14:15:56,661] - [rest_client:1181] INFO - rebalance percentage : 14.4747759205 %
[2014-05-09 14:16:07,376] - [rest_client:1181] INFO - rebalance percentage : 14.4747759205 %
[2014-05-09 14:16:17,740] - [rest_client:1181] INFO - rebalance percentage : 14.4747759205 %
[2014-05-09 14:16:28,131] - [rest_client:1181] INFO - rebalance percentage : 14.4747759205 %
[2014-05-09 14:16:38,672] - [rest_client:1181] INFO - rebalance percentage : 14.4747759205 %
[2014-05-09 14:16:48,878] - [rest_client:1181] INFO - rebalance percentage : 14.4747759205 %
[2014-05-09 14:16:59,706] - [rest_client:1181] INFO - rebalance percentage : 14.4747759205 %
[2014-05-09 14:17:10,547] - [rest_client:1181] INFO - rebalance percentage : 14.4747759205 %
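The log above shows rebalance progress frozen at 14.4747759205 % from 14:14:01 to 14:17:10, which is more than three minutes with no movement. The sketch below flags this kind of stall from the log text. It assumes the `rest_client` line format shown above. `find_stall` and its `min_polls` threshold are hypothetical names, not testrunner features.

```python
import re
from datetime import datetime

# Matches the "rebalance percentage" lines printed by rest_client in the log above.
PROGRESS_RE = re.compile(
    r"\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\]"
    r".*rebalance percentage : (?P<pct>[\d.]+) %"
)


def find_stall(lines, min_polls=3):
    """Return (start, end, pct) for the longest run of identical progress values
    spanning at least `min_polls` consecutive polls, or None if there is none."""
    samples = []
    for line in lines:
        m = PROGRESS_RE.search(line)
        if m:
            ts = datetime.strptime(m["ts"], "%Y-%m-%d %H:%M:%S,%f")
            samples.append((ts, float(m["pct"])))

    best = None
    run_start = 0
    for i in range(1, len(samples) + 1):
        # A run ends when the percentage changes or the samples run out.
        if i == len(samples) or samples[i][1] != samples[run_start][1]:
            if i - run_start >= min_polls:
                start, end = samples[run_start][0], samples[i - 1][0]
                if best is None or (end - start) > (best[1] - best[0]):
                    best = (start, end, samples[run_start][1])
            run_start = i
    return best
```

When this runs on the full log above, it reports the 14.47 % plateau. A real harness could abort the rebalance and collect cbcollect_info once the plateau exceeds a timeout, rather than polling indefinitely.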
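[Steps to Reproduce]
1. Set up a 4-node source cluster and a 4-node destination cluster.
2. Load 1M items on the source side.
3. Fail over a non-master node on the destination cluster.
4. Add the node back.
5. Rebalance. -> The rebalance gets stuck (progress stays at 14.4747759205 %, as in the log above). The issue reproduces consistently with build 662.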
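Steps 3 to 5 correspond to three REST calls against the destination cluster. The sketch below builds those calls. It assumes the Couchbase Server endpoints `/controller/failOver`, `/controller/reAddNode` and `/controller/rebalance` on port 8091. These are the paths testrunner's `rest_client` appears to use for this server generation, and they may differ in other versions. `dest-master.example` is a hypothetical host name, because the report does not say which destination node is the master. The requests are only built, not sent, so the rebalance body can be checked against the params logged by `rest_client`.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_request(host: str, path: str, params: dict) -> Request:
    """Build (but do not send) a form-encoded POST to the cluster's REST port."""
    body = urlencode(params).encode("ascii")
    return Request(
        f"http://{host}:8091{path}",
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


def failover_addback_rebalance(master: str, node: str, known_nodes: list[str]) -> list[Request]:
    """Steps 3-5 as REST requests. Endpoint paths are assumptions (see lead-in)."""
    return [
        build_request(master, "/controller/failOver", {"otpNode": node}),
        build_request(master, "/controller/reAddNode", {"otpNode": node}),
        # Field names, order and credentials are copied from the logged rebalance params.
        build_request(master, "/controller/rebalance", {
            "password": "password",
            "ejectedNodes": "",
            "user": "Administrator",
            "knownNodes": ",".join(known_nodes),
        }),
    ]


reqs = failover_addback_rebalance(
    "dest-master.example",  # hypothetical destination master
    "ns_1@10.3.3.210",
    ["ns_1@10.3.121.65", "ns_1@10.3.3.210", "ns_1@10.3.3.209", "ns_1@10.3.3.207"],
)
```

The encoded rebalance body matches the `rebalance params` line in the log byte for byte. The re-added node is still listed in `knownNodes`, and `ejectedNodes` is empty.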
[Note] ->
1. XDCR did not use UPR in this case; only intra-cluster replication used UPR.
2. The issue occurs with a large number of items (1M). The test passes with fewer items (e.g. 1K or 10K).