  1. Couchbase Server
  2. MB-11088

[UPR] Rebalance hang after add back node

Details

    • Bug
    • Resolution: Duplicate
    • Critical
    • 3.0
    • 3.0
    • ns_server
    • Security Level: Public
    • None
    • Build 3.0.0-662-rel

    Description

      The rebalance operation hangs after a failed-over node is added back. The test fails on every run.

      http://qa.hq.northscale.net/job/centos_x64--31_01--uniXDCR-P1/1/consoleFull

      [Test Case]
      ./testrunner -i centos_x6431_01-uniXDCR-P1.ini GROUP=CHAIN,get-cbcollect-info=True,get-logs=False,stop-on-failure=False -t xdcr.uniXDCR.unidirectional.load_with_failover_then_add_back,items=100000,rdirection=unidirection,ctopology=chain,doc-ops=update-delete,failover=destination,GROUP=CHAIN;P1

      [2014-05-09 14:13:27,247] - [uniXDCR:189] INFO - Failing over Destination Non-Master Node 10.3.3.210:8091
      [2014-05-09 14:13:28,544] - [task:2229] INFO - Failing over 10.3.3.210:8091
      [2014-05-09 14:13:28,971] - [rest_client:1029] INFO - fail_over node ns_1@10.3.3.210 successful
      [2014-05-09 14:13:28,973] - [task:2209] INFO - 20 seconds sleep after failover, for nodes to go pending....
      [2014-05-09 14:13:48,994] - [uniXDCR:192] INFO - Add back Destination Non-Master Node 10.3.3.210:8091
      [2014-05-09 14:13:49,386] - [rest_client:1062] INFO - add_back_node ns_1@10.3.3.210 successful
      [2014-05-09 14:13:50,563] - [rest_client:1076] INFO - rebalance params : password=password&ejectedNodes=&user=Administrator&knownNodes=ns_1%4010.3.121.65%2Cns_1%4010.3.3.210%2Cns_1%4010.3.3.209%2Cns_1%4010.3.3.207
      [2014-05-09 14:13:50,691] - [rest_client:1080] INFO - rebalance operation started
      [2014-05-09 14:13:50,969] - [rest_client:1181] INFO - rebalance percentage : 0 %
      [2014-05-09 14:14:01,213] - [rest_client:1181] INFO - rebalance percentage : 14.4747759205 %
      [2014-05-09 14:14:11,365] - [rest_client:1181] INFO - rebalance percentage : 14.4747759205 %
      [2014-05-09 14:14:21,751] - [rest_client:1181] INFO - rebalance percentage : 14.4747759205 %
      [2014-05-09 14:14:32,462] - [rest_client:1181] INFO - rebalance percentage : 14.4747759205 %
      [2014-05-09 14:14:42,777] - [rest_client:1181] INFO - rebalance percentage : 14.4747759205 %
      [2014-05-09 14:14:53,214] - [rest_client:1181] INFO - rebalance percentage : 14.4747759205 %
      [2014-05-09 14:15:04,233] - [rest_client:1181] INFO - rebalance percentage : 14.4747759205 %
      [2014-05-09 14:15:15,050] - [rest_client:1181] INFO - rebalance percentage : 14.4747759205 %
      [2014-05-09 14:15:25,177] - [rest_client:1181] INFO - rebalance percentage : 14.4747759205 %
      [2014-05-09 14:15:35,811] - [rest_client:1181] INFO - rebalance percentage : 14.4747759205 %
      [2014-05-09 14:15:46,200] - [rest_client:1181] INFO - rebalance percentage : 14.4747759205 %
      [2014-05-09 14:15:56,661] - [rest_client:1181] INFO - rebalance percentage : 14.4747759205 %
      [2014-05-09 14:16:07,376] - [rest_client:1181] INFO - rebalance percentage : 14.4747759205 %
      [2014-05-09 14:16:17,740] - [rest_client:1181] INFO - rebalance percentage : 14.4747759205 %
      [2014-05-09 14:16:28,131] - [rest_client:1181] INFO - rebalance percentage : 14.4747759205 %
      [2014-05-09 14:16:38,672] - [rest_client:1181] INFO - rebalance percentage : 14.4747759205 %
      [2014-05-09 14:16:48,878] - [rest_client:1181] INFO - rebalance percentage : 14.4747759205 %
      [2014-05-09 14:16:59,706] - [rest_client:1181] INFO - rebalance percentage : 14.4747759205 %
      [2014-05-09 14:17:10,547] - [rest_client:1181] INFO - rebalance percentage : 14.4747759205 %
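
The stall in the log above can be measured mechanically. The following is a minimal sketch, not part of testrunner: it assumes the log line format shown above, and the helper names `parse_progress` and `stalled_seconds` are hypothetical. It reports how long the most recent rebalance percentage has remained unchanged.

```python
import re
from datetime import datetime

# Matches lines such as:
# [2014-05-09 14:14:01,213] - [rest_client:1181] INFO - rebalance percentage : 14.4747759205 %
PROGRESS_RE = re.compile(
    r"\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d+\]"
    r".*rebalance percentage : (?P<pct>[\d.]+) %"
)


def parse_progress(lines):
    """Return (timestamp, percentage) pairs for every progress line found."""
    samples = []
    for line in lines:
        m = PROGRESS_RE.search(line)
        if m:
            ts = datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S")
            samples.append((ts, float(m.group("pct"))))
    return samples


def stalled_seconds(samples):
    """Seconds for which the latest percentage value has stayed unchanged."""
    if not samples:
        return 0.0
    last_ts, last_pct = samples[-1]
    start = last_ts
    # Walk backwards until the percentage differs from the latest value.
    for ts, pct in reversed(samples):
        if pct != last_pct:
            break
        start = ts
    return (last_ts - start).total_seconds()
```

Applied to the excerpt above, the first 14.4747759205 % sample is at 14:14:01 and the last at 14:17:10, so the sketch reports 189 seconds without progress. Any threshold for declaring a hang is a choice left to the caller.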

      1. Set up a 4-node source cluster and a 4-node destination cluster.
      2. Load 1M items on the source side.
      3. Fail over a non-master node at the destination.
      4. Add the node back.
      5. Rebalance. -> The rebalance gets stuck. The issue is always reproducible with build 662.
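
For reference, the rebalance request in step 5 carries the form parameters that rest_client logs as "rebalance params" above. The sketch below only rebuilds that URL-encoded string from a node list so it can be compared against the log. The function name `rebalance_body` is hypothetical, the default credentials are the ones visible in the log, and nothing is sent to a cluster.

```python
from urllib.parse import urlencode


def rebalance_body(known_nodes, ejected_nodes=(), user="Administrator",
                   password="password"):
    """Build the URL-encoded form body in the same field order as the log."""
    return urlencode({
        "password": password,
        "ejectedNodes": ",".join(ejected_nodes),
        "user": user,
        "knownNodes": ",".join(known_nodes),
    })


# Node names taken from the logged request; the added-back node
# ns_1@10.3.3.210 is listed in knownNodes and nothing is ejected.
body = rebalance_body([
    "ns_1@10.3.121.65",
    "ns_1@10.3.3.210",
    "ns_1@10.3.3.209",
    "ns_1@10.3.3.207",
])
print(body)
```

The empty ejectedNodes field together with the added-back node appearing in knownNodes matches what the test logged immediately before the rebalance started.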

      [Note]
      1. XDCR was not using UPR in this case. Only intra-cluster replication was using UPR.
      2. The issue occurs with a large number of items (1M). The test passes with fewer items, e.g. 1K or 10K.

          People

            mikew Mike Wiederhold [X] (Inactive)
            sangharsh Sangharsh Agarwal