Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-15510

GoXDCR: No checkpointing as a result of incorrect remote node version detection + data loss after failover and add-back

    XMLWordPrintable

Details

    • Bug
    • Resolution: Fixed
    • Blocker
    • 4.0.0
    • 4.0.0
    • XDCR
    • Security Level: Public
    • Untriaged
    • Centos 64-bit
    • No

    Description

      Build


      4.0.0-3321

      Testcase
      --------
      ./testrunner -i INI_FILE.ini -p demand_encryption=1 -t xdcr.biXDCR.bidirectional.load_with_failover_then_add_back,items=10000,ctopology=chain,rdirection=bidirection,standard_buckets=2,expires=60,update=C1-C2,delete=C1-C2,failover=C1,timeout=150

      Steps
      ------
      1. C1 [.45,.46] <---> C2 [.47,.48]
      2. Load 10K keys on all 3 buckets in C1 and C2.
      3. Failover .46 and add back.
      4. Update and delete keys on C1 and C2.
      5. verify keys

      Same test passed in all previous runs.

      .48(.46's remote node) hasn't replicated all its mutations. In addition, it detects .46 as an older version node and refuses to checkpoint.

      [2015-06-30 13:48:35,476] - [xdcrnewbasetests:1829] INFO - Starting failover for nodes:[ip:172.23.106.46 port:8091 ssh_username:root] at C1 cluster 172.23.106.45
      [2015-06-30 13:48:35,990] - [task:2990] INFO - Failing over 172.23.106.46:8091 with graceful=False
      [2015-06-30 13:48:37,164] - [rest_client:1111] INFO - fail_over node ns_1@172.23.106.46 successful
      [2015-06-30 13:48:37,165] - [task:2970] INFO - 0 seconds sleep after failover, for nodes to go pending....
      [2015-06-30 13:48:37,219] - [rest_client:1144] INFO - add_back_node ns_1@172.23.106.46 successful
      [2015-06-30 13:48:38,203] - [rest_client:1166] INFO - rebalance params : password=password&ejectedNodes=&user=Administrator&knownNodes=ns_1%40172.23.106.46%2Cns_1%40172.23.106.45
      [2015-06-30 13:48:38,213] - [rest_client:1170] INFO - rebalance operation started
      [2015-06-30 13:48:38,226] - [rest_client:1288] INFO - rebalance percentage : 0.00 %
      [2015-06-30 13:48:48,247] - [rest_client:1288] INFO - rebalance percentage : 0.00 %
      [2015-06-30 13:48:58,269] - [rest_client:1288] INFO - rebalance percentage : 0.00 %
      [2015-06-30 13:49:08,511] - [rest_client:1288] INFO - rebalance percentage : 5.53 %
      [2015-06-30 13:49:18,532] - [rest_client:1288] INFO - rebalance percentage : 13.89 %
      [2015-06-30 13:49:28,552] - [rest_client:1288] INFO - rebalance percentage : 22.70 %
      [2015-06-30 13:49:38,576] - [rest_client:1288] INFO - rebalance percentage : 27.84 %
      [2015-06-30 13:49:48,607] - [rest_client:1288] INFO - rebalance percentage : 33.22 %
      [2015-06-30 13:49:58,629] - [rest_client:1288] INFO - rebalance percentage : 40.93 %
      [2015-06-30 13:50:08,658] - [rest_client:1288] INFO - rebalance percentage : 48.94 %
      [2015-06-30 13:50:18,683] - [rest_client:1288] INFO - rebalance percentage : 55.77 %
      [2015-06-30 13:50:28,718] - [rest_client:1288] INFO - rebalance percentage : 60.22 %
      [2015-06-30 13:50:38,739] - [rest_client:1288] INFO - rebalance percentage : 65.08 %
      [2015-06-30 13:50:48,764] - [rest_client:1288] INFO - rebalance percentage : 71.25 %
      [2015-06-30 13:50:58,784] - [rest_client:1288] INFO - rebalance percentage : 79.38 %
      [2015-06-30 13:51:08,807] - [rest_client:1288] INFO - rebalance percentage : 87.61 %
      [2015-06-30 13:51:18,830] - [rest_client:1288] INFO - rebalance percentage : 93.68 %
      [2015-06-30 13:51:28,851] - [rest_client:1288] INFO - rebalance percentage : 99.13 %
      [2015-06-30 13:51:38,975] - [task:486] INFO - rebalancing was completed with progress: 100% in 180.761209965 sec
      

      At about the same time, in goxdcr.log on.48(replicating node to .46),

      CheckpointManager 2015-06-30T13:49:35.971-07:00 [INFO] remote bucket is no an older node, no checkpointing should be done.
      CheckpointManager 2015-06-30T13:49:36.088-07:00 [INFO] remote bucket is no an older node, no checkpointing should be done.
      CheckpointManager 2015-06-30T13:49:36.206-07:00 [INFO] remote bucket is no an older node, no checkpointing should be done.
      CheckpointManager 2015-06-30T13:49:36.324-07:00 [INFO] remote bucket is no an older node, no checkpointing should be done.
      CheckpointManager 2015-06-30T13:49:36.441-07:00 [INFO] remote bucket is no an older node, no checkpointing should be done.
      CheckpointManager 2015-06-30T13:49:36.558-07:00 [INFO] remote bucket is no an older node, no checkpointing should be done.
      CheckpointManager 2015-06-30T13:49:36.676-07:00 [INFO] remote bucket is no an older node, no checkpointing should be done.
      :
      :
      :
      CheckpointManager 2015-06-30T14:19:00.705-07:00 [INFO] remote bucket is no an older node, no checkpointing should be done
      

      Blocker since .48 holds some mutations that it hasn't replicated to .46 for over 25 mins -

      StatisticsManager 2015-06-30T13:54:55.027-07:00 [INFO] fe1992e3d03ba1c67fbd747f8fe0506f/standard_bucket_2/standard_bucket_2 total_docs=19000, docs_processed=14500, changes_left=4500
      StatisticsManager 2015-06-30T13:54:56.008-07:00 [INFO] fe1992e3d03ba1c67fbd747f8fe0506f/standard_bucket_2/standard_bucket_2 total_docs=19000, docs_processed=14500, changes_left=4500
      StatisticsManager 2015-06-30T13:54:57.021-07:00 [INFO] fe1992e3d03ba1c67fbd747f8fe0506f/standard_bucket_2/standard_bucket_2 total_docs=19000, docs_processed=14500, changes_left=4500
      StatisticsManager 2015-06-30T13:54:58.008-07:00 [INFO] fe1992e3d03ba1c67fbd747f8fe0506f/standard_bucket_2/standard_bucket_2 total_docs=19000, docs_processed=14500, changes_left=4500
      StatisticsManager 2015-06-30T13:54:59.015-07:00 [INFO] fe1992e3d03ba1c67fbd747f8fe0506f/standard_bucket_2/standard_bucket_2 total_docs=19000, docs_processed=14500, changes_left=4500
      StatisticsManager 2015-06-30T13:55:00.020-07:00 [INFO] fe1992e3d03ba1c67fbd747f8fe0506f/standard_bucket_2/standard_bucket_2 total_docs=19000, docs_processed=14500, changes_left=4500
      StatisticsManager 2015-06-30T13:55:01.012-07:00 [INFO] fe1992e3d03ba1c67fbd747f8fe0506f/standard_bucket_2/standard_bucket_2 total_docs=19000, docs_processed=14500, changes_left=4500
      :
      :
      changes_left=4500
      StatisticsManager 2015-06-30T14:18:58.021-07:00 [INFO] fe1992e3d03ba1c67fbd747f8fe0506f/standard_bucket_2/standard_bucket_2 total_docs=19000, docs_processed=14500, changes_left=4500
      StatisticsManager 2015-06-30T14:18:59.015-07:00 [INFO] fe1992e3d03ba1c67fbd747f8fe0506f/standard_bucket_2/standard_bucket_2 total_docs=19000, docs_processed=14500, changes_left=4500
      StatisticsManager 2015-06-30T14:19:00.043-07:00 [INFO] fe1992e3d03ba1c67fbd747f8fe0506f/standard_bucket_2/standard_bucket_2 total_docs=19000, docs_processed=14500, changes_left=4500
      

      It is possible these two may not be related but from the timing it appears all hell breaks loose after the failover and add-back.

      Attachments

        Issue Links

          No reviews matched the request. Check your Options in the drop-down menu of this sections header.

          Activity

            People

              apiravi Aruna Piravi (Inactive)
              apiravi Aruna Piravi (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Gerrit Reviews

                  There are no open Gerrit changes

                  PagerDuty