Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-11571

XDCR checkpointing: post destination failover/crash, first pre_replicate to new node containing new vb_uuid is successful

    XMLWordPrintable

Details

    • Bug
    • Resolution: Done
    • Critical
    • 3.0
    • 3.0
    • XDCR
    • Security Level: Public
    • None
    • centOS
    • Untriaged
    • No

    Description

      Scenario
      -------------
      -Two 2 node clusters with uni-XDCR on one bucket, checkpointing interval = 60s

      • Load 1 mutation onto vb0 at source after 60s. One checkpoint happens. Do the same again.
      • Now failover the dest node containing vb0. Rebalance out the node from cluster. Dest vb_uuid changes.
      • After 60s insert another key onto source vb0. Now checkpointing will fail with 404 error because it tries to contact the node that was removed.
      • Soon after this source gets the new node details and pre-replicates with the last saved checkpoint on source but pre-replication is successful.

      Expectation
      ----------------
      Source pre_replicates with the last saved checkpoint which contains the old_vbuuid. So pre_replicate must fail with 400 error with target node sharing its new vb_uuid. If the new node telling the source is the only way the latter can know about the vb_uuid change, the first pre_replicate to the new node should fail.

      Test-case to reproduce the problem
      -------------------------------------------------
      ./testrunner -i xdcr.ini -t xdcr.checkpointXDCR.XDCRCheckpointUnitTest.test_failover,failover=destination,rdirection=unidirection,topology=chain,replication_type=xmem

      Test log
      ----------
      2014-06-26 18:33:46 | INFO | MainProcess | test_thread | [checkpointXDCR.failover_activevb0_node] Remote uuid before failover :160056074149491, after failover : 43366546246727

      2014-06-26 18:35:06 | INFO | MainProcess | test_thread | [checkpointXDCR.tearDown] Checkpoints recorded in this run -

      {u'total_docs_checked': 0, u'upr_snapshot_end_seqno': 0, u'upr_snapshot_seqno': 0, u'seqno': 0, u'start_time': u'Fri, 27 Jun 2014 01:26:51 GMT', u'total_data_replicated': 0, u'commitopaque': [160056074149491, 1], u'total_docs_written': 0, u'end_time': u'Fri, 27 Jun 2014 01:28:11 GMT', u'failover_uuid': 0} {u'total_docs_checked': 1, u'upr_snapshot_end_seqno': 1, u'upr_snapshot_seqno': 1, u'seqno': 1, u'start_time': u'Fri, 27 Jun 2014 01:26:51 GMT', u'total_data_replicated': 9, u'commitopaque': [160056074149491, 2], u'total_docs_written': 1, u'end_time': u'Fri, 27 Jun 2014 01:29:25 GMT', u'failover_uuid': 31192447049134}

      Logs
      ------
      1. Pls note there were no _pre_replicate 400 errors seen in couchdb log on the new node.
      2. xdcr_trace.log at source clearly shows how source picks up the last commit opaque seen above and pre_replicates, yet destination node calls it a success and returns new vb_uuid disregarding the vbopaque mismatch -

      {"pid":"<0.3548.0>","type":"gotCommitError","ts":1403832681.669864,"statusCode":404,"loc":"xdc_vbucket_rep_ckpt:perform_commit_for_checkpoint:247"} {"pid":"<0.3548.0>","type":"checkpointFailed","ts":1403832681.669925,"loc":"xdc_vbucket_rep_ckpt:do_checkpoint:143"}

      {"pid":"<0.3548.0>","type":"terminate","ts":1403832681.670493,"reason":"{failed_to_commit_checkpoint,{error,404,\n <<\"

      {\\\"error\\\":\\\"not_found\\\",\\\"reason\\\":\\\"missing\\\"}

      \\n\">>}}","loc":"xdc_vbucket_rep:terminate:482"}

      {"pid":"<0.6387.0>","type":"sendingChanges","ts":1403832681.670814,"length":1,"toWorker":"<0.6388.0>","lastSeq":3,"lastSnapshotStart":2,"lastSnapshotEnd":3,"loc":"xdc_vbucket_rep:changes_manager_loop_open:1059"} {"pid":"<0.6388.0>","type":"gotChanges","ts":1403832681.67225,"reportSeq":3,"snapshotStart":2,"snapshotEnd":3,"count":1,"loc":"xdc_vbucket_rep_worker:queue_fetch_loop:51"} {"pid":"<0.6388.0>","type":"missing","ts":1403832681.672351,"startTS":1403832681.672331,"k":[["pymc2329",3,1]],"loc":"xdc_vbucket_rep_worker:find_missing:238"} {"pid":"<0.6388.0>","type":"foundMissing","ts":1403832681.672437,"count":1,"loc":"xdc_vbucket_rep_worker:queue_fetch_loop:59"} {"pid":"<0.6396.0>","type":"init","ts":1403832681.681136,"repID":"1fd9822e6a7941259155b0e372b95c1f/default/default","vb":0,"parent":"<0.3210.0>","loc":"xdc_vbucket_rep:init:83"} {"pid":"<0.6396.0>","type":"gotInitToken","ts":1403832681.681215,"loc":"xdc_vbucket_rep:handle_info:128"} {"pid":"<0.6396.0>","type":"gotRemoteVBucketDetails","ts":1403832681.683397,"bucket":"default","bucketUUID":"d34accdd7d0065d670eaff1ea84a932c","host":"10.3.4.189","port":11210,"supportsDatatype":false,"haveCert":false,"sslProxyPort":"remote_proxy_port","loc":"xdc_vbucket_rep:init_replication_state:640"}

      {"pid":"<0.6396.0>","type":"gotExistingCheckpoint","ts":1403832681.684562,"body":

      {"commitopaque":[160056074149491,2],"start_time":"Fri, 27 Jun 2014 01:26:51 GMT","end_time":"Fri, 27 Jun 2014 01:29:25 GMT","failover_uuid":31192447049134,"seqno":1,"upr_snapshot_seqno":1,"upr_snapshot_end_seqno":1,"total_docs_checked":1,"total_docs_written":1,"total_data_replicated":9}

      ,"loc":"xdc_vbucket_rep_ckpt:do_parse_validate_checkpoint_doc:344"}

      {"pid":"<0.6396.0>","type":"preReplicateOK","ts":1403832681.690969,"vbopaque":43366546246727,"loc":"xdc_vbucket_rep_ckpt:perform_pre_replicate:270"} {"pid":"<0.6396.0>","type":"initedSeq","ts":1403832681.691391,"seq":1,"snapshotStart":1,"snapshotEnd":1,"failoverUUUID":31192447049134,"remoteVBOpaque":43366546246727,"totalDocsChecked":1,"totalDocsWritten":1,"totalDataReplicated":9,"loc":"xdc_vbucket_rep:init_replication_state:680"}

      Cbcollect attached -

      Source : .186, .187
      Dest : .188, .189 .188 was failed over, .189 is the new node containing vb0.

      Attachments

        No reviews matched the request. Check your Options in the drop-down menu of this sections header.

        Activity

          People

            apiravi Aruna Piravi (Inactive)
            apiravi Aruna Piravi (Inactive)
            Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Gerrit Reviews

                There are no open Gerrit changes

                PagerDuty