Description
Scenario
-------------
- Two 2-node clusters with uni-directional XDCR on one bucket; checkpointing interval = 60s.
- Load 1 mutation onto vb0 at the source. After 60s, one checkpoint is recorded. Repeat once more.
- Failover the destination node containing vb0 and rebalance it out of the cluster. The destination vb_uuid changes.
- After 60s, insert another key into source vb0. Checkpointing now fails with a 404 error because it tries to contact the removed node.
- Soon after, the source learns the new node's details and pre-replicates using the last checkpoint saved at the source, yet pre-replication succeeds.
Expectation
----------------
The source pre-replicates with the last saved checkpoint, which contains the old vb_uuid. So _pre_replicate must fail with a 400 error, with the target node returning its new vb_uuid. If being told by the new node is the only way the source can learn about the vb_uuid change, then the first _pre_replicate to the new node should fail.
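The expected validation can be sketched as follows. This is a minimal illustration of the check described above, not the actual couchdb/ns_server implementation; the function name and return shape are hypothetical, but the commitopaque format ([vb_uuid, checkpoint_seq]) and the uuid values are taken from the test log in this report.

```python
def pre_replicate(commitopaque, current_vb_uuid):
    """Validate a _pre_replicate request against the vbucket's current uuid.

    commitopaque is a [vb_uuid, checkpoint_seq] pair as stored in the
    checkpoint document. Names here are illustrative only.
    """
    old_vb_uuid, _ckpt_seq = commitopaque
    if old_vb_uuid != current_vb_uuid:
        # vbopaque mismatch: the vbucket was failed over and recreated.
        # Return 400 along with the new vb_uuid so the source restarts
        # from a safe point instead of silently resuming.
        return 400, {"vbopaque": current_vb_uuid}
    return 200, {"vbopaque": current_vb_uuid}

# With the values from the log: saved checkpoint carries the pre-failover
# uuid 160056074149491, but the new node's vb_uuid is 43366546246727.
status, body = pre_replicate([160056074149491, 2], 43366546246727)
print(status)  # 400 expected -- the node in this run returned success instead
```

In the failing run, the destination behaves as if the mismatch branch did not exist: it logs preReplicateOK and hands back the new vbopaque.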
Test-case to reproduce the problem
-------------------------------------------------
./testrunner -i xdcr.ini -t xdcr.checkpointXDCR.XDCRCheckpointUnitTest.test_failover,failover=destination,rdirection=unidirection,topology=chain,replication_type=xmem
Test log
----------
2014-06-26 18:33:46 | INFO | MainProcess | test_thread | [checkpointXDCR.failover_activevb0_node] Remote uuid before failover :160056074149491, after failover : 43366546246727
2014-06-26 18:35:06 | INFO | MainProcess | test_thread | [checkpointXDCR.tearDown] Checkpoints recorded in this run -
{u'total_docs_checked': 0, u'upr_snapshot_end_seqno': 0, u'upr_snapshot_seqno': 0, u'seqno': 0, u'start_time': u'Fri, 27 Jun 2014 01:26:51 GMT', u'total_data_replicated': 0, u'commitopaque': [160056074149491, 1], u'total_docs_written': 0, u'end_time': u'Fri, 27 Jun 2014 01:28:11 GMT', u'failover_uuid': 0}
{u'total_docs_checked': 1, u'upr_snapshot_end_seqno': 1, u'upr_snapshot_seqno': 1, u'seqno': 1, u'start_time': u'Fri, 27 Jun 2014 01:26:51 GMT', u'total_data_replicated': 9, u'commitopaque': [160056074149491, 2], u'total_docs_written': 1, u'end_time': u'Fri, 27 Jun 2014 01:29:25 GMT', u'failover_uuid': 31192447049134}

Logs
------
1. Note that no _pre_replicate 400 errors were seen in the couchdb log on the new node.
2. xdcr_trace.log at the source clearly shows the source picking up the last commit opaque seen above and pre-replicating, yet the destination node calls it a success and returns the new vb_uuid, disregarding the vbopaque mismatch -
{"pid":"<0.3548.0>","type":"terminate","ts":1403832681.670493,"reason":"{failed_to_commit_checkpoint,{error,404,\n <<\"{\\\"error\\\":\\\"not_found\\\",\\\"reason\\\":\\\"missing\\\"}\\n\">>}}","loc":"xdc_vbucket_rep:terminate:482"}
{"pid":"<0.6387.0>","type":"sendingChanges","ts":1403832681.670814,"length":1,"toWorker":"<0.6388.0>","lastSeq":3,"lastSnapshotStart":2,"lastSnapshotEnd":3,"loc":"xdc_vbucket_rep:changes_manager_loop_open:1059"}
{"pid":"<0.6388.0>","type":"gotChanges","ts":1403832681.67225,"reportSeq":3,"snapshotStart":2,"snapshotEnd":3,"count":1,"loc":"xdc_vbucket_rep_worker:queue_fetch_loop:51"}
{"pid":"<0.6388.0>","type":"missing","ts":1403832681.672351,"startTS":1403832681.672331,"k":[["pymc2329",3,1]],"loc":"xdc_vbucket_rep_worker:find_missing:238"}
{"pid":"<0.6388.0>","type":"foundMissing","ts":1403832681.672437,"count":1,"loc":"xdc_vbucket_rep_worker:queue_fetch_loop:59"}
{"pid":"<0.6396.0>","type":"init","ts":1403832681.681136,"repID":"1fd9822e6a7941259155b0e372b95c1f/default/default","vb":0,"parent":"<0.3210.0>","loc":"xdc_vbucket_rep:init:83"}
{"pid":"<0.6396.0>","type":"gotInitToken","ts":1403832681.681215,"loc":"xdc_vbucket_rep:handle_info:128"}
{"pid":"<0.6396.0>","type":"gotRemoteVBucketDetails","ts":1403832681.683397,"bucket":"default","bucketUUID":"d34accdd7d0065d670eaff1ea84a932c","host":"10.3.4.189","port":11210,"supportsDatatype":false,"haveCert":false,"sslProxyPort":"remote_proxy_port","loc":"xdc_vbucket_rep:init_replication_state:640"}
{"pid":"<0.6396.0>","type":"gotExistingCheckpoint","ts":1403832681.684562,"body":{"commitopaque":[160056074149491,2],"start_time":"Fri, 27 Jun 2014 01:26:51 GMT","end_time":"Fri, 27 Jun 2014 01:29:25 GMT","failover_uuid":31192447049134,"seqno":1,"upr_snapshot_seqno":1,"upr_snapshot_end_seqno":1,"total_docs_checked":1,"total_docs_written":1,"total_data_replicated":9},"loc":"xdc_vbucket_rep_ckpt:do_parse_validate_checkpoint_doc:344"}
{"pid":"<0.6396.0>","type":"preReplicateOK","ts":1403832681.690969,"vbopaque":43366546246727,"loc":"xdc_vbucket_rep_ckpt:perform_pre_replicate:270"}
{"pid":"<0.6396.0>","type":"initedSeq","ts":1403832681.691391,"seq":1,"snapshotStart":1,"snapshotEnd":1,"failoverUUUID":31192447049134,"remoteVBOpaque":43366546246727,"totalDocsChecked":1,"totalDocsWritten":1,"totalDataReplicated":9,"loc":"xdc_vbucket_rep:init_replication_state:680"}

Cbcollect attached -
Source : .186, .187
Dest : .188, .189 (.188 was failed over; .189 is the new node containing vb0)
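For clarity, the mismatch is visible directly from the data pasted above: the last recorded checkpoint still carries the pre-failover vb_uuid, which no longer matches the remote uuid logged after failover. A quick sanity check using only values from this report:

```python
# Second checkpoint record from the test log (fields trimmed for brevity).
checkpoint = {
    "commitopaque": [160056074149491, 2],
    "seqno": 1,
    "failover_uuid": 31192447049134,
}
# Remote uuid after failover, from the failover_activevb0_node log line.
new_vb_uuid = 43366546246727

old_vb_uuid = checkpoint["commitopaque"][0]
print(old_vb_uuid == new_vb_uuid)  # prints False
```

Since the uuids differ, the destination's preReplicateOK response in xdcr_trace.log is the bug: a 400 with the new vbopaque was expected instead.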