Details
-
Bug
-
Resolution: Fixed
-
Major
-
7.0.0, 7.0.1, 7.0.2, 7.0.3, 7.0.4, 7.1.4, 7.0.5, 7.1.0, 7.1.1, 7.1.2, 7.2.0, 7.1.3, 7.2.1, 7.1.5, 7.2.2
-
Untriaged
-
0
-
No
Description
Test 2 Source side
2023-11-10T19:50:13.418Z INFO GOXDCR.CheckpointMgr: 9a26b5281c0bb68d3565f02dedbe13e3/b1/b2 Received rollback from DCP stream vb=190, rollbackseqno=0
|
2023-11-10T19:50:13.202655+00:00 WARNING 70: (b1) DCP (Producer) eq_dcpq:xdcr:dcp_9a26b5281c0bb68d3565f02dedbe13e3/b1/b2_127.0.0.1:11210_1:eZKvS2DDiE69D9iifDADsw== - (vb:634) Stream request requires rollback to seqno:0 because vBucket UUID not found in failover table, consumer and producer have no common history. Client requested seqnos:{1,18446744073709551615} snapshot:{0,1} uuid:270389697156690
|
We can see that VBUUID did indeed diverged.
2023-11-10T19:50:14.816Z INFO GOXDCR.CheckpointMgr: 9a26b5281c0bb68d3565f02dedbe13e3/b1/b2 Rolled back startSeqno to 0 for vb=36
|
2023-11-10T19:50:14.816Z INFO GOXDCR.CheckpointMgr: 9a26b5281c0bb68d3565f02dedbe13e3/b1/b2 Retry vbts=[vbno=36, uuid=27833190048882, seqno=0, sn_start=0, sn_end=0, srcManId=0, tgtManId=0]
|
…
|
2023-11-10T19:51:43.604Z INFO GOXDCR.StatsMgr: 9a26b5281c0bb68d3565f02dedbe13e3/b1/b2 total_docs=0, docs_processed=0, changes_left=0
|
After the rollback, we indeed see that the source bucket has no document per Ephemeral bucket failover.
After this point, we need to examine the target.
On the target, we see VBUUID changed from prior to failover:
2023-11-10T19:43:56.815Z INFO GOXDCR.CheckpointMgr: high_seqno_and_vbuuid_map=map[0:[2 154426843721660] 1:[3 210277909393215] 2:[1 84684117587385] 3:[1 8290499089518] 4:[1 196910853336884] 5:[0 199740622769747] 6:[3 119872586875387] 7:[4 222892142227865] 8:[4 95022217696648] 9:[3 39012357378059] 10:[1 211788290813736] 11:[1 34673846444505] 12:[0 228661520770186] 13:[1 166863773917663] 14:[3 137415014364077] 15:[2 77241665392177] 16:[0 39318197108351] 17:[1 51085109810178] 18:[2 254941462668812] 19:[3 216767593795803] 20:[3 212991720192003] 21:[4 153567376942254] 22:[2 270555470095135] 23:[0 249286862309245] 24:[2 134403816601229] 25:[0 124606719495893] 26:[4 224351320372821] 27:[3 225902359461913] 28:[3 62391942643225] 29:[2 244965601549711] 30:[0 206364212072488] 31:[1 237853971712258] 32:[1 41949912824453] 33:[1 59002330479586] 34:[3 140832479357498] 35:[2 223724436483744] 36:[5 27833190048882] 37:[3 81878845313773] 38:[0 249417699123833] 39:[2 187641062205983] 40:[2 202554840050771] 41:[2 157974071400220] 42:[3 102706939829461] 43:[5 253384948749306
|
with some connectivity issues in between:
2023-11-10T19:53:56.776Z WARN GOXDCR.CheckpointMgr: MainPipeline cae0a545b06006cf27f38e63a52f3130/b2/b1 Error getting vbucket-seqno stats for serverAddr=172.31.45.240:11210. vbnos=[0 1 2 3 4 …1022 1023], err=write tcp 172.31.44.32:37796->172.31.45.240:11210: write: broken pipe
|
to post failover:
2023-11-10T19:53:57.020Z INFO GOXDCR.CheckpointMgr: high_seqno_and_vbuuid_map=map[0:[0 126957815739762] 1:[0 73656497180216] 2:[0 210882170864119] 3:[0 227669496888839] 4:[0 234892281058358] 5:[0 77855964368367] 6:[0 94592576861185] 7:[0 97404996130780] 8:[0 202473979135397] 9:[0 137334799062834] 10:[0 49715392751982] 11:[0 77039012834310] 12:[0 143315529511933] 13:[0 256444867542785] 14:[0 98071247603780] 15:[0 118524803845585] 16:[0 54603363440149] 17:[0 99076048686569] 18:[0 197290223413343] 19:[0 252515855428545] 20:[0 165203606207565]
|
The odd thing is that this should have triggered a restart but it did not.
Diving in deeper by comparing the logs between test 1 and test 2, I realized that the timing of the situation is different. Test 1 has the situation where checkpoint manager has not done any checkpointing at all, then the failover occurs on the target. For test 2, at least one successful checkpoint operation has occurred, then the failover happens on the target. That’s the difference I can see.
This is most likely due to the fact that no mutation was written, meaning that the through sequence number did not move. So, XDCR checkpoint manager will skip the actual checkpointing operation. When it skips the checkpointing, it also skips the VBUUID check verification. I suspect that if one more mutation was written, it would cause VBUUID detection to occur.
I was able to write a test and verify this theory and reproduce the issue, such that Checkpoint Manager was aware of the target vbuuid change but did not restart the pipeline.
By writing one more document, we got the actual restart:
2023-11-15T13:43:34.277-08:00 INFO GOXDCR.CheckpointMgr: pipelineFullTopic=5bc716dfb2b79f1429bf75a391164885/B1/B2, Errors encountered in checkpointing: map[406:target vbuuid has changed]
|
…
|
2023-11-15T13:43:34.418-08:00 INFO GOXDCR.PipelineMgr: Replication Status=name={5bc716dfb2b79f1429bf75a391164885/B1/B2}, status={Pending}, errors={[{"time":"2023-11-15T13:43:34.118729-08:00","errMsg":"TopoChangeDet : target vbuuid has changed"}]}, oldProgress={Source nozzles have been closed}, progress={Pipeline has been stopped}
|
Attachments
Issue Links
- is a backport of
-
MB-59669 XDCR - checkpointMgr failover detection only works if seqno progressed
- Closed