Loading...

XML

Word

Printable

Details

Type: Bug
Resolution: Fixed
Priority: Major
Fix Version/s: 7.6.0
Affects Version/s: 7.0.0, 7.0.1, 7.0.2, 7.0.3, 7.0.4, 7.1.4, 7.0.5, 7.1.0, 7.1.1, 7.1.2, 7.2.0, 7.1.3, 7.2.1, 7.1.5, 7.2.2
Component/s: XDCR
Labels:
- releasenotecomplete

Triage:
Untriaged
Story Points:
0
Is this a Regression?:
No

Description

Test 2 Source side

2023-11-10T19:50:13.418Z INFO GOXDCR.CheckpointMgr: 9a26b5281c0bb68d3565f02dedbe13e3/b1/b2 Received rollback from DCP stream vb=190, rollbackseqno=0

2023-11-10T19:50:13.202655+00:00 WARNING 70: (b1) DCP (Producer) eq_dcpq:xdcr:dcp_9a26b5281c0bb68d3565f02dedbe13e3/b1/b2_127.0.0.1:11210_1:eZKvS2DDiE69D9iifDADsw== - (vb:634) Stream request requires rollback to seqno:0 because vBucket UUID not found in failover table, consumer and producer have no common history. Client requested seqnos:{1,18446744073709551615} snapshot:{0,1} uuid:270389697156690

We can see that VBUUID did indeed diverged.

2023-11-10T19:50:14.816Z INFO GOXDCR.CheckpointMgr: 9a26b5281c0bb68d3565f02dedbe13e3/b1/b2 Rolled back startSeqno to 0 for vb=36

2023-11-10T19:50:14.816Z INFO GOXDCR.CheckpointMgr: 9a26b5281c0bb68d3565f02dedbe13e3/b1/b2 Retry vbts=[vbno=36, uuid=27833190048882, seqno=0, sn_start=0, sn_end=0, srcManId=0, tgtManId=0]

…

2023-11-10T19:51:43.604Z INFO GOXDCR.StatsMgr: 9a26b5281c0bb68d3565f02dedbe13e3/b1/b2 total_docs=0, docs_processed=0, changes_left=0

After the rollback, we indeed see that the source bucket has no document per Ephemeral bucket failover.
After this point, we need to examine the target.

On the target, we see VBUUID changed from prior to failover:

2023-11-10T19:43:56.815Z INFO GOXDCR.CheckpointMgr: high_seqno_and_vbuuid_map=map[0:[2 154426843721660] 1:[3 210277909393215] 2:[1 84684117587385] 3:[1 8290499089518] 4:[1 196910853336884] 5:[0 199740622769747] 6:[3 119872586875387] 7:[4 222892142227865] 8:[4 95022217696648] 9:[3 39012357378059] 10:[1 211788290813736] 11:[1 34673846444505] 12:[0 228661520770186] 13:[1 166863773917663] 14:[3 137415014364077] 15:[2 77241665392177] 16:[0 39318197108351] 17:[1 51085109810178] 18:[2 254941462668812] 19:[3 216767593795803] 20:[3 212991720192003] 21:[4 153567376942254] 22:[2 270555470095135] 23:[0 249286862309245] 24:[2 134403816601229] 25:[0 124606719495893] 26:[4 224351320372821] 27:[3 225902359461913] 28:[3 62391942643225] 29:[2 244965601549711] 30:[0 206364212072488] 31:[1 237853971712258] 32:[1 41949912824453] 33:[1 59002330479586] 34:[3 140832479357498] 35:[2 223724436483744] 36:[5 27833190048882] 37:[3 81878845313773] 38:[0 249417699123833] 39:[2 187641062205983] 40:[2 202554840050771] 41:[2 157974071400220] 42:[3 102706939829461] 43:[5 253384948749306

with some connectivity issues in between:

2023-11-10T19:53:56.776Z WARN GOXDCR.CheckpointMgr: MainPipeline cae0a545b06006cf27f38e63a52f3130/b2/b1 Error getting vbucket-seqno stats for serverAddr=172.31.45.240:11210. vbnos=[0 1 2 3 4 …1022 1023], err=write tcp 172.31.44.32:37796->172.31.45.240:11210: write: broken pipe

to post failover:

2023-11-10T19:53:57.020Z INFO GOXDCR.CheckpointMgr: high_seqno_and_vbuuid_map=map[0:[0 126957815739762] 1:[0 73656497180216] 2:[0 210882170864119] 3:[0 227669496888839] 4:[0 234892281058358] 5:[0 77855964368367] 6:[0 94592576861185] 7:[0 97404996130780] 8:[0 202473979135397] 9:[0 137334799062834] 10:[0 49715392751982] 11:[0 77039012834310] 12:[0 143315529511933] 13:[0 256444867542785] 14:[0 98071247603780] 15:[0 118524803845585] 16:[0 54603363440149] 17:[0 99076048686569] 18:[0 197290223413343] 19:[0 252515855428545] 20:[0 165203606207565]

The odd thing is that this should have triggered a restart but it did not.

Diving in deeper by comparing the logs between test 1 and test 2, I realized that the timing of the situation is different. Test 1 has the situation where checkpoint manager has not done any checkpointing at all, then the failover occurs on the target. For test 2, at least one successful checkpoint operation has occurred, then the failover happens on the target. That’s the difference I can see.

This is most likely due to the fact that no mutation was written, meaning that the through sequence number did not move. So, XDCR checkpoint manager will skip the actual checkpointing operation. When it skips the checkpointing, it also skips the VBUUID check verification. I suspect that if one more mutation was written, it would cause VBUUID detection to occur.

I was able to write a test and verify this theory and reproduce the issue, such that Checkpoint Manager was aware of the target vbuuid change but did not restart the pipeline.
By writing one more document, we got the actual restart:

2023-11-15T13:43:34.277-08:00 INFO GOXDCR.CheckpointMgr: pipelineFullTopic=5bc716dfb2b79f1429bf75a391164885/B1/B2, Errors encountered in checkpointing: map[406:target vbuuid has changed]

…

2023-11-15T13:43:34.418-08:00 INFO GOXDCR.PipelineMgr: Replication Status=name={5bc716dfb2b79f1429bf75a391164885/B1/B2}, status={Pending}, errors={[{"time":"2023-11-15T13:43:34.118729-08:00","errMsg":"TopoChangeDet : target vbuuid has changed"}]}, oldProgress={Source nozzles have been closed}, progress={Pipeline has been stopped}

Attachments

Issue Links

backports to

MB-59703 [BP 7.2.4] - XDCR - checkpointMgr failover detection only works if seqno progressed

Closed

Gerrit Reviews

- Issue Only
- Show All Reviews
- Show Open Reviews
- Show All Issues
- Show Open Issues

No reviews matched the request. Check your Options in the drop-down menu of this sections header.

Activity

People

Assignee:: Ayush Nayyar

Reporter:: Neil Huang

Votes:: 0 Vote for this issue

Watchers:: 2 Start watching this issue

Dates

Created:: 15/Nov/23 2:50 PM

Updated:: 05/Mar/24 6:31 AM

Resolved:: 17/Nov/23 11:12 AM

Gerrit Reviews

There are no open Gerrit changes

Show There is 1 closed Gerrit change

Hide There is 1 closed Gerrit change

MB-59669: target failover detection improvements: Gerrit Review:

Details

Description

Attachments

Issue Links

Gerrit Reviews

Activity

People

Dates

Gerrit Reviews

PagerDuty