Fixed
Details
Assignee: Neil Huang
Reporter: Neil Huang
Is this a Regression?: Yes
Triage: Untriaged
Story Points: 0
Priority: Critical
Created May 31, 2023 at 10:07 PM
Updated September 19, 2023 at 10:34 AM
Resolved June 5, 2023 at 10:38 PM
Issue
XDCR could miss replicating data. This was found via the XDCR dev’s collection test suite, test 4i: https://github.com/couchbase/goxdcr/blob/master/tools/testScripts/collectionTestcases/4i_explicit_mapping_change_ts.shlib
The test was part of the effort to validate .
Running it seems to fail now, with the following scenario.
The test pauses and changes the explicit mapping at the same time, then issues a pipeline resume.
Once the pipeline resumes, a backfill should take place, but a race condition in XDCR can cause data to be missed.
In production, a pipeline could be paused, or it could simply be restarting because of some other error, at the time a backfill is raised, and the same situation would apply.
The missing data is caused by the way the backfill is raised; the logic is here: https://github.com/couchbase/goxdcr/blob/20158cb0da504b7a51cf915afdacd8e841259408/backfill_manager/backfill_request_handler.go#L1036-L1049
The issue arises when the pipeline is stopping while a backfill is being raised: the checkpoint manager performs a checkpoint as part of the pipeline stop.
However, the backfill raise logic may need to read checkpoints, and the two efforts are not coordinated. Thus, it’s possible to read an incomplete set of checkpoints that is missing certain VBs, because the checkpoint manager is performing its checkpoint independently and hasn’t gotten the chance to create those VBs yet.
In this scenario, “incomplete” could mean either 1) the VB document is missing, or 2) the checkpoint seqno being written isn’t persisted yet, so the backfill mgr reads an earlier seqno.
As an example:
Say VB0 is at throughSeqno 100. Checkpoint currently sits at 50.
When pipeline stops, the checkpoint needs to be updated to 100.
Before the checkpoint can be written as 100, the backfill raise logic reads the previous checkpoint of 50 and raises a backfill from 0 to 50 only.
Once the main pipeline and the backfill pipeline resume, the main pipeline will start at 100, while the backfill pipeline will perform a backfill from 0 to 50 only.
XDCR will have missed replicating seqnos 50 to 100.
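The arithmetic of this example can be modeled in a few lines of Python. This is only an illustrative sketch, not goxdcr code: `ResumePlan` and `plan_resume` are hypothetical names, and the model assumes the main pipeline always resumes from the throughSeqno while the backfill ends at whatever checkpoint the backfill raise logic happened to read.

```python
from dataclasses import dataclass


@dataclass
class ResumePlan:
    main_start: int    # seqno at which the main pipeline resumes
    backfill_end: int  # last seqno covered by the backfill task (it starts at 0)

    def missed_range(self):
        """Return the (start, end) seqno gap that nobody replicates, or None."""
        if self.backfill_end < self.main_start:
            return (self.backfill_end, self.main_start)
        return None


def plan_resume(through_seqno, ckpt_seen_by_backfill):
    # Hypothetical model: the checkpoint written at pipeline stop equals the
    # throughSeqno, so the main pipeline resumes there. The backfill task ends
    # at the checkpoint value the backfill raise logic read, stale or not.
    return ResumePlan(main_start=through_seqno, backfill_end=ckpt_seen_by_backfill)


# Racy read: backfill sees the old checkpoint (50) before 100 is written.
racy = plan_resume(through_seqno=100, ckpt_seen_by_backfill=50)
# Coordinated read: backfill sees the checkpoint written at pipeline stop.
coordinated = plan_resume(through_seqno=100, ckpt_seen_by_backfill=100)

print(racy.missed_range())         # (50, 100)
print(coordinated.missed_range())  # None
```

The gap only closes when the backfill raise logic reads the checkpoint after the pipeline-stop checkpoint has been written, which is exactly the coordination that is missing.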
Through some debugging and analysis, the culprit appears to be the checkpoint manager cache implemented as part of P2P of .
Workaround
One workaround is to disable the checkpoint cache using the replication setting with the key ckptSvcCacheEnabled. As of 7.1.0, it defaults to true. Disabling the cache ensures the test passes and all data is backfilled.
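As a rough sketch of how the setting might be applied, the snippet below only builds (and does not send) an HTTP request. It assumes that per-replication settings are changed with a POST to `/settings/replications/<replication id>` on the admin port 8091 and that `ckptSvcCacheEnabled` is accepted there as a form field; the helper name, host, and replication ID are hypothetical, and authentication is left out.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request


def build_disable_ckpt_cache_request(host, replication_id):
    """Build (but do not send) a request that sets ckptSvcCacheEnabled=false.

    Assumption: the replication settings endpoint and port below apply to
    this setting; verify against the cluster's REST documentation first.
    """
    # Replication IDs contain slashes, so the ID is fully percent-encoded.
    encoded_id = quote(replication_id, safe="")
    url = f"http://{host}:8091/settings/replications/{encoded_id}"
    body = urlencode({"ckptSvcCacheEnabled": "false"}).encode()
    return Request(url, data=body, method="POST")


# Hypothetical host and replication ID, for illustration only.
req = build_disable_ckpt_cache_request("localhost", "abc/B1/B2")
print(req.get_method(), req.full_url)
print(req.data)
```

Sending the request would additionally need administrator credentials, for example via an `Authorization` header added to the `Request`.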
Solution
From what I can see, the current behavior is racy. Looking at the original need to implement , I would argue that it may be safer to invalidate the checkpoint cache as soon as a single VB checkpoint is inserted.
I suspect the current cache implementation is too complex, and the proposed solution seemed to fix the problem.
I need to re-evaluate the cost and do some rudimentary testing to ensure that P2P won't take a performance hit from the simplified cache proposal (i.e. reintroduce ). If it doesn't, then the proposed solution would probably be best.
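The "invalidate on any single-VB insert" idea can be illustrated with a small Python model. This is a sketch of the general technique only, not the goxdcr checkpoint service: `CheckpointStore` and `SimpleCkptCache` are hypothetical names, and the store stands in for the persisted per-VB checkpoint documents.

```python
class CheckpointStore:
    """Hypothetical stand-in for the persisted per-VB checkpoint documents."""

    def __init__(self):
        self.docs = {}
        self.reads = 0  # counts full loads, to make the cost visible

    def load_all(self):
        self.reads += 1
        return dict(self.docs)


class SimpleCkptCache:
    """Cache that is dropped whenever any single VB checkpoint is written."""

    def __init__(self, store):
        self._store = store
        self._snapshot = None  # None means "invalid, reload on next read"

    def upsert(self, vbno, seqno):
        self._store.docs[vbno] = seqno
        self._snapshot = None  # invalidate on every single-VB write

    def get_all(self):
        if self._snapshot is None:
            self._snapshot = self._store.load_all()
        return dict(self._snapshot)


store = CheckpointStore()
cache = SimpleCkptCache(store)

cache.upsert(0, 50)
print(cache.get_all())   # {0: 50}
cache.upsert(0, 100)     # pipeline-stop checkpoint lands
print(cache.get_all())   # {0: 100}
print(store.reads)       # 2
```

In this model a reader can never be handed a snapshot older than the latest write, at the price of one full reload after every write. That reload cost is the performance question raised above for P2P.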
Testing and debugging below
test output
XDCR is missing a subset of backfill. Using the diff tool, we have the following counts with the specific VB information:
Perusing the logs, we see the following “optimization” message:
The backfill task that corresponds with this specific missing data set is supposed to be:
Looking at the backfill tasks, we’re missing the VBs: the backfill task isn’t being raised correctly.
Issue: XDCR Checkpoint Manager instances were not cleaned up under certain circumstances due to timing and networking issues when contacting target, or when an invalid backfill task was fed in as input.
Resolution: Checkpoint Manager instances are now cleaned up. A flag has been added to check for invalid backfill tasks.