[BP 7.1.5] - XDCR - Incorrect Backfill replication can be created with a paused main pipeline leading to missed data

Description

Issue

XDCR could miss replicating data. This was found via the XDCR dev’s collection test suite, test 4i: https://github.com/couchbase/goxdcr/blob/master/tools/testScripts/collectionTestcases/4i_explicit_mapping_change_ts.shlib
The test was part of the effort to validate .

Running it now fails in the following scenario.

The test pauses the pipeline and changes the explicit mapping at the same time, then issues a pipeline resume.
Once the pipeline resumes, the backfill should have taken place, but a race condition in XDCR can cause data to be missed.

In production, a pipeline could be paused, or it could be restarting because of some other error, at the time a backfill is raised, and the same situation would apply.

The mechanism that causes the missing data lies in how the backfill is raised; the logic is here: https://github.com/couchbase/goxdcr/blob/20158cb0da504b7a51cf915afdacd8e841259408/backfill_manager/backfill_request_handler.go#L1036-L1049
The issue arises when the pipeline is stopping while a backfill is being raised: the checkpoint manager performs a checkpoint as part of the pipeline stop.
However, the backfill raise logic may need to read checkpoints, and the two operations are not coordinated. It is therefore possible to read an incomplete set of checkpoints that is missing certain VBs, because the checkpoint manager is checkpointing independently and hasn’t yet had the chance to create those VBs.

In this scenario, “incomplete” could mean either 1) the VB document is missing, or 2) the checkpoint seqno being written isn’t persisted yet, so the backfill manager reads an earlier seqno.
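
To make the two meanings of “incomplete” concrete, here is a minimal Go sketch. It is not goxdcr code: `checkpointGaps` and both maps are hypothetical names made up for illustration, and the sketch assumes a checkpoint set can be reduced to a VB number mapped to a seqno.

```go
package main

import (
	"fmt"
	"sort"
)

// checkpointGaps is a hypothetical helper (not part of goxdcr). It compares
// the checkpoints a reader observed against the throughSeqnos the pipeline
// has actually reached, and returns:
//   - missing: VBs with no checkpoint document at all (case 1)
//   - stale:   VBs whose checkpointed seqno is behind the throughSeqno (case 2)
func checkpointGaps(throughSeqnos, observed map[uint16]uint64) (missing, stale []uint16) {
	for vb, through := range throughSeqnos {
		ckpt, ok := observed[vb]
		switch {
		case !ok:
			missing = append(missing, vb)
		case ckpt < through:
			stale = append(stale, vb)
		}
	}
	// Map iteration order is random in Go, so sort for stable output.
	sort.Slice(missing, func(i, j int) bool { return missing[i] < missing[j] })
	sort.Slice(stale, func(i, j int) bool { return stale[i] < stale[j] })
	return missing, stale
}

func main() {
	through := map[uint16]uint64{0: 100, 1: 80, 2: 60}
	// VB1 has no document yet and VB0 still holds an older seqno.
	observed := map[uint16]uint64{0: 50, 2: 60}

	missing, stale := checkpointGaps(through, observed)
	fmt.Println("missing:", missing, "stale:", stale)
}
```

Either list being non-empty means a backfill computed from `observed` would be based on an incomplete view.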

As an example:
Say VB0 is at throughSeqno 100 and its checkpoint currently sits at 50.
When the pipeline stops, the checkpoint needs to be updated to 100.
Before the checkpoint is written as 100, the backfill raise reads the previous checkpoint of 50 and creates a backfill from 0 to 50 only.
Once the main pipeline and the backfill pipeline resume, the main pipeline starts at 100, while the backfill pipeline backfills from 0 to 50 only.
XDCR will have missed replicating seqnos 50 to 100.
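
The arithmetic of the example can be written down as a small Go sketch. `missedRange` is a hypothetical helper, not a goxdcr function, and it glosses over whether the endpoints are inclusive, just as the example does.

```go
package main

import "fmt"

// missedRange is a hypothetical helper (not part of goxdcr). Given the seqno
// at which the backfill ends and the seqno at which the main pipeline
// resumes, it reports the span that neither pipeline covers.
func missedRange(backfillEnd, mainStart uint64) (from, to uint64, missed bool) {
	if backfillEnd >= mainStart {
		// The backfill reaches the main pipeline's start: nothing is skipped.
		return 0, 0, false
	}
	return backfillEnd, mainStart, true
}

func main() {
	const throughSeqno = 100 // where the main pipeline resumes

	// Racy case: the backfill raise read the stale checkpoint of 50.
	if from, to, missed := missedRange(50, throughSeqno); missed {
		fmt.Printf("stale checkpoint: missed seqnos %d to %d\n", from, to)
	}

	// Coordinated case: the backfill raise read the final checkpoint of 100.
	if _, _, missed := missedRange(100, throughSeqno); !missed {
		fmt.Println("final checkpoint: no gap")
	}
}
```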

After some debugging and analysis, the culprit is the checkpoint manager cache implemented as part of P2P of .

Workaround

One way to work around this is to disable the checkpoint cache, using the replication setting with the key ckptSvcCacheEnabled. As of 7.1.0, it defaults to true.
Disabling the cache ensures the test passes and all data is backfilled.

Solution
From what I can see, things are racy. Looking at the original need to implement , I would argue that it may be safer to invalidate the checkpoint cache as soon as a single VB checkpoint is inserted.
I suspect the current cache implementation is too complex, and the proposed solution seems to fix the problem.
I need to re-evaluate the cost and do some rudimentary testing to ensure that P2P won't take a performance hit from the simplified cache proposal (i.e. reintroduce ). If it doesn't, then the proposed solution would probably be best.
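
As a rough illustration of the proposal only (not the goxdcr implementation, and not necessarily what the eventual fix does), a cache that drops its whole snapshot whenever a single VB checkpoint is upserted could look like the Go sketch below. `ckptCache`, `UpsertVB`, `GetAll` and the in-memory `store` standing in for persisted checkpoint docs are all hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// ckptCache is a hypothetical, simplified stand-in for a checkpoint cache.
// One mutex guards both the backing store and the cached snapshot, so a
// reader can never observe a snapshot older than a completed upsert.
type ckptCache struct {
	mu    sync.Mutex
	valid bool
	docs  map[uint16]uint64 // cached snapshot: VB number -> checkpointed seqno
	store map[uint16]uint64 // stand-in for the persisted checkpoint docs
}

func newCkptCache() *ckptCache {
	return &ckptCache{store: make(map[uint16]uint64)}
}

// UpsertVB writes one VB checkpoint and invalidates the entire cache.
func (c *ckptCache) UpsertVB(vb uint16, seqno uint64) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.store[vb] = seqno
	c.valid = false
	c.docs = nil
}

// GetAll returns a copy of the checkpoints, reloading the snapshot from the
// store first if any upsert has invalidated it.
func (c *ckptCache) GetAll() map[uint16]uint64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	if !c.valid {
		c.docs = make(map[uint16]uint64, len(c.store))
		for vb, seqno := range c.store {
			c.docs[vb] = seqno
		}
		c.valid = true
	}
	out := make(map[uint16]uint64, len(c.docs))
	for vb, seqno := range c.docs {
		out[vb] = seqno
	}
	return out
}

func main() {
	c := newCkptCache()
	c.UpsertVB(0, 50)
	fmt.Println(c.GetAll()[0]) // snapshot is now cached

	c.UpsertVB(0, 100) // a single VB upsert drops the cached snapshot
	fmt.Println(c.GetAll()[0])
}
```

The trade-off is the one raised above: every upsert forces the next read to reload, which is simple to reason about but may cost P2P some performance.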

Testing and debugging below

test output

XDCR is missing a subset of backfill. Using the diff tool, we have the following counts with the specific VB information:

Perusing through the logs, we see the following “optimization” message:

The backfill task that corresponds with this specific missing data set is supposed to be:

Looking at the backfill tasks, we’re missing the VBs: the backfill task isn’t being raised correctly.

Issue

XDCR Checkpoint Manager instances were not cleaned up under certain circumstances due to timing and networking issues when contacting target, or when an invalid backfill task was fed in as input.

Resolution

Checkpoint Manager instances are now cleaned up. A flag has been added to check for invalid backfill tasks.

Activity

CB robot June 6, 2023 at 7:18 AM

Build couchbase-server-7.1.5-3832 contains goxdcr commit bd5492e with commit message:
: synchronise checkpoint service cache

Fixed

Is this a Regression?

Yes

Triage

Untriaged

Created May 31, 2023 at 10:07 PM
Updated September 19, 2023 at 10:34 AM
Resolved June 5, 2023 at 10:38 PM