[BP 7.2.4] - XDCR - make backfill pipeline idle detection more intelligent

Description

XDCR's backfill pipeline is designed such that the following occurs:

  1. The backfill pipeline is created and asks for a set of VBs.

  2. Streams are created to request that subset of VBs from KV DCP.

  3. Once a single VB out of the original subset has finished, XDCR starts a timer.

  4. If all of the originally requested VBs finish before the timer expires, all is well.

  5. If some of the originally requested VBs have not finished when the timer fires, the pipeline restarts with the unfinished set of VBs.

Timer code: https://src.couchbase.org/source/xref/7.2.2/goproj/src/github.com/couchbase/goxdcr/service_impl/through_seqno_tracker_service.go#1121
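The flow in steps 1 to 5 can be modelled roughly as below. This is a minimal sketch, not the goxdcr implementation: the type `blanketTimerTracker` and its methods are hypothetical names made up for illustration, and the real timer is replaced by an explicit `onTimerFired` call so the logic can be followed without waiting on a clock.

```go
package main

// blanketTimerTracker is a hypothetical model of steps 1-5. It is not a
// goxdcr type.
type blanketTimerTracker struct {
	requested  []uint16        // VBs the backfill pipeline asked for (step 1)
	finished   map[uint16]bool // VBs whose backfill stream has completed
	timerArmed bool            // set once the first VB finishes (step 3)
}

func newBlanketTimerTracker(vbs []uint16) *blanketTimerTracker {
	return &blanketTimerTracker{
		requested: append([]uint16(nil), vbs...),
		finished:  make(map[uint16]bool),
	}
}

// markFinished records a completed VB and arms the timer on the first one.
func (b *blanketTimerTracker) markFinished(vb uint16) {
	b.finished[vb] = true
	b.timerArmed = true
}

// onTimerFired returns the VBs a restarted pipeline would ask for (step 5).
// An empty result corresponds to step 4: everything finished in time.
func (b *blanketTimerTracker) onTimerFired() []uint16 {
	var unfinished []uint16
	for _, vb := range b.requested {
		if !b.finished[vb] {
			unfinished = append(unfinished, vb)
		}
	}
	return unfinished
}
```

In this model, requesting VBs 0 to 3 and finishing only VB 2 before the timer fires leaves VBs 0, 1 and 3 for the restarted pipeline.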

https://couchbasecloud.atlassian.net/browse/MB-57304#icft=MB-57304 in 7.2.2 introduced a DCP backfill limit of 64 streams.

As a result, only 64 VBs proceed at once, and the rest wait. As soon as one VB of the first batch finishes, the timer starts, so the assumption that all VBs proceed at the same time is broken.
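A small piece of arithmetic makes the effect concrete. The helpers below are hypothetical and only illustrate the numbers; they assume a simplified model in which VBs are served in rounds of at most `streamLimit` streams, which may not match how KV actually schedules backfills.

```go
package main

// waitingBehindLimit returns how many requested VBs are still queued behind
// the stream limit while the first round of streams is running, i.e. at
// about the time the first VB finishes and the timer starts.
func waitingBehindLimit(requestedVBs, streamLimit int) int {
	if requestedVBs <= streamLimit {
		return 0
	}
	return requestedVBs - streamLimit
}

// roundsNeeded returns ceil(requestedVBs / streamLimit): the number of
// rounds required before every requested VB has had a stream.
func roundsNeeded(requestedVBs, streamLimit int) int {
	return (requestedVBs + streamLimit - 1) / streamLimit
}
```

With an illustrative request of 256 VBs and the limit of 64, 192 VBs have not started when the timer begins counting, and four rounds are needed in total, yet the single timer was sized on the assumption that all VBs progress together.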

The end result is that the timer is too aggressive.

We should revisit the timeout timer and make it more intelligent than a blanket timer. For example, the timer could be reset as long as the number of finished VBs keeps increasing.
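One way to read this suggestion is sketched below: on every timer tick, compare the number of finished VBs with the count seen at the previous tick, and only allow a restart when nothing new has finished. The type `progressGate` and its method are hypothetical names for illustration and are not taken from goxdcr.

```go
package main

// progressGate is a hypothetical helper that remembers how many VBs had
// finished at the previous timer tick.
type progressGate struct {
	lastFinished int
}

// shouldRestart is meant to be called on each timer tick with the current
// number of finished VBs. It returns false (skip this tick) when at least
// one more VB has finished since the previous tick, and true when no
// progress was observed.
func (g *progressGate) shouldRestart(finishedNow int) bool {
	if finishedNow > g.lastFinished {
		g.lastFinished = finishedNow
		return false
	}
	return true
}
```

With this gate, ticks that see 1 and then 5 finished VBs are skipped, and a following tick that still sees 5 would let the restart go ahead. A slow but steadily advancing backfill is therefore left alone, while a stalled one is still restarted after one full interval without progress.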

 

Environment

None

Link to Log File, atop/blg, CBCollectInfo, Core dump

None

Release Notes Description

None

Activity

Neil Huang January 3, 2024 at 6:48 PM

Release Notes
Problem Description: In a slow-running backfill replication, XDCR could be too aggressive in restarting pipelines.
Resolution: Ensure XDCR does not restart backfill pipelines if some progress is observed periodically.

Ayush Nayyar December 14, 2023 at 12:22 PM

Verified on 7.2.4-7045.

CB robot November 30, 2023 at 8:57 AM

Build couchbase-server-7.2.4-7030 contains goxdcr commit dd5a8dc with commit message:
https://couchbasecloud.atlassian.net/browse/MB-59850#icft=MB-59850: skip the backfill killTimer tick if progress was observed

Fixed

Details

Is this a Regression?

No

Triage

Untriaged

Issue Impact

external

Created November 28, 2023 at 6:37 AM
Updated March 21, 2025 at 2:48 AM
Resolved November 30, 2023 at 7:07 AM