XDCR - xmem connection repair does not re-check VBUUID

Description

When resuming a pipeline, XDCR picks a valid checkpoint by ensuring that the VBUUID is acceptable to both source and target. XDCR resumes the pipeline only if both source and target agree that the VBUUID is valid.
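
As a rough illustration of the rule above, the following Go sketch picks the first checkpoint that both sides accept. The Checkpoint type, the pickCheckpoint function, and the two validator callbacks are hypothetical names invented for this example, not goxdcr APIs, and trying candidates newest-first is an assumption.

```go
package main

// Checkpoint is a hypothetical, simplified record of replication progress
// for one vBucket. The real XDCR checkpoint document has more fields.
type Checkpoint struct {
	Seqno        uint64 // source sequence number replicated so far
	SourceVBUUID uint64 // source VBUUID when the checkpoint was taken
	TargetVBUUID uint64 // target VBUUID when the checkpoint was taken
}

// pickCheckpoint walks the candidates in the order given (assumed newest
// first) and returns the first one whose VBUUIDs are accepted by both the
// source and the target. The bool result is false when no candidate
// qualifies, in which case the caller would start from the beginning.
func pickCheckpoint(ckpts []Checkpoint, sourceOK, targetOK func(vbuuid uint64) bool) (Checkpoint, bool) {
	for _, c := range ckpts {
		if sourceOK(c.SourceVBUUID) && targetOK(c.TargetVBUUID) {
			return c, true
		}
	}
	return Checkpoint{}, false
}
```

If the newest checkpoint carries a target VBUUID that the target no longer accepts, the loop falls through to an older checkpoint, which is what causes data to be re-replicated after a target failover.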

XDCR's XMEM (outgoing) nozzle contains a repair feature. When XDCR receives an EOF error from a target node's connection, instead of restarting the pipeline, XDCR closes the connection and opens a new connection to the same target node. It has always assumed that receiving an EOF when reading from a network connection means a network issue.

However, an EOF can indicate not only a network issue but also that the target node has crashed or restarted ungracefully. In that case, when the crashed node restarts and recovers from the unclean shutdown, a new failover log entry is generated.
(See https://github.com/couchbase/kv_engine/blob/master/docs/dcp/documentation/failure-scenarios.md)

When the source XDCR's XMEM nozzle repairs the connection to the downed node, it does not know that the target has changed its VBUUID, and it keeps replicating data.

Currently, it only detects the changed VBUUID when a regularly scheduled checkpoint takes place. During the checkpointing operation, it gets the failover log and compares it with the last known checkpoint. Only then does it realize that a failover occurred and restart the pipeline, which falls back to an earlier checkpoint to re-replicate the data that was lost during the target node crash.
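
The comparison done at checkpoint time can be sketched as follows. FailoverEntry and failoverSince are hypothetical names for illustration only. The sketch assumes the failover log is ordered newest entry first and that an empty log should be treated conservatively as a change.

```go
package main

// FailoverEntry is a simplified failover log entry: the VBUUID of a history
// branch and the sequence number at which that branch started.
type FailoverEntry struct {
	VBUUID uint64
	Seqno  uint64
}

// failoverSince reports whether the target's history has changed since the
// VBUUID recorded in the last known checkpoint. It assumes log[0] is the
// newest entry. A missing or empty log is treated as "changed" so that the
// caller errs on the side of restarting the pipeline.
func failoverSince(log []FailoverEntry, lastKnownVBUUID uint64) bool {
	if len(log) == 0 {
		return true
	}
	return log[0].VBUUID != lastKnownVBUUID
}
```

With a log whose head entry was created by crash recovery, failoverSince returns true for the VBUUID stored in the previous checkpoint, which is the signal that currently arrives only at the next scheduled checkpoint.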

The correct behavior is for XMEM to check the failover log during the repair operation, so that it can restart the pipeline as soon as it realizes that the connection error is not network related but failover related. This will also shrink the window for missing data.
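
One possible shape for the proposed repair path is sketched below. It is not the actual goxdcr implementation: repairConn, ErrRestartPipeline, and the reconnect and currentVBUUID callbacks are hypothetical stand-ins for whatever the nozzle would use to reopen the connection and fetch the target's latest VBUUID per vBucket.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrRestartPipeline is a hypothetical sentinel telling the caller that a
// connection repair is not enough and the pipeline must be restarted.
var ErrRestartPipeline = errors.New("target VBUUID changed; pipeline restart required")

// repairConn reopens the connection and then re-checks the target VBUUID of
// every vBucket this nozzle replicates to. lastKnown maps vBucket number to
// the target VBUUID recorded before the EOF. If any VBUUID differs, the EOF
// was caused by a failover rather than a network problem, so the function
// returns an error wrapping ErrRestartPipeline instead of resuming.
func repairConn(
	reconnect func() error,
	currentVBUUID func(vbno uint16) (uint64, error),
	lastKnown map[uint16]uint64,
) error {
	if err := reconnect(); err != nil {
		return err
	}
	for vbno, known := range lastKnown {
		cur, err := currentVBUUID(vbno)
		if err != nil {
			return err
		}
		if cur != known {
			return fmt.Errorf("vb %d: %w", vbno, ErrRestartPipeline)
		}
	}
	return nil
}
```

A caller could test the result with errors.Is(err, ErrRestartPipeline) and trigger the pipeline restart immediately, rather than waiting for the next scheduled checkpoint to notice the change.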

Components

Affects versions

Fix versions

Morpheus

Labels

None

Environment

None

Link to Log File, atop/blg, CBCollectInfo, Core dump

None

Release Notes Description

None

Linked issues

relates

MB-57353

XDCR RAS

MB-28660

perform _pre_replicate inside xdcr

Activity

Sudeep Jathar March 20, 2025 at 11:33 AM

I haven’t managed to get down to work on this yet.

Neil Huang March 19, 2025 at 9:16 PM

- any updates on this?

Neil Huang December 3, 2024 at 11:43 PM

- Given your recent VBUUID check work for the conflict logger using gomemcached, how much work would it be, and is it feasible, to do the same for this MB? Can you please investigate?

After the analysis, we can decide whether to fix this in Morpheus or move it out to Ponyo.

Details
Assignee
Sudeep Jathar
Reporter
Neil Huang
Is this a Regression?
No
Triage
Untriaged
Story Points
1
Priority
Major

Created September 23, 2021 at 11:15 PM

Updated March 20, 2025 at 11:33 AM
