Details
Assignee: Sudeep Jathar
Reporter: Neil Huang
Is this a Regression?: No
Triage: Untriaged
Story Points: 1
Priority: Major
Created September 23, 2021 at 11:15 PM
Updated March 20, 2025 at 11:33 AM
When resuming a pipeline, XDCR picks a valid checkpoint by ensuring that the VBUUID is acceptable to both source and target. XDCR resumes the pipeline from a checkpoint only if both source and target agree that its VBUUID is valid.
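The checkpoint selection described above can be sketched as follows. This is only an illustration, not XDCR's actual code: the `Checkpoint` record, the `pick_resume_checkpoint` function, and the idea of passing each side's accepted VBUUIDs as plain sets are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Set


@dataclass(frozen=True)
class Checkpoint:
    """Hypothetical checkpoint record for one vbucket."""
    seqno: int          # source sequence number replicated so far
    source_vbuuid: int  # source VBUUID when the checkpoint was taken
    target_vbuuid: int  # target VBUUID when the checkpoint was taken


def pick_resume_checkpoint(
    checkpoints: Iterable[Checkpoint],
    source_valid: Set[int],
    target_valid: Set[int],
) -> Optional[Checkpoint]:
    """Return the newest checkpoint that both sides accept, or None.

    source_valid / target_valid stand in for whatever check each side
    performs to decide that a VBUUID is still acceptable (assumption).
    """
    for ckpt in sorted(checkpoints, key=lambda c: c.seqno, reverse=True):
        if ckpt.source_vbuuid in source_valid and ckpt.target_vbuuid in target_valid:
            return ckpt
    # No checkpoint is agreed on by both sides; the caller would have to
    # start replication from the beginning.
    return None
```

For example, if the target only accepts the VBUUID recorded in an older checkpoint, the newer checkpoint is skipped and the older one is returned.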
XDCR's XMEM (outgoing) nozzle contains a repair feature. When XDCR receives an EOF error from a target node's connection, instead of restarting the pipeline, XDCR closes the connection and opens a new connection to the same target node. It has always assumed that receiving an EOF when reading from a network connection means a network issue.
However, an EOF can be caused not only by a network issue but also by the target node crashing or restarting ungracefully. In this case, when the crashed node restarts and recovers from the unclean shutdown, a new failover log entry is generated.
(See https://github.com/couchbase/kv_engine/blob/master/docs/dcp/documentation/failure-scenarios.md)
When the source XDCR's XMEM nozzle repairs the connection to the downed node, it does not know that the target has changed its VBUUID, and it keeps replicating data.
Currently, it only detects the changed VBUUID when a regularly scheduled checkpoint takes place. During the checkpointing operation, it gets the failover log and compares it with the last known checkpoint. Only then does it realize that a failover occurred and restart the pipeline, which resumes from an earlier checkpoint to re-replicate the data that was lost during the target node crash.
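A minimal sketch of the comparison made at checkpoint time, assuming the failover log is available as a list of `(vbuuid, seqno)` entries with the newest entry first. The function names and the in-memory representation are hypothetical and are not taken from XDCR.

```python
from typing import List, Tuple

# Assumed shape: one (vbuuid, seqno) pair per failover log entry, newest first.
FailoverLog = List[Tuple[int, int]]


def current_vbuuid(failover_log: FailoverLog) -> int:
    """Return the VBUUID of the newest failover log entry."""
    if not failover_log:
        raise ValueError("failover log is empty")
    return failover_log[0][0]


def target_failed_over(failover_log: FailoverLog, checkpoint_vbuuid: int) -> bool:
    """True if the target's current VBUUID differs from the one recorded
    in the last known checkpoint, meaning a failover occurred."""
    return current_vbuuid(failover_log) != checkpoint_vbuuid
```

With a checkpoint that recorded VBUUID `21`, a log whose newest entry still carries `21` reports no failover, while a log with a newer entry on top (as produced after an unclean shutdown) reports one.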
The correct behavior is for XMEM to check the failover log during the repair operation, so that it can restart the pipeline as soon as it realizes that the connection error is failover related rather than network related. This will also cut down on the window for missing data.
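One possible shape for the proposed repair path is sketched below. It is an illustration under stated assumptions, not the actual fix: `repair_after_eof`, `RepairOutcome`, and the two callbacks are hypothetical names, and the failover logs are assumed to arrive as a per-vbucket mapping of `(vbuuid, seqno)` entries with the newest entry first.

```python
from enum import Enum
from typing import Callable, Dict, List, Tuple

FailoverLog = List[Tuple[int, int]]  # assumed: (vbuuid, seqno), newest first


class RepairOutcome(Enum):
    CONTINUE = "continue"                  # plain network issue, keep replicating
    RESTART_PIPELINE = "restart_pipeline"  # target failed over, go back to a checkpoint


def repair_after_eof(
    reconnect: Callable[[], None],
    fetch_failover_logs: Callable[[], Dict[int, FailoverLog]],
    last_known_vbuuids: Dict[int, int],
) -> RepairOutcome:
    """Reopen the connection, then verify target VBUUIDs before resuming.

    reconnect and fetch_failover_logs are hypothetical callbacks standing in
    for the nozzle's connection handling and failover log retrieval.
    """
    reconnect()
    logs = fetch_failover_logs()
    for vbno, known_vbuuid in last_known_vbuuids.items():
        log = logs.get(vbno)
        # A missing log or a different newest VBUUID means the EOF was not
        # just a network blip for this vbucket.
        if not log or log[0][0] != known_vbuuid:
            return RepairOutcome.RESTART_PIPELINE
    return RepairOutcome.CONTINUE
```

In this sketch the check runs right after the reconnect, so a changed VBUUID is caught before any further mutations are sent rather than at the next scheduled checkpoint.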