Description
When resuming a pipeline, XDCR picks a valid checkpoint by ensuring that the VBUUID is acceptable to both source and target. XDCR resumes the pipeline only if both source and target agree that the VBUUID is valid.
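As a rough illustration of the rule above, here is a minimal Go sketch of checkpoint selection. The `Checkpoint` type, the `pickCheckpoint` function, and the two acceptance callbacks are hypothetical names made up for this example; they are not taken from the XDCR code base, and the real validation involves more than a single VBUUID comparison.

```go
package main

import "fmt"

// Checkpoint is a hypothetical, simplified checkpoint record.
type Checkpoint struct {
	Seqno        uint64
	SourceVBUUID uint64
	TargetVBUUID uint64
}

// pickCheckpoint walks the checkpoints in the order given (assumed newest
// first) and returns the first one whose VBUUIDs are accepted by both sides.
// The callbacks stand in for whatever validation source and target perform.
func pickCheckpoint(ckpts []Checkpoint, sourceAccepts, targetAccepts func(uint64) bool) (Checkpoint, bool) {
	for _, c := range ckpts {
		if sourceAccepts(c.SourceVBUUID) && targetAccepts(c.TargetVBUUID) {
			return c, true
		}
	}
	return Checkpoint{}, false
}

func main() {
	ckpts := []Checkpoint{
		{Seqno: 300, SourceVBUUID: 11, TargetVBUUID: 21},
		{Seqno: 200, SourceVBUUID: 11, TargetVBUUID: 22},
	}
	sourceAccepts := func(id uint64) bool { return id == 11 }
	// The target only recognizes VBUUID 22, so the newest checkpoint is rejected.
	targetAccepts := func(id uint64) bool { return id == 22 }

	c, ok := pickCheckpoint(ckpts, sourceAccepts, targetAccepts)
	fmt.Println(c.Seqno, ok)
}
```

In this sketch the newest checkpoint is skipped because only one side accepts it, and the pipeline would resume from the older one.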
XDCR's XMEM (outgoing) nozzle contains a repair feature. When XDCR receives an EOF error from a target node's connection, instead of restarting the pipeline, XDCR closes the connection and opens a new connection to the same target node. It has always assumed that receiving an EOF when reading from a network connection indicates a network issue.
However, an EOF does not necessarily indicate a network issue: it can also occur when the target node has crashed or restarted ungracefully. In this case, when the crashed node restarts and recovers from the unclean shutdown, a new failover log entry is generated.
(See https://github.com/couchbase/kv_engine/blob/master/docs/dcp/documentation/failure-scenarios.md)
When the source XDCR's XMEM nozzle repairs the connection to the downed node, it does not know that the target has changed its VBUUID, and it keeps replicating data.
Currently, XDCR detects the changed VBUUID only when a regularly scheduled checkpoint takes place. During the checkpoint operation, it gets the failover log and compares it with the last known checkpoint. Only then does it realize that a failover occurred and restart the pipeline, which resumes from an earlier checkpoint and re-replicates the data that was lost during the target node crash.
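The comparison made at checkpoint time can be sketched roughly as follows. This is a simplified illustration, not the actual XDCR logic: `FailoverEntry` and `failoverDetected` are hypothetical names, the log is assumed to list the newest entry first, and treating an empty log as a mismatch is a conservative choice made only for this example.

```go
package main

import "fmt"

// FailoverEntry is a hypothetical, simplified failover log entry.
type FailoverEntry struct {
	VBUUID uint64
	Seqno  uint64
}

// failoverDetected reports whether the current failover log no longer matches
// the VBUUID recorded in the last known checkpoint. It assumes the newest
// entry is first and treats an empty log as a mismatch.
func failoverDetected(log []FailoverEntry, checkpointVBUUID uint64) bool {
	if len(log) == 0 {
		return true
	}
	return log[0].VBUUID != checkpointVBUUID
}

func main() {
	before := []FailoverEntry{{VBUUID: 0xA1, Seqno: 0}}
	// After an unclean restart the node adds a new entry with a new VBUUID.
	after := []FailoverEntry{{VBUUID: 0xB2, Seqno: 150}, {VBUUID: 0xA1, Seqno: 0}}

	fmt.Println(failoverDetected(before, 0xA1))
	fmt.Println(failoverDetected(after, 0xA1))
}
```

Until this check runs, nothing tells the nozzle that the VBUUID it last saw is stale, which is why the mismatch goes unnoticed between checkpoints.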
The correct behavior is for XMEM to check the failover log during the repair operation, so that it can restart the pipeline as soon as it realizes that the connection error is failover related rather than network related. This also narrows the window for missing data.
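One possible shape for the proposed repair-time check is sketched below. It is only an illustration under stated assumptions: `RepairAction`, `decideRepair`, `simulatedReadError`, and the `fetchTargetVBUUID` callback are hypothetical and do not correspond to existing XDCR functions. Restarting the pipeline when the VBUUID cannot be fetched, and passing non-EOF errors through untouched, are choices made for this example.

```go
package main

import (
	"errors"
	"fmt"
	"io"
)

// RepairAction is a hypothetical outcome of handling a read error.
type RepairAction int

const (
	PropagateError  RepairAction = iota // not an EOF; outside the scope of this sketch
	Reconnect                           // plain network issue: reopen the connection
	RestartPipeline                     // target VBUUID changed: resume from a checkpoint
)

func (a RepairAction) String() string {
	return [...]string{"PropagateError", "Reconnect", "RestartPipeline"}[a]
}

// decideRepair checks the target's current VBUUID before repairing a
// connection that returned EOF. fetchTargetVBUUID is a stand-in for reading
// the head of the target's failover log.
func decideRepair(readErr error, lastKnownVBUUID uint64, fetchTargetVBUUID func() (uint64, error)) RepairAction {
	if !errors.Is(readErr, io.EOF) {
		return PropagateError
	}
	current, err := fetchTargetVBUUID()
	if err != nil || current != lastKnownVBUUID {
		// Either the VBUUID changed or it cannot be verified; be conservative.
		return RestartPipeline
	}
	return Reconnect
}

// simulatedReadError stands in for the error seen when the target closes the
// connection; it wraps io.EOF the way a real read path might.
func simulatedReadError() error {
	return fmt.Errorf("read from target: %w", io.EOF)
}

func main() {
	unchanged := func() (uint64, error) { return 7, nil }
	changed := func() (uint64, error) { return 8, nil }

	fmt.Println(decideRepair(simulatedReadError(), 7, unchanged))
	fmt.Println(decideRepair(simulatedReadError(), 7, changed))
}
```

The point of the sketch is the ordering: the VBUUID comparison happens before the connection is reopened, not at the next scheduled checkpoint.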