Note: 7.0 and above are not in the affected version list
A few things need to occur, with a timing issue involved (prerequisites, referred to below as preReq 1-3):
- There needs to be a valid checkpoint for a VB to resume from (i.e. seqno > purgeSeqno).
- There needs to be an error condition that causes the pipeline to restart.
- XDCR's ability to read simple_store for valid checkpoints must be severely delayed, OR the target connection must be delayed enough that the iterative checkpoint read is slow.
With these conditions in place, the following will happen in sequence:
1. The pipeline has a valid checkpoint.
2. The pipeline needs to restart due to an error (or resume after a pause).
3. As part of the restart, the pipeline first has to stop. While stopping, the checkpoint manager writes a valid checkpoint for each VB (preReq 1).
4. Once the pipeline has stopped, it starts again.
5. During the start, an error condition occurs (preReq 2). This can be anything; in some customer situations it is caused by network issues that make either CkptMgr or XMEM take a long time to start, leading to a "start part timeout".
6. PreReq 3 is in play here: the pipeline start code has not yet had a chance to read the valid checkpoints that were written in point 3.
7. Point 5 results in another pipeline restart. During the stopping portion, the checkpoint manager performs a checkpoint. However, because the previous checkpoint never got read, the through sequence number is actually 0. Had the checkpoint been read, the throughSeqno would have been set to the previously checkpointed number.
8. Each VB whose checkpoint was not read (point 6) gets a new checkpoint entry with seqno 0 written ahead of the last valid checkpoint, which loses the previously replicated progress.
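To make the data-loss mechanism in the last step concrete, here is a minimal Go sketch. The types and functions (`Checkpoint`, `prependCheckpoint`, `resumeSeqno`) are hypothetical simplifications, not goxdcr's actual identifiers; the assumption is that checkpoint records are kept newest-first and resumption reads the newest entry.

```go
package main

import "fmt"

// Checkpoint is a stand-in for a per-VB checkpoint record (hypothetical type).
type Checkpoint struct {
	Seqno uint64
}

// prependCheckpoint models writing a new checkpoint ahead of older ones:
// the newest record is stored first, which is what resumption reads.
func prependCheckpoint(ckpts []Checkpoint, seqno uint64) []Checkpoint {
	return append([]Checkpoint{{Seqno: seqno}}, ckpts...)
}

// resumeSeqno models resumption: take the newest (first) checkpoint.
func resumeSeqno(ckpts []Checkpoint) uint64 {
	if len(ckpts) == 0 {
		return 0
	}
	return ckpts[0].Seqno
}

func main() {
	// A VB with a valid checkpoint at seqno 5000.
	ckpts := []Checkpoint{{Seqno: 5000}}

	// The race causes a throughSeqno of 0 to be checkpointed on stop,
	// landing ahead of the valid record.
	ckpts = prependCheckpoint(ckpts, 0)

	// On the next start the pipeline resumes from 0, losing the
	// previously replicated portion.
	fmt.Println("resuming from seqno", resumeSeqno(ckpts))
}
```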
This is my theory on what is happening. Because there is concurrency in play, it is hard to reproduce using simple injections.
Given a CkptMgr object, there are two concurrent paths ongoing:
- Reading the checkpoints and setting the ckpt_obj
- Performing a checkpoint, by getting the throughSeqno and making sure that the throughSeqno is NOT the same as what was read in path 1.
Part 1 concurrency is here:
Each VB is being read and set.
This happens on a per-VB basis, one by one. Note that a slow target connection will slow this process down as well, because for each VB the code must call Pre-Replicate against the target ns_server to validate whether the checkpoint is valid for resumption against the target VBUUID.
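The sequential, per-VB nature of this read path can be sketched as follows. This is a hedged simplification under assumed names: `preReplicate` stands in for the real validation call to target ns_server, and `loadCheckpoints` for the read-and-set loop; the seqno value is illustrative.

```go
package main

import (
	"fmt"
	"time"
)

// preReplicate stands in for the per-VB validation call to the target
// ns_server; a slow target connection makes every iteration slow.
func preReplicate(vb uint16, rtt time.Duration) bool {
	time.Sleep(rtt) // one network round trip per VB
	return true     // assume the checkpoint is valid for the target VBUUID
}

// loadCheckpoints models the sequential read-and-set loop: VBs are
// processed one by one, so total time scales with len(vbs) * rtt.
func loadCheckpoints(vbs []uint16, rtt time.Duration, lastSeqnos map[uint16]uint64) {
	for _, vb := range vbs {
		if preReplicate(vb, rtt) {
			// setTimestampForVB equivalent: record the checkpointed seqno.
			lastSeqnos[vb] = 5000 // illustrative seqno
		}
	}
}

func main() {
	lastSeqnos := make(map[uint16]uint64)
	start := time.Now()
	loadCheckpoints([]uint16{92, 93, 94}, 20*time.Millisecond, lastSeqnos)
	fmt.Printf("loaded %d VBs in %v\n", len(lastSeqnos), time.Since(start).Round(time.Millisecond))
}
```

With ~1024 VBs and a slow target, this loop can take long enough for the pipeline's stop path to run before the reads complete.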
Note that the through sequence number is being set as part of “setTimestampForVB”.
Part 2 concurrency is the “checkpoint before stop” area.
Namely, it gets a throughSeqnoMap first:
This map is a one-time snapshot: once obtained, it is not updated, even though the underlying values may continue to change.
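A minimal sketch of why that snapshot goes stale (the names `snapshotThroughSeqnos` and the `live` map are hypothetical, not goxdcr's actual identifiers): the checkpointing path copies the map once, so later updates by the read path are invisible to the copy.

```go
package main

import "fmt"

// snapshotThroughSeqnos models getting the throughSeqnoMap once:
// it copies current values and is never refreshed afterwards.
func snapshotThroughSeqnos(live map[uint16]uint64) map[uint16]uint64 {
	snap := make(map[uint16]uint64, len(live))
	for vb, seqno := range live {
		snap[vb] = seqno
	}
	return snap
}

func main() {
	live := map[uint16]uint64{94: 0} // VB 94 not yet read, so 0

	// Part 2 takes its snapshot while VB 94 is still unread.
	snap := snapshotThroughSeqnos(live)

	// Part 1 later finishes reading VB 94's checkpoint and updates it.
	live[94] = 5000

	// The snapshot still says 0, and that is what gets checkpointed.
	fmt.Println("live:", live[94], "snapshot:", snap[94])
}
```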
The theory is that while the pipeline is starting, part 1 concurrency is ongoing, working VB by VB.
From the CBSE, we know that VB 94 was very slow.
Part 2 concurrency then takes place and the throughSeqnoMap is obtained. Knowing VB 94 was slow, the throughSeqno map entry for VB 94 would have shown 0.
In the meantime, part 1 concurrency continues and may have set the “lastSeqno” as part of “setTimestampForVB”.
The lastSeqno is queried here:
If there is a race, then this check:
will have failed. This means that last_seqno was set as part of the read (part 1), while the through_seqno from the snapshot (part 2) contained 0 for that VB.
Note that this will result in the warning message we saw:
The timing is tricky and hard to reproduce given the sheer number of VBs and the code paths involved; as such, I cannot reproduce it locally.
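Putting the two paths together, here is a hedged Go sketch of the suspected interleaving. All names (`shouldCheckpoint`, the literal seqnos) are hypothetical simplifications; the real check lives in the goxdcr source linked below.

```go
package main

import "fmt"

// shouldCheckpoint models the guard that compares the snapshot's
// throughSeqno against the VB's lastSeqno: a checkpoint is skipped
// only when the two match (i.e. there is nothing new to record).
func shouldCheckpoint(throughSeqno, lastSeqno uint64) bool {
	return throughSeqno != lastSeqno
}

func main() {
	// Part 1 (checkpoint read) finishes for VB 94 and sets lastSeqno.
	lastSeqno := uint64(5000)

	// Part 2 had already snapshotted throughSeqno before the read completed.
	throughSeqno := uint64(0)

	if shouldCheckpoint(throughSeqno, lastSeqno) {
		// The guard passes for the wrong reason: 0 != 5000, so a new
		// checkpoint with seqno 0 is written ahead of the valid one,
		// producing the warning seen in the logs.
		fmt.Println("checkpointing VB 94 at seqno", throughSeqno)
	}
}
```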
As part of the 7.0 implementation for collections (https://review.couchbase.org/c/goxdcr/+/124353 - ... not pretty I know)… a safe guard was put in place.
If reading checkpoint is slow, this safe guard (http://src.couchbase.org/source/xref/7.0.0/goproj/src/github.com/couchbase/goxdcr/pipeline_svc/checkpoint_manager.go#800-803)
will prevent checkpoint from taking place (http://src.couchbase.org/source/xref/7.0.0/goproj/src/github.com/couchbase/goxdcr/pipeline_svc/checkpoint_manager.go#608-611 ).
The safe guard was not present in < 7.0 versions.
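A sketch of what the 7.0 safeguard amounts to (the names `CkptMgr`, `ckptsLoaded`, `doneLoading`, and `checkpointAllowed` are my own illustrative stand-ins; the linked goxdcr source is authoritative): checkpointing is refused until the checkpoint read phase has completed, so a slow read can no longer race a stop-time checkpoint.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// CkptMgr is a stand-in for the checkpoint manager with the 7.0-style
// safeguard: a flag recording whether checkpoints have been read.
type CkptMgr struct {
	ckptsLoaded atomic.Bool
}

// doneLoading is called once the per-VB checkpoint read completes.
func (m *CkptMgr) doneLoading() { m.ckptsLoaded.Store(true) }

// checkpointAllowed models the safeguard: if the (slow) checkpoint read
// has not finished, refuse to checkpoint rather than risk writing seqno 0.
func (m *CkptMgr) checkpointAllowed() bool {
	return m.ckptsLoaded.Load()
}

func main() {
	m := &CkptMgr{}
	fmt.Println("before load, checkpoint allowed:", m.checkpointAllowed())
	m.doneLoading()
	fmt.Println("after load, checkpoint allowed:", m.checkpointAllowed())
}
```

The flag is atomic because the read path and the checkpointing path run on different goroutines.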
I believe the safe guard mentioned here will prevent the race from happening entirely, but I haven't been able to reproduce this on MadHatter in the first place.