Details
-
Bug
-
Resolution: Fixed
-
Critical
-
6.5.1, 6.6.0, 6.6.1, 6.6.2, 6.5.2, 6.5.0, 6.6.3, 6.6.4
-
Untriaged
-
1
-
No
Description
Note: 7.0 and above are not in the affected version list
Few things need to occur, with timing issue involved (preReq).
- There needs to be a valid checkpoint for a VB for resumption. (i.e. seqno > purgeSeqno)
- There needs to be an error condition that causes pipeline to restart
- XDCR's ability to read simple_store for valid checkpoint must be severely delayed OR target connection must be delayed to cause ckpt iterative read to be slow
With the conditions in place, the following will happen in sequence.
- First, the pipeline should have a valid checkpoint.
- A pipeline needs to restart due to an error (or a resume after a pause)
- As part restart, the pipeline will need to stop. When stopping, the checkpoint manager will write valid checkpoints for each VB. (preReq 1)
- Once the pipeline is stopped, the it will start again
- During the starting portion, an error condition needs to occur (preReq 2). This can be anything and in some customer situations it can be caused by network issues that either CkptMgr or XMEM taking a long time to start, leading to "start part timeout"
- PreReq 3 is in play here. The pipeline start code has not had the ability to read the valid checkpoints that was written above (point 3).
- Point 5 will result in a pipeline restart. During the stopping portion, checkpoint manager will perform a checkpoint. However, because the previous ckpt did not have an opportunity to be read, the through sequence number is actually 0. If the ckpt were read, the throughSeq would have been set to the previously checkpointed number.
- The VBs that did not have its ckpt read (#6) will have a new checkpoint entry of seqno 0 written ahead of the last valid ckpt. Which loses previously replicated portion.
This is my theory on what is happening. Because there are concurrency in play it is hard to reproduce using simple injections.
Given a CkptMgr object, there are two concurrent paths ongoing:
- Reading Ckpt and setting the ckpt_obj
- Performing checkpoint by getting the throughSeqno and making sure that the throughSeqno is NOT the same as what was read in point 1.
Part 1 concurrency is here:
Each VB is being read and set.
http://src.couchbase.org/source/xref/6.6.4/goproj/src/github.com/couchbase/goxdcr/pipeline_svc/checkpoint_manager.go#740-759
738 for _, vbno := range listOfVbs {
|
739 // use math.MaxUint64 as max_seqno to make all checkpoint records eligible
|
740 vbts, vbStats, err := ckmgr.getVBTimestampAndStatsFromCkpts(vbno, ckptDocs[vbno], math.MaxUint64)
|
741 if err != nil {
|
742 err_info := []interface{}{vbno, err}
|
743 err_ch <- err_info
|
744 return
|
745 }
|
746 err = ckmgr.statsMgr.SetVBCountMetrics(vbno, vbStats)
|
747 if err != nil {
|
748 err = fmt.Errorf("%v setting vbStat %v for resulted with err %v", ckmgr.pipeline.Topic(), vbno, err)
|
749 err_info := []interface{}{vbno, err}
|
750 err_ch <- err_info
|
751 return
|
752 }
|
753 err = ckmgr.setTimestampForVB(vbno, vbts)
|
754 if err != nil {
|
755 err_info := []interface{}{vbno, err}
|
756 err_ch <- err_info
|
757 return
|
758 }
|
759 }
|
This is happening on a per-VB basis, one by one. Note that a slow target connection will slow down this process as well because inside of it each one needs to call Pre-Replicate to target ns_server to validate if the ckpt is valid for resume for a target VBUUID
Note that throughSeqnumber is being set as part of “setTimestampForVB”
http://src.couchbase.org/source/xref/6.6.4/goproj/src/github.com/couchbase/goxdcr/pipeline_svc/checkpoint_manager.go#721
715 func (ckmgr *CheckpointManager) setTimestampForVB(vbno uint16, ts *base.VBTimestamp) error {
|
716 ckmgr.logger.Infof("%v Set VBTimestamp: vb=%v, ts.Seqno=%v\n", ckmgr.pipeline.Topic(), vbno, ts.Seqno)
|
717 ckmgr.logger.Debugf("%v vb=%v ts=%v\n", ckmgr.pipeline.Topic(), vbno, ts)
|
718 defer ckmgr.logger.Debugf("%v Set VBTimestamp for vb=%v completed\n", ckmgr.pipeline.Topic(), vbno)
|
719
|
720 //set the start seqno on through_seqno_tracker_svc
|
721 ckmgr.through_seqno_tracker_svc.SetStartSeqno(vbno, ts.Seqno)
|
722
|
Part 2 concurrency is the “checkpoint before stop” area.
Namely, it gets a throughSeqnoMap first:
http://src.couchbase.org/source/xref/6.6.4/goproj/src/github.com/couchbase/goxdcr/pipeline_svc/checkpoint_manager.go#997
This map could be modified once gotten and is not being updated.
The theory is that while pipeline is starting, part 1 concurrency is ongoing doing VB by VB.
From the CBSE, we know that VB 94 was very slow.
Part 2 concurrency takes place and the throughSeqnoMap is gotten. Knowing VB 94 is slow, this means by the throughSeqno map for VB 94 should have shown as 0.
In the meantime, part 1 concurrency is going on and may have set the “lastSeqno” as part of “setTimestampForVB”
The lastSeqno is queried here:
http://src.couchbase.org/source/xref/6.6.4/goproj/src/github.com/couchbase/goxdcr/pipeline_svc/checkpoint_manager.go#1167
If there is a race, then this check:
1187 if through_seqno == last_seqno {
|
1188 ckmgr.logger.Debugf("%v No replication has happened in vb %v since replication start or last checkpoint. seqno=%v. Skip checkpointing\\n", ckmgr.pipeline.Topic(), vbno, last_seqno)
|
1189 return nil
|
1190 }
|
will have failed. This means that last_seqno was set as part of the read (part 1) and the through_seqno contained 0 for VB (2).
Note that this will result in the warning message we saw:
1183 if through_seqno < last_seqno {
|
1184 ckmgr.logger.Infof("%v Checkpoint seqno went backward, possibly due to rollback. vb=%v, old_seqno=%v, new_seqno=%v", ckmgr.pipeline.Topic(), vbno, last_seqno, through_seqno)
|
1185 }
|
The timing is tricky and hard to repro given the sheer number of VBs and the code path, as such I cannot reproduce locally.
As part of 7.0 implementation for collections (https://review.couchbase.org/c/goxdcr/+/124353 - ... not pretty I know)… there was a safe guard in place.
If reading checkpoint is slow, this safe guard (http://src.couchbase.org/source/xref/7.0.0/goproj/src/github.com/couchbase/goxdcr/pipeline_svc/checkpoint_manager.go#800-803)
will prevent checkpoint from taking place (http://src.couchbase.org/source/xref/7.0.0/goproj/src/github.com/couchbase/goxdcr/pipeline_svc/checkpoint_manager.go#608-611 ).
The safe guard was not present in < 7.0 versions.
I believe the safe guard mentioned here will prevent the race from happening entirely, but I haven't been able to reproduce this on MadHatter in the first place.