  Couchbase Server / MB-51604

XDCR - Pipeline restart under duress can overwrite valid ckpt with seqno 0


Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: 6.6.6
    • Affects Version/s: 6.5.0, 6.5.1, 6.5.2, 6.6.0, 6.6.1, 6.6.2, 6.6.3, 6.6.4
    • Component/s: XDCR
    • Triage: Untriaged
    • 1
    • No

    Description

      Note: 7.0 and above are not in the affected version list

      A few things need to occur, with a timing issue involved (prerequisites, referred to below as preReq 1-3).

      1. There needs to be a valid checkpoint for a VB for resumption, i.e. its seqno > purgeSeqno (see the sketch after this list).
      2. There needs to be an error condition that causes the pipeline to restart.
      3. XDCR's ability to read the simple_store for the valid checkpoint must be severely delayed, OR the target connection must be delayed so that the iterative ckpt read is slow.
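
      For clarity, preReq 1 boils down to something like the following (hypothetical helper for illustration, not goxdcr code):

      // A ckpt record is only usable for resumption if its seqno has not already
      // been purged on the source; otherwise the stream cannot resume from it.
      func ckptUsableForResume(ckptSeqno, purgeSeqno uint64) bool {
      	return ckptSeqno > purgeSeqno
      }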

      With the conditions in place, the following will happen in sequence.

      1. First, the pipeline should have a valid checkpoint.
      2. The pipeline needs to restart due to an error (or to resume after a pause).
      3. As part of the restart, the pipeline will first need to stop. When stopping, the checkpoint manager will write a valid checkpoint for each VB (preReq 1).
      4. Once the pipeline is stopped, it will start again.
      5. During the starting portion, an error condition needs to occur (preReq 2). This can be anything; in some customer situations it is caused by network issues that make either CkptMgr or XMEM take a long time to start, leading to a "start part timeout".
      6. PreReq 3 is in play here. The pipeline start code has not yet had a chance to read the valid checkpoints that were written above (point 3).
      7. Point 5 will result in a pipeline restart. During the stopping portion, the checkpoint manager will perform a checkpoint. However, because the previous ckpt never had an opportunity to be read, the through sequence number is actually 0. Had the ckpt been read, the throughSeqno would have been set to the previously checkpointed number.
      8. The VBs whose ckpts were not read (#6) will have a new checkpoint entry of seqno 0 written ahead of the last valid ckpt, which loses the previously replicated portion (a concrete example with hypothetical numbers follows this list).
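
      To make the consequence concrete (hypothetical numbers): suppose VB 94 had a valid ckpt at seqno 50,000. After the stop/start sequence above, a new ckpt entry with seqno 0 sits ahead of the 50,000 entry, so the next resume picks seqno 0 and the ~50,000 mutations that had already been replicated have to be streamed and replicated again.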

      This is my theory of what is happening. Because there is concurrency in play, it is hard to reproduce using simple injections.

      Given a CkptMgr object, there are two concurrent paths ongoing (a minimal sketch of the race follows this list):

      1. Reading the ckpts and setting the ckpt_obj.
      2. Performing a checkpoint by getting the throughSeqno and making sure that the throughSeqno is NOT the same as what was read in path 1.
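
      A minimal, runnable sketch of the suspected interleaving (everything here, including fakeCkptMgr, the method names, the sleeps, and the seqno values, is hypothetical and simplified, not the real goxdcr code):

      package main

      // Path 1 below stands in for reading persisted ckpts and setting per-VB seqnos;
      // path 2 stands in for the "checkpoint before stop" pass that works off a
      // one-time snapshot of the through seqnos.
      import (
      	"fmt"
      	"sync"
      	"time"
      )

      type fakeCkptMgr struct {
      	mu            sync.Mutex
      	lastSeqno     map[uint16]uint64 // set by path 1 as each VB's ckpt is read
      	throughSeqnos map[uint16]uint64 // what the through-seqno tracker would report
      }

      // Path 1: read the persisted ckpt for each VB (slow because of preReq 3),
      // then set the per-VB seqnos, analogous to setTimestampForVB().
      func (c *fakeCkptMgr) readCkptsAndSetTimestamps(vbs []uint16, validSeqno uint64) {
      	for _, vb := range vbs {
      		time.Sleep(50 * time.Millisecond) // slow simple_store read / Pre-Replicate
      		c.mu.Lock()
      		c.lastSeqno[vb] = validSeqno     // analogous to updating the ckpt record
      		c.throughSeqnos[vb] = validSeqno // analogous to SetStartSeqno()
      		c.mu.Unlock()
      	}
      }

      // Path 2: checkpoint-before-stop, driven by a snapshot of the through seqnos
      // that is taken once and never refreshed.
      func (c *fakeCkptMgr) checkpointBeforeStop(vbs []uint16) {
      	c.mu.Lock()
      	snapshot := make(map[uint16]uint64, len(c.throughSeqnos))
      	for vb, s := range c.throughSeqnos {
      		snapshot[vb] = s
      	}
      	c.mu.Unlock()

      	time.Sleep(75 * time.Millisecond) // path 1 makes progress after the snapshot

      	for _, vb := range vbs {
      		through := snapshot[vb] // still 0 for VBs that path 1 had not reached yet
      		c.mu.Lock()
      		last := c.lastSeqno[vb] // may have been set by path 1 in the meantime
      		c.mu.Unlock()
      		if through == last {
      			continue // "nothing replicated since last ckpt" guard: skip checkpointing
      		}
      		// Race hit: through == 0 while last == the valid ckpt seqno, so a
      		// seqno-0 ckpt would be written ahead of the valid one.
      		fmt.Printf("vb %v: ckpt would be written at seqno %v (last known %v)\n", vb, through, last)
      	}
      }

      func main() {
      	vbs := []uint16{94}
      	c := &fakeCkptMgr{
      		lastSeqno:     map[uint16]uint64{94: 0},
      		throughSeqnos: map[uint16]uint64{94: 0},
      	}
      	go c.readCkptsAndSetTimestamps(vbs, 50000) // part 1, slow
      	c.checkpointBeforeStop(vbs)                // part 2, snapshots early
      }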

      Part 1 concurrency is here:
      Each VB is being read and set.
      http://src.couchbase.org/source/xref/6.6.4/goproj/src/github.com/couchbase/goxdcr/pipeline_svc/checkpoint_manager.go#740-759

      738 	for _, vbno := range listOfVbs {
      739 		// use math.MaxUint64 as max_seqno to make all checkpoint records eligible
      740 		vbts, vbStats, err := ckmgr.getVBTimestampAndStatsFromCkpts(vbno, ckptDocs[vbno], math.MaxUint64)
      741 		if err != nil {
      742 			err_info := []interface{}{vbno, err}
      743 			err_ch <- err_info
      744 			return
      745 		}
      746 		err = ckmgr.statsMgr.SetVBCountMetrics(vbno, vbStats)
      747 		if err != nil {
      748 			err = fmt.Errorf("%v setting vbStat %v for resulted with err %v", ckmgr.pipeline.Topic(), vbno, err)
      749 			err_info := []interface{}{vbno, err}
      750 			err_ch <- err_info
      751 			return
      752 		}
      753 		err = ckmgr.setTimestampForVB(vbno, vbts)
      754 		if err != nil {
      755 			err_info := []interface{}{vbno, err}
      756 			err_ch <- err_info
      757 			return
      758 		}
      759 	}
      

      This happens on a per-VB basis, one by one. Note that a slow target connection will slow this process down as well, because inside the loop each VB needs to call Pre-Replicate against the target ns_server to validate whether the ckpt is valid to resume from for a given target VBUUID.
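
      To put rough numbers on this (hypothetical figures for illustration): with 1024 VBs handled serially and, say, ~100 ms per Pre-Replicate round trip to a slow target, this loop alone takes on the order of 1024 x 0.1 s, roughly 100 s, before the last VBs have their checkpoints read and set. That is ample time for a "start part timeout" (preReq 2) and the subsequent checkpoint-before-stop to run first.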

      Note that the throughSeqno is being set as part of “setTimestampForVB”:
      http://src.couchbase.org/source/xref/6.6.4/goproj/src/github.com/couchbase/goxdcr/pipeline_svc/checkpoint_manager.go#721

      715 func (ckmgr *CheckpointManager) setTimestampForVB(vbno uint16, ts *base.VBTimestamp) error {
      716 	ckmgr.logger.Infof("%v Set VBTimestamp: vb=%v, ts.Seqno=%v\n", ckmgr.pipeline.Topic(), vbno, ts.Seqno)
      717 	ckmgr.logger.Debugf("%v vb=%v ts=%v\n", ckmgr.pipeline.Topic(), vbno, ts)
      718 	defer ckmgr.logger.Debugf("%v Set VBTimestamp for vb=%v completed\n", ckmgr.pipeline.Topic(), vbno)
      719 
      720 	//set the start seqno on through_seqno_tracker_svc
      721 	ckmgr.through_seqno_tracker_svc.SetStartSeqno(vbno, ts.Seqno)
      722 
      

      Part 2 concurrency is the “checkpoint before stop” area.

      Namely, it gets a throughSeqnoMap first:
      http://src.couchbase.org/source/xref/6.6.4/goproj/src/github.com/couchbase/goxdcr/pipeline_svc/checkpoint_manager.go#997

      This map is a one-time snapshot: the underlying through seqnos can keep changing after it is obtained, and the map is not updated to reflect that.
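
      In other words, part 2 works off something along these lines (the exact call at that line is paraphrased from memory and may differ):

      // Snapshot taken once at the start of checkpoint-before-stop:
      through_seqno_map := ckmgr.through_seqno_tracker_svc.GetThroughSeqnos()
      // Any SetStartSeqno() that part 1 performs after this point is not reflected
      // in through_seqno_map, so a slow VB can still read as 0 here.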

      The theory is that while the pipeline is starting, part 1 concurrency is ongoing, working VB by VB.
      From the CBSE, we know that VB 94 was very slow.

      Part 2 concurrency takes place and the throughSeqnoMap is obtained. Knowing VB 94 was slow, the throughSeqno map entry for VB 94 would have shown 0.
      In the meantime, part 1 concurrency is still going on and may have set the “lastSeqno” as part of “setTimestampForVB”.

      The lastSeqno is queried here:
      http://src.couchbase.org/source/xref/6.6.4/goproj/src/github.com/couchbase/goxdcr/pipeline_svc/checkpoint_manager.go#1167

      If there is a race, then this check:

      1187 	if through_seqno == last_seqno {
      1188 		ckmgr.logger.Debugf("%v No replication has happened in vb %v since replication start or last checkpoint. seqno=%v. Skip checkpointing\\n", ckmgr.pipeline.Topic(), vbno, last_seqno)
      1189 		return nil
      1190 	}
      

      will not trigger the skip. This means that last_seqno was set as part of the read (part 1) while the through_seqno from the part 2 snapshot still contained 0 for that VB.
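
      Concretely (hypothetical numbers): part 1 has already read VB 94's valid ckpt and set last_seqno = 50000, while the stale snapshot still reports through_seqno = 0 for that VB; since 0 != 50000, the skip at lines 1187-1189 does not happen and a checkpoint at seqno 0 is performed.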
      Note that this will result in the warning message we saw:

      1183 	if through_seqno < last_seqno {
      1184 		ckmgr.logger.Infof("%v Checkpoint seqno went backward, possibly due to rollback. vb=%v, old_seqno=%v, new_seqno=%v", ckmgr.pipeline.Topic(), vbno, last_seqno, through_seqno)
      1185 	}
      

      The timing is tricky and hard to repro given the sheer number of VBs and the code path involved; as such, I cannot reproduce it locally.

      As part of the 7.0 implementation for collections (https://review.couchbase.org/c/goxdcr/+/124353 - ... not pretty I know)… a safeguard was put in place.
      If reading checkpoints is slow, this safeguard (http://src.couchbase.org/source/xref/7.0.0/goproj/src/github.com/couchbase/goxdcr/pipeline_svc/checkpoint_manager.go#800-803)
      will prevent the checkpoint from taking place (http://src.couchbase.org/source/xref/7.0.0/goproj/src/github.com/couchbase/goxdcr/pipeline_svc/checkpoint_manager.go#608-611).
      The safeguard was not present in versions before 7.0.

      I believe this safeguard prevents the race from happening entirely, but I haven't been able to reproduce the issue on MadHatter in the first place.
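
      To illustrate the shape of that safeguard, a hypothetical sketch of the pattern only (not the actual 7.0 code; assumes imports fmt and sync/atomic):

      // Hypothetical guard: refuse to checkpoint until the ckpt-reading phase has
      // actually completed for this pipeline instance.
      type guardedCkptMgr struct {
      	ckptsLoaded uint32 // flipped to 1 once the persisted ckpts have been read and applied
      }

      func (c *guardedCkptMgr) markCkptsLoaded() {
      	atomic.StoreUint32(&c.ckptsLoaded, 1)
      }

      // Called by the checkpoint-before-stop path.
      func (c *guardedCkptMgr) performCkpt(vbno uint16) error {
      	if atomic.LoadUint32(&c.ckptsLoaded) == 0 {
      		// The valid ckpts were never read for this run; writing now could persist
      		// a seqno-0 record ahead of a valid one, so bail out instead.
      		return fmt.Errorf("vb %v: checkpoints not loaded yet, skipping checkpointing", vbno)
      	}
      	// ... proceed with normal checkpointing ...
      	return nil
      }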

      Attachments


        Activity

          People

            Assignee: neil.huang Neil Huang
            Reporter: neil.huang Neil Huang
            Votes: 0
            Watchers: 5

            Dates

              Created:
              Updated:
              Resolved:

              Gerrit Reviews

                There are no open Gerrit changes
