Note: 7.0 and above are not in the affected version list
A few things need to occur, with a timing issue involved (prerequisites, referred to below as preReq 1-3):
- There needs to be a valid checkpoint for a VB to resume from (i.e. seqno > purgeSeqno).
- There needs to be an error condition that causes the pipeline to restart.
- XDCR's ability to read simple_store for valid checkpoints must be severely delayed, OR the target connection must be delayed enough that the iterative checkpoint read is slow.
With these conditions in place, the following will happen in sequence:
1. The pipeline has a valid checkpoint.
2. The pipeline needs to restart due to an error (or resume after a pause).
3. As part of the restart, the pipeline first has to stop. While stopping, the checkpoint manager writes a valid checkpoint for each VB (preReq 1).
4. Once the pipeline has stopped, it starts again.
5. During the start, an error condition occurs (preReq 2). This can be anything; in some customer situations it is caused by network issues that make either CkptMgr or XMEM take a long time to start, leading to a "start part timeout".
6. PreReq 3 is in play here: the pipeline start code has not yet had a chance to read the valid checkpoints that were written in point 3.
7. Point 5 results in another pipeline restart. During the stopping portion, the checkpoint manager performs a checkpoint. However, because the previous checkpoint never got read, the through sequence number is actually 0. Had the checkpoint been read, the throughSeqno would have been set to the previously checkpointed number.
8. Each VB whose checkpoint was not read (point 6) gets a new checkpoint entry with seqno 0 written ahead of the last valid checkpoint, which loses the previously replicated progress.
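To make the data-loss mechanism in the last step concrete, here is a minimal Go sketch. The types and functions (`Checkpoint`, `prependCheckpoint`, `resumeSeqno`) are hypothetical simplifications, not goxdcr's actual identifiers; the assumption is that checkpoint records are kept newest-first and resumption reads the newest entry.

```go
package main

import "fmt"

// Checkpoint is a stand-in for a per-VB checkpoint record (hypothetical type).
type Checkpoint struct {
	Seqno uint64
}

// prependCheckpoint models writing a new checkpoint ahead of older ones:
// the newest record is stored first, which is what resumption reads.
func prependCheckpoint(ckpts []Checkpoint, seqno uint64) []Checkpoint {
	return append([]Checkpoint{{Seqno: seqno}}, ckpts...)
}

// resumeSeqno models resumption: take the newest (first) checkpoint.
func resumeSeqno(ckpts []Checkpoint) uint64 {
	if len(ckpts) == 0 {
		return 0
	}
	return ckpts[0].Seqno
}

func main() {
	// A VB with a valid checkpoint at seqno 5000.
	ckpts := []Checkpoint{{Seqno: 5000}}

	// The race causes a throughSeqno of 0 to be checkpointed on stop,
	// landing ahead of the valid record.
	ckpts = prependCheckpoint(ckpts, 0)

	// On the next start the pipeline resumes from 0, losing the
	// previously replicated portion.
	fmt.Println("resuming from seqno", resumeSeqno(ckpts))
}
```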
This is my theory on what is happening. Because there is concurrency in play, it is hard to reproduce using simple injections.
Given a CkptMgr object, there are two concurrent paths ongoing:
- Reading the checkpoints and setting the ckpt_obj
- Performing a checkpoint, by getting the throughSeqno and making sure that the throughSeqno is NOT the same as what was read in path 1.
Part 1 concurrency is here:
Each VB is being read and set.
This happens on a per-VB basis, one by one. Note that a slow target connection will slow this process down as well, because for each VB the code must call Pre-Replicate against the target ns_server to validate whether the checkpoint is valid for resumption against the target VBUUID.
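The sequential, per-VB nature of this read path can be sketched as follows. This is a hedged simplification under assumed names: `preReplicate` stands in for the real validation call to target ns_server, and `loadCheckpoints` for the read-and-set loop; the seqno value is illustrative.

```go
package main

import (
	"fmt"
	"time"
)

// preReplicate stands in for the per-VB validation call to the target
// ns_server; a slow target connection makes every iteration slow.
func preReplicate(vb uint16, rtt time.Duration) bool {
	time.Sleep(rtt) // one network round trip per VB
	return true     // assume the checkpoint is valid for the target VBUUID
}

// loadCheckpoints models the sequential read-and-set loop: VBs are
// processed one by one, so total time scales with len(vbs) * rtt.
func loadCheckpoints(vbs []uint16, rtt time.Duration, lastSeqnos map[uint16]uint64) {
	for _, vb := range vbs {
		if preReplicate(vb, rtt) {
			// setTimestampForVB equivalent: record the checkpointed seqno.
			lastSeqnos[vb] = 5000 // illustrative seqno
		}
	}
}

func main() {
	lastSeqnos := make(map[uint16]uint64)
	start := time.Now()
	loadCheckpoints([]uint16{92, 93, 94}, 20*time.Millisecond, lastSeqnos)
	fmt.Printf("loaded %d VBs in %v\n", len(lastSeqnos), time.Since(start).Round(time.Millisecond))
}
```

With ~1024 VBs and a slow target, this loop can take long enough for the pipeline's stop path to run before the reads complete.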
Note that the through sequence number is being set as part of “setTimestampForVB”.
Part 2 concurrency is the “checkpoint before stop” area.
Namely, it gets a throughSeqnoMap first:
This map is a one-time snapshot: once obtained, it is not updated, even though the underlying values may continue to change.
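A minimal sketch of why that snapshot goes stale (the names `snapshotThroughSeqnos` and the `live` map are hypothetical, not goxdcr's actual identifiers): the checkpointing path copies the map once, so later updates by the read path are invisible to the copy.

```go
package main

import "fmt"

// snapshotThroughSeqnos models getting the throughSeqnoMap once:
// it copies current values and is never refreshed afterwards.
func snapshotThroughSeqnos(live map[uint16]uint64) map[uint16]uint64 {
	snap := make(map[uint16]uint64, len(live))
	for vb, seqno := range live {
		snap[vb] = seqno
	}
	return snap
}

func main() {
	live := map[uint16]uint64{94: 0} // VB 94 not yet read, so 0

	// Part 2 takes its snapshot while VB 94 is still unread.
	snap := snapshotThroughSeqnos(live)

	// Part 1 later finishes reading VB 94's checkpoint and updates it.
	live[94] = 5000

	// The snapshot still says 0, and that is what gets checkpointed.
	fmt.Println("live:", live[94], "snapshot:", snap[94])
}
```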
The theory is that while the pipeline is starting, part 1 concurrency is ongoing, working VB by VB.
From the CBSE, we know that VB 94 was very slow.
Part 2 concurrency then takes place and the throughSeqnoMap is obtained. Knowing VB 94 was slow, the throughSeqno map entry for VB 94 would have shown 0.
In the meantime, part 1 concurrency continues and may have set the “lastSeqno” as part of “setTimestampForVB”.
The lastSeqno is queried here:
If there is a race, then this check:
will have failed. This means that last_seqno was set as part of the read (part 1), while the through_seqno from the snapshot (part 2) contained 0 for that VB.
Note that this will result in the warning message we saw:
The timing is tricky and hard to reproduce given the sheer number of VBs and the code paths involved; as such, I cannot reproduce it locally.
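Putting the two paths together, here is a hedged Go sketch of the suspected interleaving. All names (`shouldCheckpoint`, the literal seqnos) are hypothetical simplifications; the real check lives in the goxdcr source linked below.

```go
package main

import "fmt"

// shouldCheckpoint models the guard that compares the snapshot's
// throughSeqno against the VB's lastSeqno: a checkpoint is skipped
// only when the two match (i.e. there is nothing new to record).
func shouldCheckpoint(throughSeqno, lastSeqno uint64) bool {
	return throughSeqno != lastSeqno
}

func main() {
	// Part 1 (checkpoint read) finishes for VB 94 and sets lastSeqno.
	lastSeqno := uint64(5000)

	// Part 2 had already snapshotted throughSeqno before the read completed.
	throughSeqno := uint64(0)

	if shouldCheckpoint(throughSeqno, lastSeqno) {
		// The guard passes for the wrong reason: 0 != 5000, so a new
		// checkpoint with seqno 0 is written ahead of the valid one,
		// producing the warning seen in the logs.
		fmt.Println("checkpointing VB 94 at seqno", throughSeqno)
	}
}
```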
As part of the 7.0 implementation for collections (https://review.couchbase.org/c/goxdcr/+/124353 - ... not pretty I know)… a safe guard was put in place.
If reading checkpoint is slow, this safe guard (http://src.couchbase.org/source/xref/7.0.0/goproj/src/github.com/couchbase/goxdcr/pipeline_svc/checkpoint_manager.go#800-803)
will prevent checkpoint from taking place (http://src.couchbase.org/source/xref/7.0.0/goproj/src/github.com/couchbase/goxdcr/pipeline_svc/checkpoint_manager.go#608-611 ).
The safe guard was not present in < 7.0 versions.
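A sketch of what the 7.0 safeguard amounts to (the names `CkptMgr`, `ckptsLoaded`, `doneLoading`, and `checkpointAllowed` are my own illustrative stand-ins; the linked goxdcr source is authoritative): checkpointing is refused until the checkpoint read phase has completed, so a slow read can no longer race a stop-time checkpoint.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// CkptMgr is a stand-in for the checkpoint manager with the 7.0-style
// safeguard: a flag recording whether checkpoints have been read.
type CkptMgr struct {
	ckptsLoaded atomic.Bool
}

// doneLoading is called once the per-VB checkpoint read completes.
func (m *CkptMgr) doneLoading() { m.ckptsLoaded.Store(true) }

// checkpointAllowed models the safeguard: if the (slow) checkpoint read
// has not finished, refuse to checkpoint rather than risk writing seqno 0.
func (m *CkptMgr) checkpointAllowed() bool {
	return m.ckptsLoaded.Load()
}

func main() {
	m := &CkptMgr{}
	fmt.Println("before load, checkpoint allowed:", m.checkpointAllowed())
	m.doneLoading()
	fmt.Println("after load, checkpoint allowed:", m.checkpointAllowed())
}
```

The flag is atomic because the read path and the checkpointing path run on different goroutines.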
I believe the safe guard mentioned here will prevent the race from happening entirely, but I haven't been able to reproduce this on MadHatter in the first place.