Details
Description
Race Condition 1
Race condition is found via testing and running through the dev collection test suites
The race happens when a spec is to be deleted at the same time backfillMgr or backfill handler is handling a spec change for committing a backfill request:
2023-05-23T13:15:34.441-07:00 INFO GOXDCR.ReplMgr: Deleting replication 62a76551fb293525f2319b98585ac59b/B0/B1
|
2023-05-23T13:15:34.441-07:00 ERRO GOXDCR.ReplMgr: requested resource not found
|
2023-05-23T13:15:34.441-07:00 INFO GOXDCR.AdminPort: Finished doDeleteReplicationRequest
|
2023-05-23T13:15:34.953-07:00 INFO GOXDCR.BackfillMgr: Backfill spec change callback for 62a76551fb293525f2319b98585ac59b/B1/B2 detected old new vb: 325 (0,69] vb: 850 (0,29] vb: 227 (0,61] vb: 368 (0,37] vb: 824 (0,37] vb: 989 (0,65] vb: 808 (0,57] vb: 765 (0,29] vb: 327 (0,17] vb: 154 (0,61] vb: 106 (0,29] vb: 29 (0,17] vb: 972 (0,25] vb: 702 (0,69] vb: 1009 (0,61] vb: 877 (0,29] vb: 292 (0,37] vb: 44 (0,21] vb: 722 (0,69] vb: 546 (0,13] vb: 732 (0,69] vb: 266 (0,65]
|
As a result, the backfill request is launched but the replication spec is no longer there
2023-05-23T13:15:34.953-07:00 INFO GOXDCR.BackfillReplSvc: BackfillReplication spec 62a76551fb293525f2319b98585ac59b/B1/B2 has been updated, rev=0x1005ddfa0
|
2023-05-23T13:15:34.953-07:00 INFO GOXDCR.PipelineMgr: PipelineOpSerializer 62a76551fb293525f2319b98585ac59b/B1/B2 handling job: Job for 62a76551fb293525f2319b98585ac59b/B1/B2 Type: BackfillPipelineStart
|
2023-05-23T13:15:34.953-07:00 ERRO GOXDCR.PipelineMgr: Invalid replication status 62a76551fb293525f2319b98585ac59b/B1/B2, failed to retrieve spec. err=requested resource not found
|
2023-05-23T13:15:34.953-07:00 INFO GOXDCR.PipelineMgr: ReplicationStatus is created and set with 62a76551fb293525f2319b98585ac59b/B1/B2
|
2023-05-23T13:15:34.953-07:00 ERRO GOXDCR.PipelineMgr: Invalid replication status 62a76551fb293525f2319b98585ac59b/B1/B2, failed to retrieve spec. err=requested resource not found
|
2023-05-23T13:15:34.953-07:00 INFO GOXDCR.PipelineMgr: Pipeline updater 62a76551fb293525f2319b98585ac59b/B1/B2 is launched with retry_interval=10
|
2023-05-23T13:15:34.953-07:00 ERRO GOXDCR.PipelineMgr: Invalid replication status 62a76551fb293525f2319b98585ac59b/B1/B2, failed to retrieve spec. err=requested resource not found
|
2023-05-23T13:15:34.953-07:00 INFO GOXDCR.PipelineMgr: Replication status received startBackfill, current status=name={62a76551fb293525f2319b98585ac59b/B1/B2}, status={Pending}, errors={[]}, oldProgress={}, progress={}
|
2023-05-23T13:15:34.953-07:00 INFO GOXDCR.PipelineMgr: PipelineOpSerializer 62a76551fb293525f2319b98585ac59b/B1/B2 done handling job: Job for 62a76551fb293525f2319b98585ac59b/B1/B2 Type: BackfillPipelineStart
|
2023-05-23T13:15:34.953-07:00 INFO GOXDCR.PipelineMgr: Replication 62a76551fb293525f2319b98585ac59b/B1/B2's backfill Pipeline is starting
|
2023-05-23T13:15:34.953-07:00 INFO GOXDCR.PipelineMgr: Starting the backfill pipeline 62a76551fb293525f2319b98585ac59b/B1/B2
|
2023-05-23T13:15:34.953-07:00 INFO GOXDCR.PipelineMgr: Replication 62a76551fb293525f2319b98585ac59b/B1/B2 backfill start experienced error(s): pipelineMgr.ReplicationStatus(62a76551fb293525f2319b98585ac59b/B1/B2) : Error: Replication Status is missing when starting pipeline.. Scheduling a redo.
|
2023-05-23T13:15:34.953-07:00 ERRO GOXDCR.PipelineMgr: Invalid replication status 62a76551fb293525f2319b98585ac59b/B1/B2, failed to retrieve spec. err=requested resource not found
|
2023-05-23T13:15:34.953-07:00 INFO GOXDCR.PipelineMgr: Replication status is updated with backfill error(s) pipelineMgr.ReplicationStatus(62a76551fb293525f2319b98585ac59b/B1/B2) : Error: Replication Status is missing when starting pipeline., current status=name={62a76551fb293525f2319b98585ac59b/B1/B2}, status={Pending}, errors={[]}, oldProgress={}, progress={}
|
2023-05-23T13:15:34.953-07:00 INFO GOXDCR.PipelineMgr: Backfill Replication 62a76551fb293525f2319b98585ac59b/B1/B2's status experienced changes or errors (<nil>). However, last update resulted in failure, so will reschedule a future update
|
2023-05-23T13:15:34.953-07:00 INFO GOXDCR.PipelineMgr: Pipeline updater scheduled to start backfill pipeline in 10s
|
2023-05-23T13:15:35.059-07:00 INFO GOXDCR.ThrSeqTrackSvc: backfill_62a76551fb293525f2319b98585ac59b/B1/B2_ThroughSeqnoTracker bg scanner: pipeline no longer running, stopping
|
2023-05-23T13:15:39.174-07:00 INFO GOXDCR.ResourceMgr: Resource Manager State = <nil>
|
2023-05-23T13:15:39.174-07:00 INFO GOXDCR.ResourceMgr: backlogCount=0, noBacklogCount=0 extraQuota=false cpuNotMaxedCount=0 throughputDropCount=0
|
2023-05-23T13:15:39.174-07:00 INFO GOXDCR.ResourceMgr: DcpPriorityMap=map[]
|
As a result, the backfill pipeline retry loops.
Then when a spec gets recreated, an older version of backfill is started:
2023-05-23T13:17:58.623-07:00 INFO GOXDCR.AdminPort: doCreateReplicationRequest
|
2023-05-23T13:17:58.625-07:00 INFO GOXDCR.AdminPort: Request parameters: justValidate=false, fromBucket=B1, toCluster=C2, toBucket=B2, settings=map[CollectionsMgtMulti:{map[CollectionsMgtMulti:12] map[collectionsMigrationMode:true]} checkpoint_interval:60 colMappingRules:map[(country == "United States" OR country = "Canada") AND type="brewery":S3.col1 country != "United States" AND country != "Canada" AND type="brewery":S3.col2] stats_interval:500]
|
This can lead to bad situations like missing data or misreplicated data:
As an example, the following DCP nozzle message shows up for the previous backfill task, even though the latest spec is a migration mode spec:
2023-05-23T13:18:00.098-07:00 WARN GOXDCR.DcpNozzle: getHighSeqno could not find collection ID for S1.col2
|
2023-05-23T13:18:00.098-07:00 WARN GOXDCR.DcpNozzle: getHighSeqno could not find collection ID for S1.col4
|
2023-05-23T13:18:00.098-07:00 WARN GOXDCR.DcpNozzle: getHighSeqno could not find collection ID for S1.col1
|
2023-05-23T13:18:00.098-07:00 WARN GOXDCR.DcpNozzle: getHighSeqno could not find collection ID for S1.col3
|
The new spec has no S1.col1/2/3/4 anywhere.
RCA
I couldn't reproduce it locally but from log and code analysis, the race happens between backfillMgr and backfillReplSpecSvc.
When a replication is deleted, both of them are issued on medium priority. Ideally, backfillMgr would need to stop the handlers before the specSvc is notified. Otherwise, if a handler is slow and dangling, it would lead to a late and out-of-date write to the service.
Race Condition 2
There is another race condition hidden that are shown from these tests:
Found the issue potentially due to the test set up.
For XDCR test 3b, we perform a rollback to 0.
3b_pause_resume_rollback0_tc.shlib: setReplicationSettings "C1" "B0" "C2" "B1" "xdcrDevBackfillRollbackTo0VB=$VBToRollback"
|
Leading to:
2023-05-24T20:04:16.713-07:00 INFO GOXDCR.DcpNozzle: Dev injection for vb 1022 received start seqno 0... forcing a rollback to 0
|
2023-05-24T20:04:16.713-07:00 INFO GOXDCR.CheckpointMgr: backfill_54b414a3621edf5b5ebde9d6d7f28b83/B0/B1 Received rollback from DCP stream vb=1022, rollbackseqno=0
|
2023-05-24T20:04:16.713-07:00 INFO GOXDCR.PipelineMgr: PipelineOpSerializer 54b414a3621edf5b5ebde9d6d7f28b83/B0/B1 handling job: Job for 54b414a3621edf5b5ebde9d6d7f28b83/B0/B1 Type: BackfillPipelineStopWStoppedCb
|
2023-05-24T20:04:16.713-07:00 INFO GOXDCR.PipelineMgr: pipeline 54b414a3621edf5b5ebde9d6d7f28b83/B0/B1 updater received backfill callback registration. Current backfill callback count: 1
|
2023-05-24T20:04:16.713-07:00 [INFO] UPR_STREAMREQ for vb 891 successful
|
This goes to:
func (dcp *DcpNozzle) performRollback(vbno uint16, rollbackseq uint64) error {
|
…
|
Leading to checkpoint manager:
func (ckmgr *CheckpointManager) UpdateVBTimestamps(vbno uint16, rollbackseqno uint64) (*base.VBTimestamp, error) {
|
ckmgr.logger.Infof("%v Received rollback from DCP stream vb=%v, rollbackseqno=%v\n", ckmgr.pipeline.FullTopic(), vbno, rollbackseqno)
|
…
|
if ckmgr.pipeline.Type() == common.BackfillPipeline {
|
ckmgr.clearBackfillStartingTsIfNeeded(vbno, rollbackseqno)
|
}
|
following v
func (ckmgr *CheckpointManager) NotifyBackfillMgrRollbackTo0(vbno uint16) {
|
…
|
backfillMgrPipelineSvc := ckmgr.pipeline.RuntimeContext().Service(base.BACKFILL_MGR_SVC)
|
|
settingsMap := make(map[string]interface{})
|
settingsMap[base.NameKey] = topic
|
settingsMap[metadata.CollectionsVBRollbackTo0Key] = int(vbno)
|
backfillMgrPipelineSvc.UpdateSettings(settingsMap)
|
v
vbno, exists3 := settings[metadata.CollectionsVBRollbackTo0Key].(int)
|
if topicExists && exists3 && vbno >= 0 {
|
err := p.backfillMgr.HandleRollbackTo0ForVB(topic, uint16(vbno))
|
if err != nil {
|
errMap[metadata.CollectionsVBRollbackTo0Key] = err
|
}
|
}
|
v
func (b *BackfillMgr) HandleRollbackTo0ForVB(topic string, vbno uint16) error {
|
…
|
|
b.specReqHandlersMtx.RLock()
|
handler := b.specToReqHandlerMap[topic]
|
b.specReqHandlersMtx.RUnlock()
|
|
if handler == nil {
|
// This should not happen
|
err := fmt.Errorf("Unable to find handler for spec %v", topic)
|
b.logger.Errorf(err.Error())
|
return err
|
}
|
|
<If the deletion actually happened in time, the above wouldn’t happen).
|
|
// When rolling back to 0, we should delete checkpoints otherwise a non-0 resume leading to DCP rollback
|
// means that it'll come back to this path yet again
|
deleteCkptsWrapper := func() error {
|
return b.checkpointsSvc.DelCheckpointsDoc(common.ComposeFullTopic(topic, common.BackfillPipeline), vbno, "")
|
}
|
|
cb, errCb := handler.GetRollbackTo0VBSpecificBackfillCb(vbno, deleteCkptsWrapper)
|
…
|
err := b.pipelineMgr.HaltBackfillWithCb(topic, cb, errCb, true)
|
if err != nil {
|
b.logger.Errorf("Unable to request backfill pipeline to stop for %v : %v - vbno %v rollback to zero from DCP will trigger this process again", topic, err, vbno)
|
return err
|
}
|
b.logger.Infof("NEIL DEBUG topic %v requestBackfill called 1\n", topic)
|
err = b.pipelineMgr.RequestBackfill(topic)
|
if err != nil {
|
b.logger.Errorf("Unable to request backfill pipeline to start for %v : %v - may require manual restart of pipeline", topic, err)
|
return err
|
}
|
return nil
|
|
…
2023-05-24T20:04:17.359-07:00 INFO GOXDCR.CheckpointMgr: Done with setting starting seqno for BackfillPipeline backfill_54b414a3621edf5b5ebde9d6d7f28b83/B0/B1
|
At around this time the replication is deleted
2023-05-24T20:04:21.442-07:00 INFO GOXDCR.ReplMgr: Deleting replication 54b414a3621edf5b5ebde9d6d7f28b83/B0/B1
|
We got an error because it was stopping:”
2023-05-24T20:04:26.716-07:00 ERRO GOXDCR.BackfillMgr: 54b414a3621edf5b5ebde9d6d7f28b83/B0/B1 Unable to send rollbackTo0 request due to BackfillReqHandler is stopping
|
But because of the pipeline manager job-like design with the serializer in the beginning, then the updater channel, it will try to restart the pipeline:
2023-05-24T20:04:26.917-07:00 INFO GOXDCR.PipelineMgr: Replication 54b414a3621edf5b5ebde9d6d7f28b83/B0/B1 backfill start experienced error(s): pipelineMgr.ReplicationStatus(54b414a3621edf5b5ebde9d6d7f28b83/B0/B1) : Error: Replication Status is missing when starting pipeline.. Scheduling a redo.
|
We should just watch for deleted specs and don’t start pipelines in these cases.
Race condition 3
We also have another race here where pipeline manager could be restarted after a spec has been deleted.
This is confirmed via a matching between starting and shutting down:
$ grep -e "is launched with retry_interval" -e "status is finished shutting down" goxdcr.log
|
2023-05-25T10:19:32.507-07:00 INFO GOXDCR.PipelineMgr: Pipeline updater db0f8c5d7c24cf16224a1ee2a04e352b/B0/B1 is launched with retry_interval=10
|
2023-05-25T10:23:16.010-07:00 INFO GOXDCR.PipelineMgr: Replication db0f8c5d7c24cf16224a1ee2a04e352b/B0/B1 status is finished shutting down
|
2023-05-25T10:23:16.011-07:00 INFO GOXDCR.PipelineMgr: Pipeline updater db0f8c5d7c24cf16224a1ee2a04e352b/B0/B1 is launched with retry_interval=10
|
2023-05-25T10:25:27.295-07:00 INFO GOXDCR.PipelineMgr: Pipeline updater db0f8c5d7c24cf16224a1ee2a04e352b/B1/B2 is launched with retry_interval=10
|
2023-05-25T10:25:41.696-07:00 INFO GOXDCR.PipelineMgr: Replication db0f8c5d7c24cf16224a1ee2a04e352b/B1/B2 status is finished shutting down
|
…
|
First replication is deleted:
2023-05-25T10:23:10.467-07:00 INFO GOXDCR.ReplMgr: Pipeline db0f8c5d7c24cf16224a1ee2a04e352b/B0/B1 is deleted
|
2023-05-25T10:23:10.467-07:00 INFO GOXDCR.AdminPort: Finished doDeleteReplicationRequest
|
Then we see a backfill pipeline start triggering a new and zombie version of the updater:
2023-05-25T10:23:16.010-07:00 INFO GOXDCR.PipelineMgr: PipelineOpSerializer db0f8c5d7c24cf16224a1ee2a04e352b/B0/B1 done handling job: Job for db0f8c5d7c24cf16224a1ee2a04e352b/B0/B1 Type: PipelineDeletion
|
2023-05-25T10:23:16.010-07:00 INFO GOXDCR.PipelineMgr: PipelineOpSerializer db0f8c5d7c24cf16224a1ee2a04e352b/B0/B1 handling job: Job for db0f8c5d7c24cf16224a1ee2a04e352b/B0/B1 Type: BackfillPipelineStart
|
2023-05-25T10:23:16.010-07:00 ERRO GOXDCR.PipelineMgr: Invalid replication status db0f8c5d7c24cf16224a1ee2a04e352b/B0/B1, failed to retrieve spec. err=requested resource not found
|
2023-05-25T10:23:16.010-07:00 INFO GOXDCR.PipelineMgr: ReplicationStatus is created and set with db0f8c5d7c24cf16224a1ee2a04e352b/B0/B1
|
Attachments
Issue Links
For Gerrit Dashboard: MB-57085 | ||||||
---|---|---|---|---|---|---|
# | Subject | Branch | Project | Status | CR | V |
191557,4 | MB-57085: race conditions fixes | master | goxdcr | Status: MERGED | +2 | +1 |