Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-59073

[BP 7.1.7] - XDCR - backfill manager race conditions

    XMLWordPrintable

Details

    • Bug
    • Resolution: Unresolved
    • Critical
    • 7.1.7
    • Morpheus, 7.0.0, 7.0.1, 7.0.2, 7.0.3, 7.0.4, 7.1.4, 7.0.5, 7.1.0, 7.1.1, 7.1.2, 7.2.0, 7.1.3, 7.2.1
    • XDCR
    • Untriaged
    • 0
    • Unknown

    Description

      Race Condition 1

      Race condition is found via testing and running through the dev collection test suites

      The race happens when a spec is to be deleted at the same time backfillMgr or backfill handler is handling a spec change for committing a backfill request:

      2023-05-23T13:15:34.441-07:00 INFO GOXDCR.ReplMgr: Deleting replication 62a76551fb293525f2319b98585ac59b/B0/B1
      2023-05-23T13:15:34.441-07:00 ERRO GOXDCR.ReplMgr: requested resource not found
      2023-05-23T13:15:34.441-07:00 INFO GOXDCR.AdminPort: Finished doDeleteReplicationRequest
      2023-05-23T13:15:34.953-07:00 INFO GOXDCR.BackfillMgr: Backfill spec change callback for 62a76551fb293525f2319b98585ac59b/B1/B2 detected old  new vb: 325 (0,69]  vb: 850 (0,29]  vb: 227 (0,61]  vb: 368 (0,37]  vb: 824 (0,37]  vb: 989 (0,65]  vb: 808 (0,57]  vb: 765 (0,29]  vb: 327 (0,17]  vb: 154 (0,61]  vb: 106 (0,29]  vb: 29 (0,17]  vb: 972 (0,25]  vb: 702 (0,69]  vb: 1009 (0,61]  vb: 877 (0,29]  vb: 292 (0,37]  vb: 44 (0,21]  vb: 722 (0,69]  vb: 546 (0,13]  vb: 732 (0,69]  vb: 266 (0,65]
      

      As a result, the backfill request is launched but the replication spec is no longer there

      2023-05-23T13:15:34.953-07:00 INFO GOXDCR.BackfillReplSvc: BackfillReplication spec 62a76551fb293525f2319b98585ac59b/B1/B2 has been updated, rev=0x1005ddfa0
      2023-05-23T13:15:34.953-07:00 INFO GOXDCR.PipelineMgr: PipelineOpSerializer 62a76551fb293525f2319b98585ac59b/B1/B2 handling job: Job for 62a76551fb293525f2319b98585ac59b/B1/B2 Type: BackfillPipelineStart
      2023-05-23T13:15:34.953-07:00 ERRO GOXDCR.PipelineMgr: Invalid replication status 62a76551fb293525f2319b98585ac59b/B1/B2, failed to retrieve spec. err=requested resource not found
      2023-05-23T13:15:34.953-07:00 INFO GOXDCR.PipelineMgr: ReplicationStatus is created and set with 62a76551fb293525f2319b98585ac59b/B1/B2
      2023-05-23T13:15:34.953-07:00 ERRO GOXDCR.PipelineMgr: Invalid replication status 62a76551fb293525f2319b98585ac59b/B1/B2, failed to retrieve spec. err=requested resource not found
      2023-05-23T13:15:34.953-07:00 INFO GOXDCR.PipelineMgr: Pipeline updater 62a76551fb293525f2319b98585ac59b/B1/B2 is launched with retry_interval=10
      2023-05-23T13:15:34.953-07:00 ERRO GOXDCR.PipelineMgr: Invalid replication status 62a76551fb293525f2319b98585ac59b/B1/B2, failed to retrieve spec. err=requested resource not found
      2023-05-23T13:15:34.953-07:00 INFO GOXDCR.PipelineMgr: Replication status received startBackfill, current status=name={62a76551fb293525f2319b98585ac59b/B1/B2}, status={Pending}, errors={[]}, oldProgress={}, progress={}
      2023-05-23T13:15:34.953-07:00 INFO GOXDCR.PipelineMgr: PipelineOpSerializer 62a76551fb293525f2319b98585ac59b/B1/B2 done handling job: Job for 62a76551fb293525f2319b98585ac59b/B1/B2 Type: BackfillPipelineStart
      2023-05-23T13:15:34.953-07:00 INFO GOXDCR.PipelineMgr: Replication 62a76551fb293525f2319b98585ac59b/B1/B2's backfill Pipeline is starting
      2023-05-23T13:15:34.953-07:00 INFO GOXDCR.PipelineMgr: Starting the backfill pipeline 62a76551fb293525f2319b98585ac59b/B1/B2
      2023-05-23T13:15:34.953-07:00 INFO GOXDCR.PipelineMgr: Replication 62a76551fb293525f2319b98585ac59b/B1/B2 backfill start experienced error(s): pipelineMgr.ReplicationStatus(62a76551fb293525f2319b98585ac59b/B1/B2) : Error: Replication Status is missing when starting pipeline.. Scheduling a redo.
      2023-05-23T13:15:34.953-07:00 ERRO GOXDCR.PipelineMgr: Invalid replication status 62a76551fb293525f2319b98585ac59b/B1/B2, failed to retrieve spec. err=requested resource not found
      2023-05-23T13:15:34.953-07:00 INFO GOXDCR.PipelineMgr: Replication status is updated with backfill error(s) pipelineMgr.ReplicationStatus(62a76551fb293525f2319b98585ac59b/B1/B2) : Error: Replication Status is missing when starting pipeline., current status=name={62a76551fb293525f2319b98585ac59b/B1/B2}, status={Pending}, errors={[]}, oldProgress={}, progress={}
      2023-05-23T13:15:34.953-07:00 INFO GOXDCR.PipelineMgr: Backfill Replication 62a76551fb293525f2319b98585ac59b/B1/B2's status experienced changes or errors (<nil>). However, last update resulted in failure, so will reschedule a future update
      2023-05-23T13:15:34.953-07:00 INFO GOXDCR.PipelineMgr: Pipeline updater scheduled to start backfill pipeline in 10s
      2023-05-23T13:15:35.059-07:00 INFO GOXDCR.ThrSeqTrackSvc: backfill_62a76551fb293525f2319b98585ac59b/B1/B2_ThroughSeqnoTracker bg scanner: pipeline no longer running, stopping
      2023-05-23T13:15:39.174-07:00 INFO GOXDCR.ResourceMgr: Resource Manager State = <nil>
      2023-05-23T13:15:39.174-07:00 INFO GOXDCR.ResourceMgr: backlogCount=0, noBacklogCount=0 extraQuota=false cpuNotMaxedCount=0 throughputDropCount=0
      2023-05-23T13:15:39.174-07:00 INFO GOXDCR.ResourceMgr: DcpPriorityMap=map[]
      

      As a result, the backfill pipeline retry loops.
      Then when a spec gets recreated, an older version of backfill is started:

      2023-05-23T13:17:58.623-07:00 INFO GOXDCR.AdminPort: doCreateReplicationRequest
      2023-05-23T13:17:58.625-07:00 INFO GOXDCR.AdminPort: Request parameters: justValidate=false, fromBucket=B1, toCluster=C2, toBucket=B2, settings=map[CollectionsMgtMulti:{map[CollectionsMgtMulti:12] map[collectionsMigrationMode:true]} checkpoint_interval:60 colMappingRules:map[(country == "United States" OR country = "Canada") AND type="brewery":S3.col1 country != "United States" AND country != "Canada" AND type="brewery":S3.col2] stats_interval:500]
      

      This can lead to bad situations like missing data or misreplicated data:

      As an example, the following DCP nozzle message shows up for the previous backfill task, even though the latest spec is a migration mode spec:

      2023-05-23T13:18:00.098-07:00 WARN GOXDCR.DcpNozzle: getHighSeqno could not find collection ID for S1.col2
      2023-05-23T13:18:00.098-07:00 WARN GOXDCR.DcpNozzle: getHighSeqno could not find collection ID for S1.col4
      2023-05-23T13:18:00.098-07:00 WARN GOXDCR.DcpNozzle: getHighSeqno could not find collection ID for S1.col1
      2023-05-23T13:18:00.098-07:00 WARN GOXDCR.DcpNozzle: getHighSeqno could not find collection ID for S1.col3
      

      The new spec has no S1.col1/2/3/4 anywhere.

      RCA
      I couldn't reproduce it locally but from log and code analysis, the race happens between backfillMgr and backfillReplSpecSvc.
      When a replication is deleted, both of them are issued on medium priority. Ideally, backfillMgr would need to stop the handlers before the specSvc is notified. Otherwise, if a handler is slow and dangling, it would lead to a late and out-of-date write to the service.

       

      Race Condition 2

      There is another race condition hidden that are shown from these tests:

      Found the issue potentially due to the test set up.

      For XDCR test 3b, we perform a rollback to 0.

      3b_pause_resume_rollback0_tc.shlib:     setReplicationSettings "C1" "B0" "C2" "B1" "xdcrDevBackfillRollbackTo0VB=$VBToRollback"
      

      Leading to:

      2023-05-24T20:04:16.713-07:00 INFO GOXDCR.DcpNozzle: Dev injection for vb 1022 received start seqno 0... forcing a rollback to 0
      2023-05-24T20:04:16.713-07:00 INFO GOXDCR.CheckpointMgr: backfill_54b414a3621edf5b5ebde9d6d7f28b83/B0/B1 Received rollback from DCP stream vb=1022, rollbackseqno=0
      2023-05-24T20:04:16.713-07:00 INFO GOXDCR.PipelineMgr: PipelineOpSerializer 54b414a3621edf5b5ebde9d6d7f28b83/B0/B1 handling job: Job for 54b414a3621edf5b5ebde9d6d7f28b83/B0/B1 Type: BackfillPipelineStopWStoppedCb
      2023-05-24T20:04:16.713-07:00 INFO GOXDCR.PipelineMgr: pipeline 54b414a3621edf5b5ebde9d6d7f28b83/B0/B1 updater received backfill callback registration. Current backfill callback count: 1
      2023-05-24T20:04:16.713-07:00 [INFO] UPR_STREAMREQ for vb 891 successful
      

      This goes to:

      func (dcp *DcpNozzle) performRollback(vbno uint16, rollbackseq uint64) error {

      Leading to checkpoint manager:

      func (ckmgr *CheckpointManager) UpdateVBTimestamps(vbno uint16, rollbackseqno uint64) (*base.VBTimestamp, error) {
      	ckmgr.logger.Infof("%v Received rollback from DCP stream vb=%v, rollbackseqno=%v\n", ckmgr.pipeline.FullTopic(), vbno, rollbackseqno)
      if ckmgr.pipeline.Type() == common.BackfillPipeline {
      	ckmgr.clearBackfillStartingTsIfNeeded(vbno, rollbackseqno)
      }
      

      following v

      func (ckmgr *CheckpointManager) NotifyBackfillMgrRollbackTo0(vbno uint16) {
      backfillMgrPipelineSvc := ckmgr.pipeline.RuntimeContext().Service(base.BACKFILL_MGR_SVC)
       
      settingsMap := make(map[string]interface{})
      settingsMap[base.NameKey] = topic
      settingsMap[metadata.CollectionsVBRollbackTo0Key] = int(vbno)
      backfillMgrPipelineSvc.UpdateSettings(settingsMap)
      

      v

      vbno, exists3 := settings[metadata.CollectionsVBRollbackTo0Key].(int)
      if topicExists && exists3 && vbno >= 0 {
      	err := p.backfillMgr.HandleRollbackTo0ForVB(topic, uint16(vbno))
      	if err != nil {
      		errMap[metadata.CollectionsVBRollbackTo0Key] = err
      	}
      }
      

      v

      func (b *BackfillMgr) HandleRollbackTo0ForVB(topic string, vbno uint16) error {
       
      b.specReqHandlersMtx.RLock()
      handler := b.specToReqHandlerMap[topic]
      b.specReqHandlersMtx.RUnlock()
       
      if handler == nil {
      	// This should not happen
      	err := fmt.Errorf("Unable to find handler for spec %v", topic)
      	b.logger.Errorf(err.Error())
      	return err
      }
       
      <If the deletion actually happened in time, the above wouldn’t happen).
       
      // When rolling back to 0, we should delete checkpoints otherwise a non-0 resume leading to DCP rollback
      // means that it'll come back to this path yet again
      deleteCkptsWrapper := func() error {
      	return b.checkpointsSvc.DelCheckpointsDoc(common.ComposeFullTopic(topic, common.BackfillPipeline), vbno, "")
      }
       
      cb, errCb := handler.GetRollbackTo0VBSpecificBackfillCb(vbno, deleteCkptsWrapper)
      err := b.pipelineMgr.HaltBackfillWithCb(topic, cb, errCb, true)
      if err != nil {
      	b.logger.Errorf("Unable to request backfill pipeline to stop for %v : %v - vbno %v rollback to zero from DCP will trigger this process again", topic, err, vbno)
      	return err
      }
      b.logger.Infof("NEIL DEBUG topic %v requestBackfill called 1\n", topic)
      err = b.pipelineMgr.RequestBackfill(topic)
      if err != nil {
      	b.logger.Errorf("Unable to request backfill pipeline to start for %v : %v - may require manual restart of pipeline", topic, err)
      	return err
      }
      return nil
      
      

      2023-05-24T20:04:17.359-07:00 INFO GOXDCR.CheckpointMgr: Done with setting starting seqno for BackfillPipeline backfill_54b414a3621edf5b5ebde9d6d7f28b83/B0/B1
      

      At around this time the replication is deleted

      2023-05-24T20:04:21.442-07:00 INFO GOXDCR.ReplMgr: Deleting replication 54b414a3621edf5b5ebde9d6d7f28b83/B0/B1
      

      We got an error because it was stopping:”

      2023-05-24T20:04:26.716-07:00 ERRO GOXDCR.BackfillMgr: 54b414a3621edf5b5ebde9d6d7f28b83/B0/B1 Unable to send rollbackTo0 request due to BackfillReqHandler is stopping
      

      But because of the pipeline manager job-like design with the serializer in the beginning, then the updater channel, it will try to restart the pipeline:

      2023-05-24T20:04:26.917-07:00 INFO GOXDCR.PipelineMgr: Replication 54b414a3621edf5b5ebde9d6d7f28b83/B0/B1 backfill start experienced error(s): pipelineMgr.ReplicationStatus(54b414a3621edf5b5ebde9d6d7f28b83/B0/B1) : Error: Replication Status is missing when starting pipeline.. Scheduling a redo.
      


      We should just watch for deleted specs and don’t start pipelines in these cases.

      Race condition 3

      We also have another race here where pipeline manager could be restarted after a spec has been deleted.
      This is confirmed via a matching between starting and shutting down:

      $ grep -e "is launched with retry_interval" -e "status is finished shutting down" goxdcr.log
      2023-05-25T10:19:32.507-07:00 INFO GOXDCR.PipelineMgr: Pipeline updater db0f8c5d7c24cf16224a1ee2a04e352b/B0/B1 is launched with retry_interval=10
      2023-05-25T10:23:16.010-07:00 INFO GOXDCR.PipelineMgr: Replication db0f8c5d7c24cf16224a1ee2a04e352b/B0/B1 status is finished shutting down
      2023-05-25T10:23:16.011-07:00 INFO GOXDCR.PipelineMgr: Pipeline updater db0f8c5d7c24cf16224a1ee2a04e352b/B0/B1 is launched with retry_interval=10
      2023-05-25T10:25:27.295-07:00 INFO GOXDCR.PipelineMgr: Pipeline updater db0f8c5d7c24cf16224a1ee2a04e352b/B1/B2 is launched with retry_interval=10
      2023-05-25T10:25:41.696-07:00 INFO GOXDCR.PipelineMgr: Replication db0f8c5d7c24cf16224a1ee2a04e352b/B1/B2 status is finished shutting down

      First replication is deleted:

      2023-05-25T10:23:10.467-07:00 INFO GOXDCR.ReplMgr: Pipeline db0f8c5d7c24cf16224a1ee2a04e352b/B0/B1 is deleted
      2023-05-25T10:23:10.467-07:00 INFO GOXDCR.AdminPort: Finished doDeleteReplicationRequest
      

      Then we see a backfill pipeline start triggering a new and zombie version of the updater:

      2023-05-25T10:23:16.010-07:00 INFO GOXDCR.PipelineMgr: PipelineOpSerializer db0f8c5d7c24cf16224a1ee2a04e352b/B0/B1 done handling job: Job for db0f8c5d7c24cf16224a1ee2a04e352b/B0/B1 Type: PipelineDeletion
      2023-05-25T10:23:16.010-07:00 INFO GOXDCR.PipelineMgr: PipelineOpSerializer db0f8c5d7c24cf16224a1ee2a04e352b/B0/B1 handling job: Job for db0f8c5d7c24cf16224a1ee2a04e352b/B0/B1 Type: BackfillPipelineStart
      2023-05-25T10:23:16.010-07:00 ERRO GOXDCR.PipelineMgr: Invalid replication status db0f8c5d7c24cf16224a1ee2a04e352b/B0/B1, failed to retrieve spec. err=requested resource not found
      2023-05-25T10:23:16.010-07:00 INFO GOXDCR.PipelineMgr: ReplicationStatus is created and set with db0f8c5d7c24cf16224a1ee2a04e352b/B0/B1
      

      Attachments

        Issue Links

          No reviews matched the request. Check your Options in the drop-down menu of this sections header.

          Activity

            People

              neil.huang Neil Huang
              neil.huang Neil Huang
              Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

                Created:
                Updated:

                Gerrit Reviews

                  There are no open Gerrit changes

                  PagerDuty