  1. Couchbase Server
  2. MB-60448

XDCR - Negative changes_left for a paused replication when goxdcr is killed and respawned

Details

    • Untriaged
    • 0
    • Unknown

    Description

      Consider the two kv_vb_map variants used to calculate the stats for a paused replication in UpdateStats(...):

      A. cur_kv_vb_map, calculated as: 

      cur_kv_vb_map := notification.GetKvVbMapRO() 

      B. sourceVBMap, calculated as: 

      sourceVBMap = highSeqnoFeedNotification.GetSourceVBMapRO()

      Say there are N KV nodes in the source cluster and, for the sake of simplicity, assume that each of the N nodes holds T documents (total_docs) and has processed P documents (docs_processed).

      The difference between the maps is that:

      (A) contains all the N nodes and stats calculated using this will be stats aggregated across the cluster level

      AND

      (B) contains only 1 node (the current node) in its map i.e. the stats calculated using this will be the stats for itself only.
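
To make the difference concrete, here is a minimal sketch of the two map shapes, assuming a 3-node source cluster with 1024 vBuckets. The node addresses, the vBucket layout, and the helpers `vbRange` and `countVBs` are hypothetical; only the `map[string][]uint16` shape is taken from the `curKvVbMapRo` parameter of constructStatsForReplication.

```go
package main

import "fmt"

// kvVbMap mirrors the map[string][]uint16 shape: KV node address -> vBuckets it owns.
type kvVbMap map[string][]uint16

// vbRange returns the vBucket numbers in [from, to). Hypothetical helper.
func vbRange(from, to uint16) []uint16 {
	vbs := make([]uint16, 0, to-from)
	for vb := from; vb < to; vb++ {
		vbs = append(vbs, vb)
	}
	return vbs
}

// countVBs returns how many vBuckets a map covers in total.
func countVBs(m kvVbMap) int {
	total := 0
	for _, vbs := range m {
		total += len(vbs)
	}
	return total
}

func main() {
	// (A): cluster-wide view, one entry per KV node (addresses are made up).
	clusterMap := kvVbMap{
		"node1:11210": vbRange(0, 342),
		"node2:11210": vbRange(342, 683),
		"node3:11210": vbRange(683, 1024),
	}
	// (B): local view, only the current node's entry.
	localMap := kvVbMap{"node1:11210": clusterMap["node1:11210"]}

	fmt.Printf("(A) nodes=%d vbs=%d\n", len(clusterMap), countVBs(clusterMap))
	fmt.Printf("(B) nodes=%d vbs=%d\n", len(localMap), countVBs(localMap))
}
```

Any stat aggregated over (A) covers every vBucket in the bucket, while the same stat aggregated over (B) covers only the local node's share.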

      When we hit the following code path, we use (A):

      func UpdateStats(checkpoints_svc service_def.CheckpointsService, logger *log.CommonLogger, remoteClusterSvc service_def.RemoteClusterSvc, backfillReplSvc service_def.BackfillReplSvc, bucketTopologySvc service_def.BucketTopologySvc, repStatusMapGetter func() map[string]pipeline_pkg.ReplicationStatusIface) {
          ...
          if overview_stats == nil {
              // overview stats may be nil the first time GetStats is called on a paused replication that has never been run in the current goxdcr session
              // or it may be nil when the underying replication is not paused but has not completed startup process
              // construct it
              err := constructStatsForReplication(repl_status, spec, cur_kv_vb_map, checkpoints_svc, logger, backfillSpec, highSeqnoAndSourceVBGetter)
              ...
          }
          ...
      }

      constructStatsForReplication calculates the following:

      1. total_docs: the high seqnos (fetched from KV) summed over all the nodes in (B) = 1*T. For example, on a setup with 3 KV nodes, this bug produces the following warnings:

        2024-01-16T09:30:54.661Z WARN GOXDCR.ReplMgr: Server SD-1AED-B1A6.eur.nsroot.net:11210 not found in high seqnoMap map[SD-CF28-D3F8.eur.nsroot.net:11210:0xc0003e8078]
        2024-01-16T09:30:54.661Z WARN GOXDCR.ReplMgr: Server SD-FD50-5D74.eur.nsroot.net:11210 not found in high seqnoMap map[SD-CF28-D3F8.eur.nsroot.net:11210:0xc0003e8078]

      2. docs_processed: the seqnos from the checkpoints of all the VBs of the nodes in (A), summed = N*P

      3. changes_left = total_docs - docs_processed = 1*T - N*P, which goes negative whenever N*P > T.
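
The arithmetic in the three steps above can be sketched as follows. This is an illustration only, not the goxdcr code: the function `changesLeft` and the values T = 1000 and P = 900 are made up, and every node is assumed to hold the same T and P, as in the simplification used earlier.

```go
package main

import "fmt"

// changesLeft reproduces the arithmetic from the description:
// total_docs is summed over the nodes present in the high-seqno map,
// docs_processed over the nodes present in the kv_vb_map.
// t and p are the per-node total_docs and docs_processed (hypothetical values).
func changesLeft(nodesInHighSeqnoMap, nodesInKvVbMap int, t, p int64) int64 {
	totalDocs := int64(nodesInHighSeqnoMap) * t
	docsProcessed := int64(nodesInKvVbMap) * p
	return totalDocs - docsProcessed
}

func main() {
	const t, p = 1000, 900

	// Mixed scopes: high seqnos for 1 node, checkpoints for N = 3 nodes.
	fmt.Println(changesLeft(1, 3, t, p)) // 1*1000 - 3*900 = -1700

	// Consistent scope: both quantities cover the current node only.
	fmt.Println(changesLeft(1, 1, t, p)) // 1000 - 900 = 100
}
```

With mixed scopes the result is negative as soon as N*P exceeds T, so a replication that was close to caught up before the restart is the most likely to show a negative changes_left.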

      The fix is to make this entire code path use (B), so that every overview stat is calculated consistently for the current node only.
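
One way to picture a calculation that is scoped to (B) throughout is the sketch below. It is a hedged illustration under stated assumptions, not the actual patch: `nodeChangesLeft`, the shape of `highSeqnos` (node -> vBucket -> high seqno), and the shape of `ckptSeqnos` (vBucket -> checkpointed seqno) are all hypothetical stand-ins for whatever the real services return.

```go
package main

import "fmt"

// nodeChangesLeft walks only the vBuckets listed in sourceVBMap, so that
// total_docs and docs_processed are accumulated over the same vBucket set.
// All parameter shapes are assumptions made for this sketch.
func nodeChangesLeft(
	sourceVBMap map[string][]uint16,
	highSeqnos map[string]map[uint16]uint64,
	ckptSeqnos map[uint16]uint64,
) int64 {
	var totalDocs, docsProcessed uint64
	for node, vbs := range sourceVBMap {
		for _, vb := range vbs {
			totalDocs += highSeqnos[node][vb]
			docsProcessed += ckptSeqnos[vb]
		}
	}
	return int64(totalDocs) - int64(docsProcessed)
}

func main() {
	// The current node owns vBuckets 0 and 1 (made-up layout).
	sourceVBMap := map[string][]uint16{"node1:11210": {0, 1}}
	highSeqnos := map[string]map[uint16]uint64{
		"node1:11210": {0: 100, 1: 200},
	}
	// Checkpoints exist for every vBucket in the cluster, including
	// vBuckets 2 and 3, which belong to other nodes and must be ignored.
	ckptSeqnos := map[uint16]uint64{0: 90, 1: 150, 2: 300, 3: 400}

	fmt.Println(nodeChangesLeft(sourceVBMap, highSeqnos, ckptSeqnos)) // (100+200) - (90+150) = 60
}
```

Because the checkpoints of vBuckets owned by other nodes never enter the sum, the result cannot be dragged below zero by cluster-wide progress.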


      Additionally, in this same code path we always read the overview_stats from the main pipeline, but sometimes end up storing them in the backfill pipeline's stats store:

      func UpdateStats(checkpoints_svc service_def.CheckpointsService, logger *log.CommonLogger, remoteClusterSvc service_def.RemoteClusterSvc, backfillReplSvc service_def.BackfillReplSvc, bucketTopologySvc service_def.BucketTopologySvc, repStatusMapGetter func() map[string]pipeline_pkg.ReplicationStatusIface) {
          ...
          for repl_id, repl_status := range repStatusMapGetter() {
              overview_stats := repl_status.GetOverviewStats(common.MainPipeline)
              ...
              if overview_stats == nil {
                  // overview stats may be nil the first time GetStats is called on a paused replication that has never been run in the current goxdcr session
                  // or it may be nil when the underying replication is not paused but has not completed startup process
                  // construct it
                  err := constructStatsForReplication(repl_status, spec, cur_kv_vb_map, checkpoints_svc, logger, backfillSpec, highSeqnoAndSourceVBGetter)
                  ...
              }
          }
          ...
      }
       
      func constructStatsForReplication(repl_status pipeline_pkg.ReplicationStatusIface, spec *metadata.ReplicationSpecification, curKvVbMapRo map[string][]uint16, checkpoints_svc service_def.CheckpointsService, logger *log.CommonLogger, backfillSpec *metadata.BackfillReplicationSpec, highSeqnosMapGetter func() (base.HighSeqnosMapType, base.KvVBMapType, func())) error {
          ...
          if backfillSpec != nil {
              repl_status.SetOverviewStats(overview_stats, common.BackfillPipeline)
          } else {
              repl_status.SetOverviewStats(overview_stats, common.MainPipeline)
          }
          ...
      }

      This may need revisiting as well.
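
A possible consequence of this read/write mismatch can be sketched with a toy stats store. Everything here (`replStatus`, `updateOnce`, the pipeline constants) is hypothetical and models only the pattern visible in the excerpts above; the real code may populate the main pipeline's stats through other paths.

```go
package main

import "fmt"

type pipelineType int

const (
	mainPipeline pipelineType = iota
	backfillPipeline
)

// replStatus is a toy stand-in for a per-replication overview stats store.
type replStatus struct {
	overview map[pipelineType]map[string]int64
}

func (r *replStatus) getOverviewStats(p pipelineType) map[string]int64 {
	return r.overview[p] // reading a nil map yields nil
}

func (r *replStatus) setOverviewStats(s map[string]int64, p pipelineType) {
	if r.overview == nil {
		r.overview = make(map[pipelineType]map[string]int64)
	}
	r.overview[p] = s
}

// updateOnce mimics the pattern: always read the main slot, and if it is nil,
// construct stats and store them in the slot selected by hasBackfillSpec.
// It reports whether the stats had to be constructed.
func updateOnce(r *replStatus, hasBackfillSpec bool) bool {
	if r.getOverviewStats(mainPipeline) != nil {
		return false
	}
	stats := map[string]int64{"changes_left": 0}
	if hasBackfillSpec {
		r.setOverviewStats(stats, backfillPipeline)
	} else {
		r.setOverviewStats(stats, mainPipeline)
	}
	return true
}

func main() {
	withBackfill := &replStatus{}
	fmt.Println(updateOnce(withBackfill, true), updateOnce(withBackfill, true)) // true true

	noBackfill := &replStatus{}
	fmt.Println(updateOnce(noBackfill, false), updateOnce(noBackfill, false)) // true false
}
```

In this model, a replication with a backfill spec never fills the main slot, so the nil check keeps succeeding and the stats are reconstructed on every call.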

            People

              ayush.nayyar Ayush Nayyar
              sumukh.bhat Sumukh Bhat