Couchbase Server / MB-55412

XDCR - rebalanced-in source node could miss raising backfills

Details

    • Bug
    • Resolution: Fixed
    • Critical
    • Morpheus
    • 7.1.4, 7.1.0, 7.1.1, 7.1.2, 7.2.0, 7.1.3
    • XDCR
    • None
    • Untriaged
    • 0
    • No

    Description

      A node is added to a cluster (i.e., rebalanced in):

      2023-02-03T00:51:44.612Z INFO GOXDCR.ReplMgr: GOMAXPROCS=2
      

      The backfill manager shows an error:

      2023-02-03T00:51:44.878Z ERRO GOXDCR.BackfillMgr: Retrieving manifest for spec 45e703987f4caca84e5aaf3a3c880eb9/B1/B2 returned metakv manifests Has not been loaded yet
      

      This error appears when the following function is called:

      func (a *CollectionsManifestAgent) GetLastPersistedManifests() (*metadata.CollectionsManifestPair, error) {
      

      As such, the "last cached" manifests set by the following function will be empty:

      https://github.com/couchbase/goxdcr/blob/5a79dd808cdbf17100f9ab17853e60f4f3440a0c/backfill_manager/backfill_manager.go#L490-L503

      func (b *BackfillMgr) retrieveLastPersistedManifest(spec *metadata.ReplicationSpecification) error {
      	manifestPair, err := b.collectionsManifestSvc.GetLastPersistedManifests(spec)
      	if err != nil {
      		return err
      	}
      	b.cacheMtx.Lock()
      	b.logger.Infof("Backfill Manager for replication %v received last persisted manifests of %v and %v",
      		spec.Id, manifestPair.Source, manifestPair.Target)
      	b.cacheSpecSourceMap[spec.Id] = manifestPair.Source
      	b.cacheSpecLastSuccessfulManifestId[spec.Id] = manifestPair.Source.Uid()
      	b.cacheSpecTargetMap[spec.Id] = manifestPair.Target
      	b.cacheMtx.Unlock()
      	return nil
      }
      

      So instead of the "received last persisted manifests" log message, we get the error:

      2023-02-03T00:51:44.878Z ERRO GOXDCR.BackfillMgr: Retrieving manifest for spec 45e703987f4caca84e5aaf3a3c880eb9/B1/B2 returned metakv manifests Has not been loaded yet
      

      and the cache is filled with the default manifest.
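
The failure path described above can be illustrated with a minimal, self-contained sketch. It is not the goxdcr implementation: `Manifest`, `loadPersisted`, and `baselineFor` are hypothetical names, and treating UID 0 as the default manifest is an assumption made for the example.

```go
package main

import (
	"errors"
	"fmt"
)

// Manifest is a hypothetical stand-in for a collections manifest.
// Only the UID matters for this illustration.
type Manifest struct{ UID uint64 }

// errNotLoaded mimics the "Has not been loaded yet" condition in the log.
var errNotLoaded = errors.New("metakv manifests Has not been loaded yet")

// loadPersisted is a hypothetical loader that fails until the persisted
// manifests are ready to be read.
func loadPersisted(ready bool, persisted Manifest) (Manifest, error) {
	if !ready {
		return Manifest{}, errNotLoaded
	}
	return persisted, nil
}

// baselineFor returns the manifest that ends up cached as the baseline.
// On a load error it falls back to the default manifest (assumed UID 0),
// which is the situation this issue describes.
func baselineFor(ready bool, persisted Manifest) (Manifest, error) {
	m, err := loadPersisted(ready, persisted)
	if err != nil {
		return Manifest{UID: 0}, err
	}
	return m, nil
}

func main() {
	// Freshly rebalanced-in node: manifests are not loaded yet.
	b, err := baselineFor(false, Manifest{UID: 4})
	fmt.Println("baseline UID:", b.UID, "err:", err)

	// Same call once the persisted manifests are available.
	b, err = baselineFor(true, Manifest{UID: 4})
	fmt.Println("baseline UID:", b.UID, "err:", err)
}
```

The point of the sketch is only that a load error silently changes which manifest becomes the baseline.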

      Later, when the target manifest has been updated and a backfill should have been created, it logs the following instead:

      2023-02-03T00:54:34.889Z INFO GOXDCR.CollectionsManifestSvc: 45e703987f4caca84e5aaf3a3c880eb9/B1/B2 -  Updated target manifest from old version 4 to new version 8
      2023-02-03T00:54:34.889Z INFO GOXDCR.BackfillMgr: Repl 45e703987f4caca84e5aaf3a3c880eb9/B1/B2 shows default source manifest, and not under explicit nor migration mode, thus no backfill would be created
      

      Because the node did not set the source manifest that was current when it was rebalanced in as the baseline, and used the default manifest instead, the backfill is missed and the corresponding data is never replicated.
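
To make the effect of the wrong baseline concrete, here is a small sketch of the kind of check implied by the log message above. It is a simplified assumption, not the actual BackfillMgr logic: `SourceManifest`, `collectionsToBackfill`, the collection names, and the UID 0 default are all hypothetical.

```go
package main

import "fmt"

// SourceManifest is a hypothetical, simplified source manifest.
type SourceManifest struct {
	UID         uint64
	Collections []string
}

// isDefault assumes, for illustration only, that UID 0 marks the default manifest.
func (m SourceManifest) isDefault() bool { return m.UID == 0 }

// collectionsToBackfill returns the newly created target collections that
// already exist in the baseline source manifest and therefore need a backfill.
// With a default baseline and no explicit or migration mode, nothing is raised.
func collectionsToBackfill(baseline SourceManifest, explicitOrMigration bool, newTarget []string) []string {
	if baseline.isDefault() && !explicitOrMigration {
		return nil
	}
	known := make(map[string]bool, len(baseline.Collections))
	for _, c := range baseline.Collections {
		known[c] = true
	}
	var out []string
	for _, c := range newTarget {
		if known[c] {
			out = append(out, c)
		}
	}
	return out
}

func main() {
	newTarget := []string{"S1.col1"}

	// Baseline wrongly left at the default manifest: no backfill is raised.
	wrong := SourceManifest{UID: 0}
	fmt.Println(len(collectionsToBackfill(wrong, false, newTarget)))

	// Baseline set to the manifest current at rebalance-in: backfill is raised.
	right := SourceManifest{UID: 4, Collections: []string{"S1.col1"}}
	fmt.Println(collectionsToBackfill(right, false, newTarget))
}
```

With the correct baseline, the newly matched collection is picked up; with the default baseline, the same target manifest update produces nothing.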

      To reproduce:

      1. Create a 1-node source cluster and a 1-node target cluster.
      2. Create a source bucket with a collection, and a target bucket that lacks the matching collection.
      3. Create the replication.
      4. Rebalance 1 node into the source to make it a 2-node source cluster.
      5. Rebalance 1 node into the target to make it a 2-node target cluster.
      6. Load data into the source bucket collection that the target bucket is missing, creating a mismatch.
      7. Create the missing target collection so that the collections match.
      8. Backfill does not occur, even after the latest target manifest has been pulled by the source.

       
      The workaround without this fix is to write a single mutation to the source bucket to trigger a backfill. However, if no such mutation takes place and the system sits idle, the backfill will not take place.
      This is caused by p2p: starting the pipeline by pulling checkpoints causes the beginning of the vbuckets to be missed. p2p did not exist in 7.0.x, so the problem did not show up there.

      In live production, it is unlikely for a bucket to sit idle with no mutations, so I suspect customers are not seeing the problem widely. However, this is still a missing-data scenario and needs to be fixed.

            People

              ayush.nayyar Ayush Nayyar
              neil.huang Neil Huang