Description
A node is added to a cluster (i.e. rebalanced in)
2023-02-03T00:51:44.612Z INFO GOXDCR.ReplMgr: GOMAXPROCS=2
The backfill manager shows an error:
2023-02-03T00:51:44.878Z ERRO GOXDCR.BackfillMgr: Retrieving manifest for spec 45e703987f4caca84e5aaf3a3c880eb9/B1/B2 returned metakv manifests Has not been loaded yet
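This error is returned when the following is called before the manifests have been loaded from metakv:
func (a *CollectionsManifestAgent) GetLastPersistedManifests() (*metadata.CollectionsManifestPair, error)
As a result, the “last cached” manifests that the following function is meant to populate remain empty: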
func (b *BackfillMgr) retrieveLastPersistedManifest(spec *metadata.ReplicationSpecification) error {
	manifestPair, err := b.collectionsManifestSvc.GetLastPersistedManifests(spec)
	if err != nil {
		return err
	}
	b.cacheMtx.Lock()
	b.logger.Infof("Backfill Manager for replication %v received last persisted manifests of %v and %v",
		spec.Id, manifestPair.Source, manifestPair.Target)
	b.cacheSpecSourceMap[spec.Id] = manifestPair.Source
	b.cacheSpecLastSuccessfulManifestId[spec.Id] = manifestPair.Source.Uid()
	b.cacheSpecTargetMap[spec.Id] = manifestPair.Target
	b.cacheMtx.Unlock()
	return nil
}
So instead of the “received last persisted manifests” log message, we get the error:
2023-02-03T00:51:44.878Z ERRO GOXDCR.BackfillMgr: Retrieving manifest for spec 45e703987f4caca84e5aaf3a3c880eb9/B1/B2 returned metakv manifests Has not been loaded yet
and the cache is filled with the default manifest.
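The failure mode above can be sketched in isolation. This is only an illustration, not the goxdcr code: the types `manifest`, `agent` and `backfillCache`, the source manifest UID of 4, and the assumption that the default manifest has UID 0 are all hypothetical stand-ins chosen for the example.

```go
package main

import (
	"errors"
	"fmt"
)

// manifest is a hypothetical stand-in for a collections manifest; only the UID matters here.
type manifest struct{ uid uint64 }

// defaultManifest models the default manifest (assumed here to carry UID 0).
var defaultManifest = manifest{uid: 0}

var errNotLoaded = errors.New("metakv manifests Has not been loaded yet")

// agent is a hypothetical stand-in for the manifest agent on a freshly rebalanced-in node.
type agent struct {
	loaded    bool     // false until the persisted manifests have been read
	persisted manifest // what is actually stored
}

// lastPersisted fails while the persisted manifests have not been loaded.
func (a *agent) lastPersisted() (manifest, error) {
	if !a.loaded {
		return manifest{}, errNotLoaded
	}
	return a.persisted, nil
}

// backfillCache models the per-spec source manifest baseline.
type backfillCache struct {
	source map[string]manifest
}

// initSpec mimics the behaviour described in this ticket: on error the
// baseline silently becomes the default manifest.
func (c *backfillCache) initSpec(specID string, a *agent) error {
	m, err := a.lastPersisted()
	if err != nil {
		c.source[specID] = defaultManifest
		return err
	}
	c.source[specID] = m
	return nil
}

func main() {
	a := &agent{persisted: manifest{uid: 4}} // hypothetical persisted source manifest
	c := &backfillCache{source: map[string]manifest{}}

	err := c.initSpec("spec", a) // called too early: baseline is the default manifest
	fmt.Println(err, c.source["spec"].uid)

	a.loaded = true
	err = c.initSpec("spec", a) // called after loading: baseline is the real manifest
	fmt.Println(err, c.source["spec"].uid)
}
```

The point of the sketch is the ordering: whichever baseline is in the cache when the target manifest later changes decides whether a backfill is considered.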
Later, when the target manifest has been updated and a backfill should have been created, the following is logged instead:
2023-02-03T00:54:34.889Z INFO GOXDCR.CollectionsManifestSvc: 45e703987f4caca84e5aaf3a3c880eb9/B1/B2 - Updated target manifest from old version 4 to new version 8
2023-02-03T00:54:34.889Z INFO GOXDCR.BackfillMgr: Repl 45e703987f4caca84e5aaf3a3c880eb9/B1/B2 shows default source manifest, and not under explicit nor migration mode, thus no backfill would be created
Because the baseline was set to the default manifest rather than the “current” source manifest (as of the time the node was rebalanced in), the backfill is missed, and thus data is missed.
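The check implied by the BackfillMgr log line can be reduced to a small predicate. This is a hypothetical reduction written from the log message only, not the actual goxdcr logic: `replMode`, `skipsBackfill`, and the assumption that a default source manifest is identified by UID 0 are all made up for illustration.

```go
package main

import "fmt"

// replMode is a hypothetical enumeration of the replication modes named in the log line.
type replMode int

const (
	implicitMode replMode = iota
	explicitMode
	migrationMode
)

// skipsBackfill reports whether the backfill is skipped outright:
// the cached source manifest is the default one (assumed UID 0) and the
// replication is under neither explicit nor migration mode.
func skipsBackfill(cachedSourceUID uint64, mode replMode) bool {
	isDefaultSource := cachedSourceUID == 0
	return isDefaultSource && mode != explicitMode && mode != migrationMode
}

func main() {
	// Baseline wrongly left at the default manifest: backfill is skipped.
	fmt.Println(skipsBackfill(0, implicitMode))
	// Baseline correctly set to a non-default source manifest: not skipped by this check.
	fmt.Println(skipsBackfill(4, implicitMode))
}
```

Under these assumptions, a node that cached the default manifest at rebalance-in time takes the first branch even though the real source manifest already contains the collection.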
To reproduce:
- Create a 1-node source cluster and a 1-node target cluster
- Create a source bucket with a collection, and a target bucket without the matching collection
- Create a replication
- Rebalance 1 node into the source to make it a 2-node source cluster
- Rebalance 1 node into the target to make it a 2-node target cluster
- Load data into the source bucket collection that the target bucket is missing, to create a mismatch
- Create the missing target collection so that the collections now match
- Observe that the backfill does not occur even after the latest target manifest has been pulled by the source
The workaround without this fix is to write a single mutation to the source bucket to trigger a backfill. However, if no such mutation takes place such that the system sits idle, the backfill will not take place.
This is caused by peer-to-peer (p2p) checkpoint pulling: starting the pipeline from pulled checkpoints causes the beginning of the vbuckets to be missed. In 7.0.x, p2p did not exist, so the problem did not show up there.
In live production, it is unlikely for a bucket to sit idle with no mutations, so I suspect customers are not seeing the problem widely. But this is still a missing data scenario and needs to be fixed.
Issue Links
- backports to MB-57382 [BP 7.2.2] - XDCR - rebalanced-in source node could miss raising backfills (Closed)