[XDCR][BP 7.2.5] - Backfill mutations stuck at non zero value

Description

Backfill mutations are stuck at 31.8M mutations remaining when the target bucket is deleted while the backfill mutation count is high. The mutations-remaining counter does not move and stays at the same non-zero value. There were also some target topology changes while the replication was in progress.

Steps to reproduce:

  1. Create 2 on-prem clusters of 2 nodes each

  2. Create a test bucket on each cluster

  3. After adding a remote cluster reference from the source to the target cluster, set up replication between the clusters with the scopes-and-collections mapping set to a non-default scope

  4. Load documents into the default scope and default collection; these will not be replicated yet, since the mapping only covers the non-default scope

  5. Edit the replication and add the default scope and collection to the scopes-and-collections mapping

  6. Once the documents on source and target are in sync, delete the target bucket, then recreate the bucket and the replication with the same name and settings

  7. Backfill mutations are stuck, even though the number of documents on source and target is the same.

Snippet from the logs:

# HELP xdcr_changes_left_total Given the vBuckets of this node, the number of sequence numbers that need to be processed (either replicated or handled) before catching up to the high sequence numbers for the vBuckets.
# TYPE xdcr_changes_left_total gauge
xdcr_changes_left_total{targetClusterUUID="da69a312ce9421936ede6d891380ab3b", sourceBucketName="test", targetBucketName="test", pipelineType="Backfill"} 11884587
xdcr_changes_left_total{targetClusterUUID="da69a312ce9421936ede6d891380ab3b", sourceBucketName="test", targetBucketName="test", pipelineType="Main"} 0

Logs (All nodes on 7.2.5-7576)

Source (172.23.105.4, 172.23.96.197): 
https://cb-engineering.s3.amazonaws.com/MB-60859_src/collectinfo-2024-04-01T131500-ns_1%40172.23.105.4.zip
https://cb-engineering.s3.amazonaws.com/MB-60859_src/collectinfo-2024-04-01T131500-ns_1%40172.23.96.197.zip

Target (172.23.105.195, 172.23.96.183):

https://cb-engineering.s3.amazonaws.com/MB-60859_dest/collectinfo-2024-04-01T131652-ns_1%40172.23.105.195.zip
https://cb-engineering.s3.amazonaws.com/MB-60859_dest/collectinfo-2024-04-01T131652-ns_1%40172.23.96.183.zip

Components

Fix versions

Labels

Environment

None

Link to Log File, atop/blg, CBCollectInfo, Core dump

None

Release Notes Description

None

Attachments

2
  • 01 Apr 2024, 07:47 PM
  • 01 Apr 2024, 01:29 PM

Activity


Ayush Nayyar April 3, 2024 at 7:34 AM

Verified on 7.2.5-7585.

CB robot April 2, 2024 at 6:23 PM

Build couchbase-server-7.2.5-7585 contains goxdcr commit 83f200d with commit message:
MB-61359 (https://couchbasecloud.atlassian.net/browse/MB-61359): Reset backfill stats in StopPipeline, if replication spec is deleted or recreated

Sumukh Bhat April 2, 2024 at 6:21 AM
Edited

It seems that as part of replication spec GC (triggered since the target bucket was deleted), we eventually call `StopPipeline`, which deletes the main pipeline's checkpoints and resets its stats when the replication spec has been deleted/recreated, but does not do the same for the backfill pipeline:

```go
func (pipelineMgr *PipelineManager) StopPipeline(rep_status pipeline.ReplicationStatusIface) base.ErrorMap {
	...
	// if replication spec has been deleted
	// or deleted and recreated, which is signaled by change in spec internal id
	// perform clean up
	spec, _ := pipelineMgr.repl_spec_svc.ReplicationSpec(replId)
	if spec == nil || (rep_status.GetSpecInternalId() != "" && rep_status.GetSpecInternalId() != spec.InternalId) {
		if spec == nil {
			pipelineMgr.logger.Infof("%v Cleaning up replication status since repl spec has been deleted.\n", replId)
		} else {
			pipelineMgr.logger.Infof("%v Cleaning up replication status since repl spec has been deleted and recreated. oldSpecInternalId=%v, newSpecInternalId=%v\n", replId, rep_status.GetSpecInternalId(), spec.InternalId)
		}
		pipelineMgr.checkpoint_svc.DelCheckpointsDocs(replId)
		rep_status.ResetStorage(common.MainPipeline)
		pipelineMgr.repl_spec_svc.SetDerivedObj(replId, nil)
		// close the connection pool for the replication
		pools := base.ConnPoolMgr().FindPoolNamesByPrefix(replId)
		for _, poolName := range pools {
			base.ConnPoolMgr().RemovePool(poolName)
		}
	}
```

`StopPipeline` calls `StopBackfillPipeline`, but `StopBackfillPipeline` differs from `StopPipeline` in that it requires an explicit call to `CleanBackfillPipeline` to clean the backfill pipeline's checkpoints and reset its stats. Checkpoints for the backfill pipeline are deleted in `postDeleteBackfillRepl` as part of the replication spec deletion callback, but the stats are never reset.

Note that this bug exists on master as well and is not a regression in 7.2.5.

Sumukh Bhat April 2, 2024 at 4:15 AM
Edited

Update: I was able to reproduce this locally with neo HEAD. If we delete a bucket that has an ongoing backfill replication, then recreate the bucket with the same name and recreate the replication, the backfill stats appear to be carried over from the old replication.

Ayush Nayyar April 1, 2024 at 7:47 PM

I was able to replicate this in 7.2.5-7572 as well.

Fixed

Details

Assignee

Reporter

Is this a Regression?

No

Triage

Untriaged

Story Points

Priority

Created April 1, 2024 at 1:36 PM
Updated September 17, 2024 at 3:05 PM
Resolved April 2, 2024 at 6:09 PM