[XDCR][BP 7.2.5] - Backfill mutations stuck at non-zero value
Description
Components
Fix versions
Labels
Environment
Link to Log File, atop/blg, CBCollectInfo, Core dump
Release Notes Description
Attachments
- 01 Apr 2024, 07:47 PM
- 01 Apr 2024, 01:29 PM
is a backport of
Activity
Ayush Nayyar April 3, 2024 at 7:34 AM
Verified on 7.2.5-7585.
CB robot April 2, 2024 at 6:23 PM
Build couchbase-server-7.2.5-7585 contains goxdcr commit 83f200d with commit message:
MB-61359: Reset backfill stats in StopPipeline, if replication spec is deleted or recreated
Sumukh Bhat April 2, 2024 at 6:21 AM (Edited)
It seems that as part of replication spec GC (since the target bucket was deleted), we eventually call `StopPipeline`, which deletes the main pipeline's checkpoints and resets its stats when the replication spec was deleted or recreated, but does not do the same for the backfill pipeline:
func (pipelineMgr *PipelineManager) StopPipeline(rep_status pipeline.ReplicationStatusIface) base.ErrorMap {
	...
	// if replication spec has been deleted
	// or deleted and recreated, which is signaled by change in spec internal id
	// perform clean up
	spec, _ := pipelineMgr.repl_spec_svc.ReplicationSpec(replId)
	if spec == nil || (rep_status.GetSpecInternalId() != "" && rep_status.GetSpecInternalId() != spec.InternalId) {
		if spec == nil {
			pipelineMgr.logger.Infof("%v Cleaning up replication status since repl spec has been deleted.\n", replId)
		} else {
			pipelineMgr.logger.Infof("%v Cleaning up replication status since repl spec has been deleted and recreated. oldSpecInternalId=%v, newSpecInternalId=%v\n", replId, rep_status.GetSpecInternalId(), spec.InternalId)
		}
		pipelineMgr.checkpoint_svc.DelCheckpointsDocs(replId)
		rep_status.ResetStorage(common.MainPipeline)
		pipelineMgr.repl_spec_svc.SetDerivedObj(replId, nil)
		// close the connection pool for the replication
		pools := base.ConnPoolMgr().FindPoolNamesByPrefix(replId)
		for _, poolName := range pools {
			base.ConnPoolMgr().RemovePool(poolName)
		}
	}
	...
}
`StopPipeline` calls `StopBackfillPipeline`, but `StopBackfillPipeline` differs from `StopPipeline` in that it needs an explicit call to `CleanBackfillPipeline` to delete the backfill pipeline's checkpoints and reset its stats. Checkpoints for the backfill pipeline are deleted in `postDeleteBackfillRepl` as part of the replication spec deletion callback, but the stats are never reset.
Note that this bug should be in master as well and is not a regression in 7.2.5.
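The clean-up gap described above can be modelled with a small self-contained sketch. This is not the real goxdcr code: `ReplicationStatus`, `stopPipeline`, and the stat map are simplified stand-ins, and the "fix" branch merely mirrors the backport's commit message (reset backfill stats alongside main-pipeline stats when the spec is deleted or recreated).

```go
package main

import "fmt"

// PipelineType is a stand-in for goxdcr's pipeline type enum.
type PipelineType int

const (
	MainPipeline PipelineType = iota
	BackfillPipeline
)

// ReplicationStatus models per-pipeline stat storage kept on a
// replication status object (hypothetical, simplified).
type ReplicationStatus struct {
	stats map[PipelineType]int64 // e.g. changes_left per pipeline
}

// ResetStorage clears the stats for one pipeline type.
func (r *ReplicationStatus) ResetStorage(t PipelineType) {
	r.stats[t] = 0
}

// stopPipeline models the clean-up path in StopPipeline: before the
// fix, only the main pipeline's storage was reset on spec deletion.
func stopPipeline(r *ReplicationStatus, specDeleted, withFix bool) {
	if specDeleted {
		r.ResetStorage(MainPipeline)
		if withFix {
			// The backport's change, per the commit message:
			// also reset backfill stats on spec delete/recreate.
			r.ResetStorage(BackfillPipeline)
		}
	}
}

func main() {
	r := &ReplicationStatus{stats: map[PipelineType]int64{
		MainPipeline:     0,
		BackfillPipeline: 11884587, // mutations still pending backfill
	}}
	stopPipeline(r, true, false) // old behaviour: stale stat survives
	fmt.Println("without fix, backfill changes_left:", r.stats[BackfillPipeline])
	stopPipeline(r, true, true) // fixed behaviour: stat is reset
	fmt.Println("with fix, backfill changes_left:", r.stats[BackfillPipeline])
}
```

If the recreated replication then reuses the same status object, the stale backfill value is exactly what shows up as the stuck `changes_left` counter.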
Sumukh Bhat April 2, 2024 at 4:15 AM (Edited)
Update: I was able to reproduce this locally on neo HEAD. If we delete a bucket that has an ongoing backfill replication, then recreate the bucket with the same name and recreate the replication, the backfill replication's stats appear to be carried over from the old replication.
Ayush Nayyar April 1, 2024 at 7:47 PM
@Neil Huang I was able to replicate this in 7.2.5-7572 as well.
Details
Assignee: Ayush Nayyar
Reporter: Ayush Nayyar
Is this a Regression?: No
Triage: Untriaged
Story Points: 0
Priority: Critical
Instabug: Open Instabug
Backfill mutations are stuck at 31.8M mutations remaining when the target bucket is deleted while the backfill mutation count is high. The mutations-remaining counter does not move and stays at the same non-zero value. There were also some target topology changes while the replication was in progress.
Steps to reproduce:
1. Create 2 on-prem clusters of 2 nodes each.
2. Create a test bucket on each cluster.
3. After adding a remote cluster reference from the source to the target cluster, set up replication between the clusters with the scopes-and-collections mapping set to a non-default scope.
4. Load documents into the default scope and default collection; these will not be replicated yet, since the mapping covers only the non-default scope.
5. Edit the replication and add the default scope and collection to the scopes-and-collections mapping.
6. Once documents on source and target are in sync, delete the target bucket, then recreate the bucket and the replication with the same names and settings.
7. Backfill mutations are stuck, even though the number of docs on source and target is the same.
Snippet from the logs:
# HELP xdcr_changes_left_total Given the vBuckets of this node, the number of sequence numbers that need to be processed (either replicated or handled) before catching up to the high sequence numbers for the vBuckets.
# TYPE xdcr_changes_left_total gauge
xdcr_changes_left_total{targetClusterUUID="da69a312ce9421936ede6d891380ab3b", sourceBucketName="test", targetBucketName="test", pipelineType="Backfill"} 11884587
xdcr_changes_left_total{targetClusterUUID="da69a312ce9421936ede6d891380ab3b", sourceBucketName="test", targetBucketName="test", pipelineType="Main"} 0
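Whether the counter is stuck can be checked programmatically by parsing the `xdcr_changes_left_total` samples out of the Prometheus text exposition. A minimal sketch follows; it is illustrative only (not part of goxdcr), and assumes the metric lines look like the snippet above:

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// changesLeft extracts a pipelineType -> value map from
// xdcr_changes_left_total sample lines in Prometheus text format.
func changesLeft(metrics string) map[string]int64 {
	// Match the metric name, its label block, and the sample value.
	sampleRe := regexp.MustCompile(`xdcr_changes_left_total\s*\{([^}]*)\}\s*(\d+)`)
	typeRe := regexp.MustCompile(`pipelineType="([^"]*)"`)
	out := map[string]int64{}
	for _, line := range strings.Split(metrics, "\n") {
		m := sampleRe.FindStringSubmatch(line)
		if m == nil {
			continue // skip HELP/TYPE and unrelated lines
		}
		v, err := strconv.ParseInt(m[2], 10, 64)
		if err != nil {
			continue
		}
		if t := typeRe.FindStringSubmatch(m[1]); t != nil {
			out[t[1]] = v
		}
	}
	return out
}

func main() {
	sample := `# TYPE xdcr_changes_left_total gauge
xdcr_changes_left_total{pipelineType="Backfill"} 11884587
xdcr_changes_left_total{pipelineType="Main"} 0`
	fmt.Println(changesLeft(sample))
}
```

Polling this value and seeing `Backfill` pinned at the same non-zero number while `Main` stays at 0 matches the stuck state reported here.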
Logs (All nodes on 7.2.5-7576)
Source (172.23.105.4, 172.23.96.197):
https://cb-engineering.s3.amazonaws.com/MB-60859_src/collectinfo-2024-04-01T131500-ns_1%40172.23.105.4.zip
https://cb-engineering.s3.amazonaws.com/MB-60859_src/collectinfo-2024-04-01T131500-ns_1%40172.23.96.197.zip
Target (172.23.105.195, 172.23.96.183):
https://cb-engineering.s3.amazonaws.com/MB-60859_dest/collectinfo-2024-04-01T131652-ns_1%40172.23.105.195.zip
https://cb-engineering.s3.amazonaws.com/MB-60859_dest/collectinfo-2024-04-01T131652-ns_1%40172.23.96.183.zip