[BP 7.1.5] - XDCR - Checkpoint manager can create out-of-order checkpoint records

Description

It is theoretically possible for the checkpoint manager to spawn too many checkpoint go-routines and create out-of-order checkpoints when there is a large number of slow target nodes.

There are some red herrings.
First, when looking at pprof files, we can see an extraordinary number of go-routines with this type of stack trace:

goxdcr 6.6.3

This is likely because the checkpoint manager fires off a checkpoint operation periodically:

https://src.couchbase.org/source/xref/6.6.3/goproj/src/github.com/couchbase/goxdcr/pipeline_svc/checkpoint_manager.go#971

If a replication contains many targets, and the targets are slow, then it is possible for this operation to take a long time:
https://src.couchbase.org/source/xref/6.6.3/goproj/src/github.com/couchbase/goxdcr/pipeline_svc/checkpoint_manager.go#404-406
Theoretically, it could stay stuck for a long time without any timeout: ExponentialBackoffExecutorWithFinishSignal does not take timeouts into consideration; it just retries.
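To make the build-up scenario concrete, here is a minimal, self-contained sketch (not the actual goxdcr code; runPeriodicCheckpoints and fetchTargetSeqnosWithRetries are hypothetical names) of a periodic checkpoint loop whose per-tick goroutines can pile up behind a shared lock when the target is slow and the retry helper has no overall deadline:

```go
package sketch

import (
	"sync"
	"time"
)

// runPeriodicCheckpoints spawns one checkpoint goroutine per tick. If the work
// holds a shared lock and has no overall deadline, goroutines accumulate
// behind the lock whenever a single pass takes longer than the interval.
func runPeriodicCheckpoints(interval time.Duration, stopCh <-chan struct{}) {
	var targetClientMtx sync.RWMutex // stands in for the single lock guarding the target memcached client

	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for {
		select {
		case <-stopCh:
			return
		case <-ticker.C:
			go func() {
				targetClientMtx.Lock()
				defer targetClientMtx.Unlock()
				// Hypothetical stand-in for fetching target VBUUIDs/seqnos with
				// exponential-backoff retries but no timeout: it can block far
				// longer than `interval`, so later ticks queue up behind it.
				fetchTargetSeqnosWithRetries()
			}()
		}
	}
}

// fetchTargetSeqnosWithRetries is a placeholder for a slow, retried RPC to the target.
func fetchTargetSeqnosWithRetries() {
	time.Sleep(10 * time.Second)
}
```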

Second, there is an unequal number of “Start checkpointing” vs. “Done checkpointing” messages in the logs.

In either case, it means that there is a bottleneck and that more concurrent go-routines are (or were) stuck than the periodic checkpoint interval can accommodate.

If Num(Start) > Num(Done) -> there are go-routines building up
If Num(Done) > Num(Start) -> Previously built-up go-routines are finishing
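As a quick sanity check, the two messages can be tallied from a goxdcr log; the helper below is a hypothetical sketch (not part of goxdcr), using the log phrases quoted in this description:

```go
package sketch

import (
	"bufio"
	"os"
	"strings"
)

// countCheckpointMarkers tallies “Start checkpointing” vs “Done checkpointing”
// lines in a goxdcr log file. Start > Done suggests checkpoint goroutines are
// currently piling up; Done > Start suggests a previous pile-up is draining.
func countCheckpointMarkers(path string) (starts, dones int, err error) {
	f, err := os.Open(path)
	if err != nil {
		return 0, 0, err
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		line := scanner.Text()
		if strings.Contains(line, "Start checkpointing") {
			starts++
		}
		if strings.Contains(line, "Done checkpointing") {
			dones++
		}
	}
	return starts, dones, scanner.Err()
}
```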

How incorrect checkpoints could be created
In Go 1.9, a “starvation mode” was introduced for sync.Mutex (https://cs.opensource.google/go/go/+/refs/tags/go1.11.5:src/sync/mutex.go;l=44-66).
Starvation mode means that the lock could become a FIFO queue for callers if a caller waits for more than 1ms.

When I checked the source for Go 1.11.5, which is what XDCR 6.6.3 is built on, I found that RWMutex does not have such a starvation mode:
https://cs.opensource.google/go/go/+/refs/tags/go1.11.5:src/sync/rwmutex.go

However, the code in Checkpoint Manager here:
https://src.couchbase.org/source/xref/6.6.3/goproj/src/github.com/couchbase/goxdcr/pipeline_svc/checkpoint_manager.go#404-406
uses an RWMutex. This means that the locking mechanism here is a “Thunder Lock” (https://hidetatz.io/articles/2021/04/07/go_mutex_starvation/).
The theory below is based on this concept.
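To make the “Thunder Lock” idea concrete, here is a toy lock built on sync.Cond (purely illustrative, not goxdcr code): Unlock wakes every blocked waiter and they all race to re-acquire, so there is no FIFO hand-off and a late arrival can win the lock.

```go
package sketch

import "sync"

// thunderLock is a toy illustration of a "thunder lock": Unlock wakes every
// blocked waiter (Broadcast) and they all race to re-acquire, so there is no
// FIFO hand-off and the goroutine that waited longest may not win.
type thunderLock struct {
	mu     sync.Mutex
	cond   *sync.Cond
	locked bool
}

func newThunderLock() *thunderLock {
	t := &thunderLock{}
	t.cond = sync.NewCond(&t.mu)
	return t
}

func (t *thunderLock) Lock() {
	t.mu.Lock()
	for t.locked {
		t.cond.Wait() // all waiters park here
	}
	t.locked = true
	t.mu.Unlock()
}

func (t *thunderLock) Unlock() {
	t.mu.Lock()
	t.locked = false
	t.mu.Unlock()
	t.cond.Broadcast() // "thundering herd": every waiter wakes and races for the lock
}
```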

When an operation within the lock takes a long time, such as the one at https://src.couchbase.org/source/xref/6.6.3/goproj/src/github.com/couchbase/goxdcr/pipeline_svc/checkpoint_manager.go#404-406 mentioned above, it is theoretically possible for multiple go-routines to be spawned within this time.
This leads to the condition where there are more “Start checkpointing for replication” than “Done checkpointing for replication” messages in the logs, and to the growing number of go-routines in the stack trace mentioned above.

Checkpoint manager, in a nutshell, gathers data in the following way, given a batch of VBuckets (a simplified sketch follows the list):

  1. Gets the throughSeqNo

  2. Acquires the single lock to gain access to target memcached client

  3. One at a time, retrieves each target VBUuid’s and seqnos.
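A simplified sketch of that sequence follows; all type and function names here are hypothetical and chosen for illustration only (the real code lives in checkpoint_manager.go). The key point is that the throughSeqno is captured before blocking on the lock, while the target seqnos are read only after the lock is acquired.

```go
package sketch

import "sync"

// All names below are hypothetical, for illustration only.
type ckptRecord struct {
	throughSeqno uint64 // source progress, captured before taking the lock
	targetVBUUID uint64 // target VB identity, read while holding the lock
	targetSeqno  uint64 // target progress, read while holding the lock
}

type ckptManager struct {
	targetClientMtx sync.RWMutex // the single lock guarding the target memcached client
	records         []ckptRecord
}

// checkpointOnce mirrors the three steps above for one batch of VBs.
func (c *ckptManager) checkpointOnce(vbs []uint16) {
	through := getThroughSeqno() // step 1: captured *before* blocking on the lock

	c.targetClientMtx.Lock() // step 2: may block for a long time if the target is slow
	defer c.targetClientMtx.Unlock()

	for range vbs {
		uuid, seqno := fetchTargetVBStats() // step 3: slow RPCs, one VB at a time
		c.records = append(c.records, ckptRecord{throughSeqno: through, targetVBUUID: uuid, targetSeqno: seqno})
	}
}

// Placeholders standing in for the real seqno sources.
func getThroughSeqno() uint64              { return 0 }
func fetchTargetVBStats() (uint64, uint64) { return 0, 0 }
```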

When multiple go-routines build up, it is possible that:

Given that the throughSeqno increments from 1 to 2 to 3, etc., and the target seqno increments from A to B to C, etc.:

GoRoutine 1:

  1. Gets the throughSeqno of 1

  2. Acquires the single lock

  3. Due to a slow target, gets stuck for a long time.
    (Note that time_committed does not account for the RPC call, only the local simple_store persistence time.)

In the meantime, CkptMgr launches more go-routines:

GoRoutine 2:

  1. Gets the throughSeqno of 10. (At this point, the target high seqno theoretically is B).

  2. Tries to acquire the lock and is blocked

<Some time passes>

GoRoutine 3:

  1. Gets the throughSeqno of 100 (At this point, the target high seqno is theoretically C).

  2. Tries to acquire the lock and is blocked

At this point, GoRoutine 1 finishes. The checkpoint now consists of one record (simplified).

GoRoutines 2 and 3 fight for the lock; GoRoutine 3 gets it.

GoRoutine 3 acquires a target highSeqno of D, where D comes after C.

GoRoutine 2 finally gets the lock and acquires a target highSeqno of F, where F comes after D.

Notice how the checkpoint is incorrect: the chronological order of the source is broken. An “older” record (e.g. <100, D>) no longer necessarily represents an earlier point in time in terms of the source, even though it does represent an earlier point in time on the target.

At this point, I do not see this as a cause of data loss when a target VB fails over and gets a new VBUUID, but it may incur the extra cost of re-streaming data.
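The interleaving above can be replayed deterministically. The sketch below (hypothetical values, not goxdcr code) simply reproduces the lock-grant order GoRoutine 1, 3, 2 and shows that the resulting records are no longer monotonic in source seqno:

```go
package sketch

import "fmt"

// demoOutOfOrderRecords replays the interleaving described above with
// hypothetical values: throughSeqnos are captured before blocking on the lock,
// but target seqnos are read in lock-grant order (GoRoutine 1, then 3, then 2).
func demoOutOfOrderRecords() {
	type record struct {
		throughSeqno uint64
		targetSeqno  string
	}

	// throughSeqno captured by each goroutine *before* it blocked on the lock.
	captured := map[string]uint64{"g1": 1, "g2": 10, "g3": 100}

	// Order in which the lock is actually granted, and the target high seqno
	// observed at that moment. GoRoutine 1's value is illustrative only; the
	// description leaves it unspecified.
	grants := []struct {
		goroutine   string
		targetSeqno string
	}{
		{"g1", "B"}, // finishes first after being stuck for a long time
		{"g3", "D"}, // wins the race against GoRoutine 2
		{"g2", "F"}, // gets the lock last, so it sees the newest target seqno
	}

	var ckpt []record
	for _, g := range grants {
		ckpt = append(ckpt, record{throughSeqno: captured[g.goroutine], targetSeqno: g.targetSeqno})
	}

	// Prints [{1 B} {100 D} {10 F}]: source seqnos are no longer monotonic
	// across the records, even though the target seqnos are.
	fmt.Println(ckpt)
}
```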

 

 

Issue

Checkpoint Manager created checkpoint records out-of-sequence when many target nodes ran slowly.

Resolution

Checkpoint Manager now creates checkpoints in sequence when target nodes are slow.

Environment

None

Link to Log File, atop/blg, CBCollectInfo, Core dump

None

Release Notes Description

None

Activity


Ayush Nayyar June 26, 2023 at 9:29 AM

Verified on 7.1.5-3848 with a large source and an undersized target cluster. No abnormal stack trace observed.

Neil Huang June 21, 2023 at 4:46 PM

To verify, create a replication from a large source cluster to an undersized target cluster. Make sure that the following stack trace doesn't show up in abnormally large numbers.

CB robot June 3, 2023 at 4:14 AM

Build couchbase-server-7.1.5-3829 contains goxdcr commit e659c3b with commit message:
: periodic ckpt to prevent build up if target is slow

CB robot June 3, 2023 at 4:14 AM

Build couchbase-server-7.1.5-3829 contains goxdcr commit c73c66b with commit message:
: unit test

Fixed

Details

Is this a Regression?

No

Triage

Untriaged

Created May 8, 2023 at 10:19 PM
Updated September 19, 2023 at 10:24 AM
Resolved June 6, 2023 at 5:39 PM