[BP 7.1.5] - XDCR - Checkpoint manager can create out-of-order checkpoint records

Description

It is theoretically possible for the checkpoint manager to spawn too many checkpoint go-routines and create out-of-order checkpoints when there is a large number of slow target nodes.

There are some red herrings.
First, when looking at pprof files, we can see an extraordinary number of go-routines with this type of stack trace:

goxdcr 6.6.3

This is likely because the checkpoint manager fires off a checkpoint operation periodically:

https://src.couchbase.org/source/xref/6.6.3/goproj/src/github.com/couchbase/goxdcr/pipeline_svc/checkpoint_manager.go#971

If a replication contains many targets, and the targets are slow, then it is possible for this operation to take a long time:
https://src.couchbase.org/source/xref/6.6.3/goproj/src/github.com/couchbase/goxdcr/pipeline_svc/checkpoint_manager.go#404-406
Theoretically, it could stay stuck for a long time without any timeout: ExponentialBackoffExecutorWithFinishSignal does not take timeouts into consideration; it just retries.
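To make the build-up scenario concrete, here is a minimal, self-contained sketch (not the actual goxdcr code; runPeriodicCheckpoints and fetchTargetSeqnosWithRetries are hypothetical names) of a periodic checkpoint loop whose per-tick goroutines can pile up behind a shared lock when the target is slow and the retry helper has no overall deadline:

```go
package sketch

import (
	"sync"
	"time"
)

// runPeriodicCheckpoints spawns one checkpoint goroutine per tick. If the work
// holds a shared lock and has no overall deadline, goroutines accumulate
// behind the lock whenever a single pass takes longer than the interval.
func runPeriodicCheckpoints(interval time.Duration, stopCh <-chan struct{}) {
	var targetClientMtx sync.RWMutex // stands in for the single lock guarding the target memcached client

	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for {
		select {
		case <-stopCh:
			return
		case <-ticker.C:
			go func() {
				targetClientMtx.Lock()
				defer targetClientMtx.Unlock()
				// Hypothetical stand-in for fetching target VBUUIDs/seqnos with
				// exponential-backoff retries but no timeout: it can block far
				// longer than `interval`, so later ticks queue up behind it.
				fetchTargetSeqnosWithRetries()
			}()
		}
	}
}

// fetchTargetSeqnosWithRetries is a placeholder for a slow, retried RPC to the target.
func fetchTargetSeqnosWithRetries() {
	time.Sleep(10 * time.Second)
}
```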

Second, there is an unequal number of “Start checkpointing” vs. “Done checkpointing” messages in the logs.

In either case, it means that there is a bottleneck and that more concurrent go-routines are (or were) stuck than the periodic checkpoint interval can accommodate.

If Num(Start) > Num(Done) -> there are go-routines building up
If Num(Done) > Num(Start) -> Previously built-up go-routines are finishing
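As a quick sanity check, the two messages can be tallied from a goxdcr log; the helper below is a hypothetical sketch (not part of goxdcr), using the log phrases quoted in this description:

```go
package sketch

import (
	"bufio"
	"os"
	"strings"
)

// countCheckpointMarkers tallies “Start checkpointing” vs “Done checkpointing”
// lines in a goxdcr log file. Start > Done suggests checkpoint goroutines are
// currently piling up; Done > Start suggests a previous pile-up is draining.
func countCheckpointMarkers(path string) (starts, dones int, err error) {
	f, err := os.Open(path)
	if err != nil {
		return 0, 0, err
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		line := scanner.Text()
		if strings.Contains(line, "Start checkpointing") {
			starts++
		}
		if strings.Contains(line, "Done checkpointing") {
			dones++
		}
	}
	return starts, dones, scanner.Err()
}
```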

How incorrect checkpoints could be created
In Go 1.9, a “starvation mode” was introduced for sync.Mutex (https://cs.opensource.google/go/go/+/refs/tags/go1.11.5:src/sync/mutex.go;l=44-66).
Starvation mode means that the lock could become a FIFO queue for callers if a caller waits for more than 1ms.

When I checked the source for Go 1.11.5, which is what XDCR 6.6.3 is built on, I found that RWMutex does not have such a starvation mode:
https://cs.opensource.google/go/go/+/refs/tags/go1.11.5:src/sync/rwmutex.go

However, the code in Checkpoint Manager here:
https://src.couchbase.org/source/xref/6.6.3/goproj/src/github.com/couchbase/goxdcr/pipeline_svc/checkpoint_manager.go#404-406
uses an RWMutex. This means that the locking mechanism here is a “Thunder Lock” (https://hidetatz.io/articles/2021/04/07/go_mutex_starvation/).
The theory below is based on this concept.
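To make the “Thunder Lock” idea concrete, here is a toy lock built on sync.Cond (purely illustrative, not goxdcr code): Unlock wakes every blocked waiter and they all race to re-acquire, so there is no FIFO hand-off and a late arrival can win the lock.

```go
package sketch

import "sync"

// thunderLock is a toy illustration of a "thunder lock": Unlock wakes every
// blocked waiter (Broadcast) and they all race to re-acquire, so there is no
// FIFO hand-off and the goroutine that waited longest may not win.
type thunderLock struct {
	mu     sync.Mutex
	cond   *sync.Cond
	locked bool
}

func newThunderLock() *thunderLock {
	t := &thunderLock{}
	t.cond = sync.NewCond(&t.mu)
	return t
}

func (t *thunderLock) Lock() {
	t.mu.Lock()
	for t.locked {
		t.cond.Wait() // all waiters park here
	}
	t.locked = true
	t.mu.Unlock()
}

func (t *thunderLock) Unlock() {
	t.mu.Lock()
	t.locked = false
	t.mu.Unlock()
	t.cond.Broadcast() // "thundering herd": every waiter wakes and races for the lock
}
```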

When an operation within the lock takes a long time, such as the one at https://src.couchbase.org/source/xref/6.6.3/goproj/src/github.com/couchbase/goxdcr/pipeline_svc/checkpoint_manager.go#404-406 mentioned above, it is theoretically possible for multiple go-routines to be spawned within this time.
This leads to the condition where there are more “Start checkpointing for replication” than “Done checkpointing for replication” messages in the logs, and to the growing number of go-routines in the stack trace mentioned above.

Checkpoint manager, in a nutshell, gathers data in the following way, given a batch of VBuckets (a simplified sketch follows the list):

  1. Gets the throughSeqNo

  2. Acquires the single lock to gain access to target memcached client

  3. One at a time, retrieves each target VBUuid’s and seqnos.
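A simplified sketch of that sequence follows; all type and function names here are hypothetical and chosen for illustration only (the real code lives in checkpoint_manager.go). The key point is that the throughSeqno is captured before blocking on the lock, while the target seqnos are read only after the lock is acquired.

```go
package sketch

import "sync"

// All names below are hypothetical, for illustration only.
type ckptRecord struct {
	throughSeqno uint64 // source progress, captured before taking the lock
	targetVBUUID uint64 // target VB identity, read while holding the lock
	targetSeqno  uint64 // target progress, read while holding the lock
}

type ckptManager struct {
	targetClientMtx sync.RWMutex // the single lock guarding the target memcached client
	records         []ckptRecord
}

// checkpointOnce mirrors the three steps above for one batch of VBs.
func (c *ckptManager) checkpointOnce(vbs []uint16) {
	through := getThroughSeqno() // step 1: captured *before* blocking on the lock

	c.targetClientMtx.Lock() // step 2: may block for a long time if the target is slow
	defer c.targetClientMtx.Unlock()

	for range vbs {
		uuid, seqno := fetchTargetVBStats() // step 3: slow RPCs, one VB at a time
		c.records = append(c.records, ckptRecord{throughSeqno: through, targetVBUUID: uuid, targetSeqno: seqno})
	}
}

// Placeholders standing in for the real seqno sources.
func getThroughSeqno() uint64              { return 0 }
func fetchTargetVBStats() (uint64, uint64) { return 0, 0 }
```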

When multiple go-routines build up, it is possible that:

Given that the throughSeqno increments from 1 to 2 to 3, etc., and the target seqno increments from A to B to C, etc.:

GoRoutine 1:

  1. Gets the throughSeqno of 1

  2. Acquires the single lock

  3. Due to a slow target, gets stuck for a long time.
    (Note that time_committed does not account for the RPC call, only the local simple_store persistence time.)

In the meantime, CkptMgr launches more go-routines:

GoRoutine 2:

  1. Gets the throughSeqno of 10. (At this point, the target high seqno theoretically is B).

  2. Tries to acquire the lock and is blocked

<Some time passes>

GoRoutine 3:

  1. Gets the throughSeqno of 100 (At this point, the target high seqno is theoretically C).

  2. Tries to acquire the lock and is blocked

At this point, GoRoutine 1 finishes. The checkpoint now consists of one record (simplified).

GoRoutines 2 and 3 fight for the lock; GoRoutine 3 gets it.

GoRoutine 3 acquires a target highSeqno of D, where D comes after C.

GoRoutine 2 finally gets the lock and acquires a target highSeqno of F, where F comes after D.

Notice how the checkpoint is incorrect: the chronological order of the source is broken. An “older” record (e.g. <100, D>) no longer necessarily represents an earlier point in time in terms of the source, even though it does represent an earlier point in time on the target.

At this point, I do not see this as a cause of data loss when a target VB fails over and gets a new VBUUID, but it may incur the extra cost of re-streaming data.
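The interleaving above can be replayed deterministically. The sketch below (hypothetical values, not goxdcr code) simply reproduces the lock-grant order GoRoutine 1, 3, 2 and shows that the resulting records are no longer monotonic in source seqno:

```go
package sketch

import "fmt"

// demoOutOfOrderRecords replays the interleaving described above with
// hypothetical values: throughSeqnos are captured before blocking on the lock,
// but target seqnos are read in lock-grant order (GoRoutine 1, then 3, then 2).
func demoOutOfOrderRecords() {
	type record struct {
		throughSeqno uint64
		targetSeqno  string
	}

	// throughSeqno captured by each goroutine *before* it blocked on the lock.
	captured := map[string]uint64{"g1": 1, "g2": 10, "g3": 100}

	// Order in which the lock is actually granted, and the target high seqno
	// observed at that moment. GoRoutine 1's value is illustrative only; the
	// description leaves it unspecified.
	grants := []struct {
		goroutine   string
		targetSeqno string
	}{
		{"g1", "B"}, // finishes first after being stuck for a long time
		{"g3", "D"}, // wins the race against GoRoutine 2
		{"g2", "F"}, // gets the lock last, so it sees the newest target seqno
	}

	var ckpt []record
	for _, g := range grants {
		ckpt = append(ckpt, record{throughSeqno: captured[g.goroutine], targetSeqno: g.targetSeqno})
	}

	// Prints [{1 B} {100 D} {10 F}]: source seqnos are no longer monotonic
	// across the records, even though the target seqnos are.
	fmt.Println(ckpt)
}
```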

 

 

Issue

Checkpoint Manager created checkpoint records out-of-sequence when many target nodes ran slowly.

Resolution

Checkpoint Manager now creates checkpoints in sequence when target nodes are slow.

Environment

None

Link to Log File, atop/blg, CBCollectInfo, Core dump

None

Release Notes Description

None

Activity


Ayush Nayyar June 26, 2023 at 9:29 AM

Verified on 7.1.5-3848 with a large source and an undersized target cluster. No abnormal stack trace observed.

Neil Huang June 21, 2023 at 4:46 PM

To verify, create a replication from a large source cluster to an undersized target cluster. Make sure that the following stack trace doesn't show up in abnormally large numbers.

CB robot June 3, 2023 at 4:14 AM

Build couchbase-server-7.1.5-3829 contains goxdcr commit e659c3b with commit message:
: periodic ckpt to prevent build up if target is slow

CB robot June 3, 2023 at 4:14 AM

Build couchbase-server-7.1.5-3829 contains goxdcr commit c73c66b with commit message:
: unit test

Fixed

Details

Is this a Regression?

No

Triage

Untriaged

Created May 8, 2023 at 10:19 PM
Updated September 19, 2023 at 10:24 AM
Resolved June 6, 2023 at 5:39 PM