Fixed
Details
Assignee: Ayush Nayyar
Reporter: Neil Huang
Is this a Regression?: No
Triage: Untriaged
Story Points: 1
Priority: Major
Created May 8, 2023 at 10:19 PM
Updated September 19, 2023 at 10:24 AM
Resolved June 6, 2023 at 5:39 PM
It is theoretically possible for checkpoint manager to spawn too many checkpoint go-routines and create out-of-order checkpoints when there’s a large number of slow target nodes.
There are some red herrings.
First, when looking at pprof files, we can see an extraordinary number of go-routines with this type of stack trace:
goxdcr 6.6.3
This is likely, given that the checkpoint manager fires off a checkpoint operation periodically:
https://src.couchbase.org/source/xref/6.6.3/goproj/src/github.com/couchbase/goxdcr/pipeline_svc/checkpoint_manager.go#971
If a replication contains many targets, and if targets are slow, then it is possible for this operation to take a long time:
https://src.couchbase.org/source/xref/6.6.3/goproj/src/github.com/couchbase/goxdcr/pipeline_svc/checkpoint_manager.go#404-406
Theoretically, it could get stuck for a long time without any timeout: ExponentialBackoffExecutorWithFinishSignal does not take a timeout into consideration; it just retries.
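A generic sketch of this pattern (the actual signature and behaviour of ExponentialBackoffExecutorWithFinishSignal in goxdcr may differ): the loop backs off exponentially and stops only on success or on a finish signal, so with no overall timeout a persistently failing call can block the caller indefinitely.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// retryWithFinishSignal is a hypothetical stand-in for an exponential-backoff
// retry loop that is gated only by a finish signal. Note the absence of any
// overall timeout: if op keeps failing and finCh never fires, this never returns.
func retryWithFinishSignal(op func() error, initialWait time.Duration, finCh <-chan bool) error {
	wait := initialWait
	for {
		if err := op(); err == nil {
			return nil
		}
		select {
		case <-finCh:
			return errors.New("aborted by finish signal")
		case <-time.After(wait):
			wait *= 2 // back off exponentially, but retry forever
		}
	}
}

func main() {
	attempts := 0
	err := retryWithFinishSignal(func() error {
		attempts++
		if attempts < 4 {
			return errors.New("target still slow")
		}
		return nil
	}, 10*time.Millisecond, make(chan bool))
	fmt.Println("attempts:", attempts, "err:", err)
}
```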
Second, there is an unequal number of "Start checkpointing" vs "Done Checkpointing" messages in the logs.
In either case, it means that there is a bottleneck: checkpoint operations take (or took) longer than the periodic checkpoint interval, so concurrent go-routines build up (see the sketch after the list below).
If Num(Start) > Num(Done) -> there are go-routines building up
If Num(Done) > Num(Start) -> Previously built-up go-routines are finishing
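To illustrate the build-up, here is a minimal sketch of the pattern described above (not the actual goxdcr code; the interval, durations, log lines and counters are made up): a ticker spawns a go-routine per checkpoint interval, every go-routine serializes on one shared lock around a slow target operation, and "Start checkpointing" is therefore logged faster than "Done checkpointing".

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	var mu sync.RWMutex                    // shared lock guarding the target client (assumed)
	interval := 50 * time.Millisecond      // periodic checkpoint interval
	slowTargetOp := 300 * time.Millisecond // one slow target op spans several intervals

	var countMu sync.Mutex
	starts, dones := 0, 0

	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for i := 0; i < 10; i++ {
		<-ticker.C
		go func(n int) {
			countMu.Lock()
			starts++
			countMu.Unlock()
			fmt.Printf("Start checkpointing #%d\n", n)

			mu.Lock()                // every periodic go-routine queues up here
			time.Sleep(slowTargetOp) // stand-in for the slow target RPC under the lock
			mu.Unlock()

			countMu.Lock()
			dones++
			countMu.Unlock()
			fmt.Printf("Done checkpointing #%d\n", n)
		}(i)
	}

	time.Sleep(interval)
	countMu.Lock()
	fmt.Printf("Num(Start)=%d Num(Done)=%d -> go-routines building up\n", starts, dones)
	countMu.Unlock()
	// The remaining go-routines are still queued on the lock when main exits.
}
```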
How incorrect checkpoints could be created
In Go 1.9, a "starvation mode" was introduced for sync.Mutex (https://cs.opensource.google/go/go/+/refs/tags/go1.11.5:src/sync/mutex.go;l=44-66)
Starvation mode means that the lock could become a FIFO queue for callers if a caller waits for more than 1ms.
And when I checked the source for Golang 1.11.5, which is what XDCR 6.6.3 is built with, I found that RWMutex does not have such a starvation mode.
https://cs.opensource.google/go/go/+/refs/tags/go1.11.5:src/sync/rwmutex.go
However, the code in Checkpoint Manager here:
https://src.couchbase.org/source/xref/6.6.3/goproj/src/github.com/couchbase/goxdcr/pipeline_svc/checkpoint_manager.go#404-406
uses an RWMutex. This means that the locking mechanism here is a "Thunder Lock" (https://hidetatz.io/articles/2021/04/07/go_mutex_starvation/)
The theory below is based on this concept.
When an operation within the lock takes a long time, such as https://src.couchbase.org/source/xref/6.6.3/goproj/src/github.com/couchbase/goxdcr/pipeline_svc/checkpoint_manager.go#404-406 mentioned above, it is theoretically possible for multiple go-routines to be spawned within this time.
This leads to the condition where there are more "Start checkpointing for replication" than "Done checkpointing for replication" messages in the logs, and to the growing number of go-routines in the stack trace mentioned above.
Checkpoint manager, in a nutshell, gathers data in the following way (given a batch of vbuckets); see the sketch after this list:
Gets the throughSeqNo
Acquires the single lock to gain access to the target memcached client
Retrieves each target VBUuid and seqno, one at a time.
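A simplified sketch of that flow; the type and helper names here (ckptManager, getThroughSeqno, getTargetVBDetails) are illustrative stand-ins, not the actual goxdcr API. The point is that the throughSeqno is read before the single lock is acquired, and the lock is held across the per-VB round trips to the target.

```go
package main

import (
	"fmt"
	"sync"
)

// Illustrative stand-ins only; not the actual goxdcr types or API.
type ckptRecord struct {
	throughSeqno uint64 // source-side progress
	targetVBUUID uint64 // target-side VB UUID
	targetSeqno  uint64 // target-side high seqno
}

type ckptManager struct {
	targetClientMu sync.RWMutex // the single lock guarding the target memcached client
}

func (c *ckptManager) getThroughSeqno(vb uint16) uint64 { return uint64(vb) * 100 } // stub
func (c *ckptManager) getTargetVBDetails(vb uint16) (uint64, uint64) {
	return uint64(vb) + 1000, uint64(vb) * 90 // stub for the per-VB target round trip
}

// gatherCheckpointData mirrors the three steps listed above for a batch of vbuckets.
func (c *ckptManager) gatherCheckpointData(vbs []uint16) []ckptRecord {
	records := make([]ckptRecord, 0, len(vbs))
	for _, vb := range vbs {
		throughSeqno := c.getThroughSeqno(vb) // 1. get the throughSeqno

		c.targetClientMu.Lock()                 // 2. acquire the single lock
		uuid, seqno := c.getTargetVBDetails(vb) // 3. one at a time; a slow target holds the lock
		c.targetClientMu.Unlock()

		records = append(records, ckptRecord{throughSeqno, uuid, seqno})
	}
	return records
}

func main() {
	c := &ckptManager{}
	for _, r := range c.gatherCheckpointData([]uint16{0, 1, 2}) {
		fmt.Printf("%+v\n", r)
	}
}
```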
When multiple go-routines build up, it is possible that:
Given that the throughSeqno increments from 1 to 2 to 3, etc., and the target seqno increments from A to B to C, etc.:
GoRoutine 1:
Gets the throughSeqno of 1
Acquires the single lock
Due to slow target, gets stuck for a long time.
(Note that time_committed does not account for the RPC call, just the local simple_store persistence time.)
In the meantime, CkptMgr launches more go-routines:
GoRoutine 2:
Gets the throughSeqno of 10. (At this point, the target high seqno theoretically is B).
Tries to acquire the lock and is blocked
<Some time passes>
GoRoutine 3:
Gets the throughSeqno of 100 (At this point, the target high seqno is theoretically C).
Tries to acquire the lock and is blocked
At this point, GoRoutine 1 finishes. The checkpoint now consists of one (simplified) record.
GoRoutines 2 and 3 fight for the lock; GoRoutine 3 gets it.
GoRoutine 3 acquires a target highSeqno of D, where D comes after C.
GoRoutine 2 finally gets the lock. Acquires a target highSeqno of F, where F comes after D.
Notice how the checkpoint is now incorrect: the chronological order of the source is broken. An "older" record (e.g. <100, D>, created before <10, F>) no longer necessarily represents an earlier point in time on the source, even though it still represents an earlier point in time on the target.
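To make the end result concrete, the sketch below replays the interleaving above and prints the checkpoint records in the order they were written. The ordering is hard-coded (in the real system it comes from the unfair lock hand-off), and the target seqno observed by GoRoutine 1 is assumed to be B, since the walkthrough leaves it unspecified.

```go
package main

import "fmt"

// record is a simplified checkpoint record: <source throughSeqno, target highSeqno>.
type record struct {
	throughSeqno int
	targetSeqno  string
}

func main() {
	var checkpoint []record

	// GoRoutine 1: captured throughSeqno 1, was stuck on the slow target, and
	// eventually observed some target seqno (assumed to be B here).
	checkpoint = append(checkpoint, record{1, "B"})

	// GoRoutine 3 wins the lock next even though GoRoutine 2 was waiting longer:
	// it captured throughSeqno 100 and observes target seqno D.
	checkpoint = append(checkpoint, record{100, "D"})

	// GoRoutine 2 finally gets the lock: it captured throughSeqno 10 long ago,
	// but observes the current target seqno F (F comes after D).
	checkpoint = append(checkpoint, record{10, "F"})

	fmt.Println("checkpoint records in the order they were written:")
	for _, r := range checkpoint {
		fmt.Printf("  <%d, %s>\n", r.throughSeqno, r.targetSeqno)
	}

	// A record written later should also cover a later source seqno, but it does not:
	for i := 1; i < len(checkpoint); i++ {
		if checkpoint[i].throughSeqno < checkpoint[i-1].throughSeqno {
			fmt.Printf("source order broken: <%d, %s> written after <%d, %s>\n",
				checkpoint[i].throughSeqno, checkpoint[i].targetSeqno,
				checkpoint[i-1].throughSeqno, checkpoint[i-1].targetSeqno)
		}
	}
}
```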
At this point, I do not see this as a cause for data loss when a target VB fails over and gets a new VBUUID, but it may incur the extra cost of re-streaming data.
Issue: Checkpoint Manager created checkpoint records out-of-sequence when many target nodes ran slowly.
Resolution: Checkpoint Manager now creates checkpoints in sequence when target nodes are slow.
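For context only, and not necessarily how the fix was implemented in goxdcr: one common way to keep checkpoints in sequence is to refuse to start a new checkpoint go-routine while the previous one is still in flight, so a slow target causes skipped intervals instead of a pile-up of go-routines holding stale throughSeqnos. A minimal sketch of that idea using an atomic in-progress flag:

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// checkpointer illustrates one possible approach (not necessarily goxdcr's actual
// fix): at most one checkpoint go-routine runs at a time, and a tick that arrives
// while one is still in flight is simply skipped.
type checkpointer struct {
	inFlight int32
}

func (c *checkpointer) maybeCheckpoint(tick int, slowOp time.Duration) {
	// Only start if no checkpoint is currently running.
	if !atomic.CompareAndSwapInt32(&c.inFlight, 0, 1) {
		fmt.Printf("tick %d: previous checkpoint still running, skipping\n", tick)
		return
	}
	go func() {
		defer atomic.StoreInt32(&c.inFlight, 0)
		fmt.Printf("tick %d: start checkpointing\n", tick)
		time.Sleep(slowOp) // stand-in for the slow target round trips
		fmt.Printf("tick %d: done checkpointing\n", tick)
	}()
}

func main() {
	c := &checkpointer{}
	for i := 0; i < 6; i++ {
		c.maybeCheckpoint(i, 180*time.Millisecond)
		time.Sleep(50 * time.Millisecond) // periodic checkpoint interval
	}
	time.Sleep(200 * time.Millisecond) // let the last checkpoint finish
}
```

Because a throughSeqno would only be captured when a checkpoint actually starts, records are then written in non-decreasing source order.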