Couchbase Server
MB-56799

[BP 7.2.1] - XDCR - Checkpoint manager can create out-of-order checkpoint records



    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version: 7.2.1
    • Affects Versions: 6.6.0, 6.6.1, 6.6.2, 6.6.3, 6.6.4, 6.6.5, 7.0.0, 7.0.1, 7.0.2, 7.0.3, 7.0.4, 7.1.0, 7.1.1
    • Component: XDCR
    • Triage: Untriaged
    • 1
    • No


      It is theoretically possible for the checkpoint manager to spawn too many checkpoint go-routines and create out-of-order checkpoint records when there is a large number of slow target nodes.

      There are a couple of telltale symptoms.
      First, looking at the pprof files, we notice an extraordinary number of go-routines in this type of stack trace:

      goxdcr 6.6.3

      GoXDCR Go routine dump
      curl -sS --proxy  -K-
      goroutine profile: total 31366
      26819 @ 0x4301bb 0x430263 0x440b2c 0x44085d 0x47472f 0x4755bd 0x90d8ea 0x914021 0x45d7c1
      #       0x44085c        sync.runtime_SemacquireMutex+0x3c                                                                       /home/couchbase/.cbdepscache/exploded/x86_64/go-1.11.5/go/src/runtime/sema.go:71
      #       0x47472e        sync.(*Mutex).Lock+0xfe                                                                                 /home/couchbase/.cbdepscache/exploded/x86_64/go-1.11.5/go/src/sync/mutex.go:134
      #       0x4755bc        sync.(*RWMutex).Lock+0x2c                                                                               /home/couchbase/.cbdepscache/exploded/x86_64/go-1.11.5/go/src/sync/rwmutex.go:93
      #       0x90d8e9        github.com/couchbase/goxdcr/pipeline_svc.(*CheckpointManager).getHighSeqnoAndVBUuidFromTarget+0x59      /home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/goxdcr/pipeline_svc/checkpoint_manager.go:405
      #       0x914020        github.com/couchbase/goxdcr/pipeline_svc.(*CheckpointManager).performCkpt+0x460                         /home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/goxdcr/pipeline_svc/checkpoint_manager.go:1052
      1516 @ 0x4301bb 0x43ff86 0x8a4449 0x45d7c1
      #       0x8a4448        github.com/couchbase/goxdcr/component.(*AsyncComponentEventListenerImpl).start+0x258    /home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/goxdcr/component/async_listener.go:69
      328 @ 0x4301bb 0x43ff86 0x8d1ce0 0x45d7c1
      #       0x8d1cdf        github.com/couchbase/goxdcr/parts.(*XmemNozzle).selfMonitor+0x1df       /home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/goxdcr/parts/xmem_nozzle.go:2268
      328 @ 0x4301bb 0x43ff86 0x8d2f2b 0x45d7c1
      #       0x8d2f2a        github.com/couchbase/goxdcr/parts.(*XmemNozzle).checkAndRepairBufferMonitor+0x26a       /home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/goxdcr/parts/xmem_nozzle.go:2338

      This buildup is plausible given that the checkpoint manager fires off a checkpoint operation periodically.

      If a replication contains many target nodes, and those targets are slow, then it is possible for the target high-seqno retrieval (getHighSeqnoAndVBUuidFromTarget, visible in the stack trace above) to take a long time.
      Theoretically, it could stay stuck for a long time, because there are no timeouts anywhere in this path:
      ExponentialBackoffExecutorWithFinishSignal does not take a timeout into consideration. It just retries.

      Second, there is an unequal number of “Start checkpointing” vs. “Done checkpointing” messages in the logs:

      $ zipgrepCbCollectXdcr "Start checkpointing for replication" *743* | grep -c .
      $ zipgrepCbCollectXdcr "Done checkpointing for replication" *743* | grep -c .

      Either way, it means there is a bottleneck: more concurrent checkpoint go-routines are stuck (or were stuck) than the periodic checkpoint interval should allow.
      If Num(Start) > Num(Done) -> go-routines are building up
      If Num(Done) > Num(Start) -> previously built-up go-routines are finishing
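      The diagnostic above (which the zipgrepCbCollectXdcr commands perform with grep) boils down to comparing two counts over a log excerpt. A minimal illustrative helper:

```go
package main

import (
	"fmt"
	"strings"
)

// ckptBalance counts "Start checkpointing" vs "Done checkpointing" lines in
// a goxdcr log excerpt. Equal counts mean checkpoint go-routines are keeping
// up; more Starts than Dones means go-routines are piling up behind the
// lock; more Dones than Starts means a previous pileup is draining.
func ckptBalance(log string) (starts, dones int) {
	for _, line := range strings.Split(log, "\n") {
		switch {
		case strings.Contains(line, "Start checkpointing for replication"):
			starts++
		case strings.Contains(line, "Done checkpointing for replication"):
			dones++
		}
	}
	return
}

func main() {
	excerpt := `Start checkpointing for replication 743
Start checkpointing for replication 743
Done checkpointing for replication 743`
	s, d := ckptBalance(excerpt)
	fmt.Println(s, d) // more Starts than Dones: go-routines building up
}
```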

      How incorrect checkpoints could be created
      In Go 1.9, a “starvation mode” was introduced for sync.Mutex (https://cs.opensource.google/go/go/+/refs/tags/go1.11.5:src/sync/mutex.go;l=44-66)
      In starvation mode, the lock becomes a FIFO queue for callers once a caller has waited for more than 1ms.

      However, when I checked the source for Go 1.11.5, which is what XDCR 6.6.3 is built with, RWMutex does not have such a starvation mode.

      The code in Checkpoint Manager, though, uses an RWMutex. This means the locking mechanism here is a “thunder lock” (https://hidetatz.io/articles/2021/04/07/go_mutex_starvation/): when the lock is released, all blocked waiters race for it, with no fairness guarantee.
      The theory below is based on this concept.

      When an operation within the lock takes a long time, such as the target high-seqno fetch at https://src.couchbase.org/source/xref/6.6.3/goproj/src/github.com/couchbase/goxdcr/pipeline_svc/checkpoint_manager.go#404-406 mentioned above, it is theoretically possible for multiple checkpoint go-routines to be spawned within that time.
      This leads to the condition where there are more “Start checkpointing for replication” than “Done checkpointing for replication” messages in the logs, and to the growing number of go-routines in the stack trace shown above.

      In a nutshell, checkpoint manager gathers data in the following way (given a batch of vbuckets):

      1. Gets the throughSeqNo
      2. Acquires the single lock to gain access to target memcached client
      3. One at a time, retrieves each target vbucket’s VBUuid and high seqno.
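      The three steps above can be sketched as follows. This is a stripped-down, hypothetical model (names and signatures are illustrative, not goxdcr's actual API); the key point is that the throughSeqno snapshot in step 1 happens before the lock, so it can be stale by the time the goroutine finally reaches the target under the lock.

```go
package main

import (
	"fmt"
	"sync"
)

// ckptMgr is a hypothetical, minimal model of the flow described above.
type ckptMgr struct {
	targetClientMtx sync.Mutex           // single lock guarding the target client
	records         map[uint16][2]uint64 // vb -> {source throughSeqno, target seqno}
}

// getThroughSeqnos stands in for step 1: snapshot source progress per vb.
func (c *ckptMgr) getThroughSeqnos(vbs []uint16) map[uint16]uint64 {
	out := make(map[uint16]uint64)
	for _, vb := range vbs {
		out[vb] = uint64(vb) * 100 // placeholder values
	}
	return out
}

// getTargetSeqno stands in for step 3's per-vb call to the target; with a
// slow target, the goroutine stalls here while holding the lock.
func (c *ckptMgr) getTargetSeqno(vb uint16) uint64 {
	return uint64(vb) * 100 // placeholder value
}

// gatherCheckpoint mirrors the three steps in order.
func (c *ckptMgr) gatherCheckpoint(vbs []uint16) {
	throughSeqnos := c.getThroughSeqnos(vbs) // 1. snapshot source seqnos (before the lock!)
	c.targetClientMtx.Lock()                 // 2. acquire the single lock
	defer c.targetClientMtx.Unlock()
	for _, vb := range vbs { // 3. one vbucket at a time
		c.records[vb] = [2]uint64{throughSeqnos[vb], c.getTargetSeqno(vb)}
	}
}

func main() {
	c := &ckptMgr{records: make(map[uint16][2]uint64)}
	c.gatherCheckpoint([]uint16{0, 1, 2})
	fmt.Println(len(c.records))
}
```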

      When multiple go-routines build up, the following interleaving is possible.
      Suppose the source throughSeqno increments from 1 to 2 to 3, etc., and the target high seqno increments from A to B to C, etc.

      GoRoutine 1:

      1. Gets the throughSeqno of 1
      2. Acquires the single lock
      3. Due to a slow target, gets stuck for a long time.
        (Note that time_committed does not account for the RPC call, just the local simple_store persistence time)

      In the meantime, CkptMgr launches more go-routines:

      GoRoutine 2:

      1. Gets the throughSeqno of 10. (At this point, the target high seqno theoretically is B).
      2. Tries to acquire the lock and is blocked

      <Some time passes>

      GoRoutine 3:

      1. Gets the throughSeqno of 100 (At this point, the target high seqno is theoretically C).
      2. Tries to acquire the lock and is blocked

      At this point, GoRoutine 1 finishes. The checkpoint now consists of one record (simplified):

      <Source, Target>
         <1, A>

      GoRoutines 2 and 3 fight for the lock; 3 gets it.
      GoRoutine 3 acquires a target highSeqno of D, where D comes after C.

      CkptRecords (More recent on top):
      <Source, Target>
         <100, D>
         <1, A>

      GoRoutine 2 finally gets the lock and acquires a target highSeqno of F, where F comes after D.

      CkptRecords (More recent on top):
      <Source, Target>
         <10, F>
         <100, D>
         <1, A>

      Notice how the checkpoint is incorrect: the chronological order of the source is broken. An “older” record (one lower in the list, e.g. <100, D>) no longer necessarily represents an earlier point in time on the source. It does, however, still represent an earlier point in time on the target.
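      The interleaving above can be reproduced as a deterministic sketch. This is illustrative only: channels force the lock-acquisition order (GoRoutine 1, then 3, then 2) that the thunder lock merely makes possible in practice.

```go
package main

import (
	"fmt"
	"sync"
)

// ckptRecord pairs a source throughSeqno (captured before the lock) with a
// target high seqno (read while holding the lock).
type ckptRecord struct {
	srcSeqno int
	tgtSeqno string
}

// simulate reproduces the scenario above: three checkpoint go-routines
// capture throughSeqnos 1, 10, 100, but win the lock in the order 1, 3, 2.
func simulate() []ckptRecord {
	var mu sync.Mutex
	var records []ckptRecord

	targetSeqnos := []string{"A", "D", "F"} // target high seqno keeps advancing
	next := 0

	checkpoint := func(throughSeqno int, start <-chan struct{}, done chan<- struct{}) {
		<-start // scheduling stand-in: this goroutine now wins the lock
		mu.Lock()
		rec := ckptRecord{throughSeqno, targetSeqnos[next]}
		next++
		records = append([]ckptRecord{rec}, records...) // more recent on top
		mu.Unlock()
		close(done)
	}

	starts := []chan struct{}{make(chan struct{}), make(chan struct{}), make(chan struct{})}
	dones := []chan struct{}{make(chan struct{}), make(chan struct{}), make(chan struct{})}
	for i, through := range []int{1, 10, 100} {
		go checkpoint(through, starts[i], dones[i])
	}
	for _, i := range []int{0, 2, 1} { // lock won in order: GR1, GR3, GR2
		close(starts[i])
		<-dones[i]
	}
	return records
}

func main() {
	for _, r := range simulate() {
		fmt.Printf("<%d, %s>\n", r.srcSeqno, r.tgtSeqno)
	}
}
```

      Running this prints the final checkpoint list <10, F>, <100, D>, <1, A>: source order broken, target order preserved.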

      At this point, I do not see this as a cause of data loss when a target VB fails over and gets a new VBUUID, but it may incur the extra cost of re-streaming data.


      Issue Resolution
      Checkpoint Manager created checkpoint records out-of-sequence when many target nodes ran slowly. Checkpoint Manager now creates checkpoints in sequence when target nodes are slow.





              Ayush Nayyar (ayush.nayyar)
              Neil Huang (neil.huang)


