Couchbase Server - MB-54385

XDCR - Checkpoint manager can create out-of-order checkpoint records


Details

    • Bug
    • Resolution: Fixed
    • Major
    • Morpheus
    • 6.6.0, 6.6.1, 6.6.2, 6.6.3, 6.6.4, 6.6.5, 7.0.0, 7.0.1, 7.0.2, 7.0.3, 7.0.4, 7.1.0, 7.1.1
    • XDCR
    • None
    • Untriaged
    • 1
    • No

    Description

      It is theoretically possible for the checkpoint manager to spawn too many checkpoint go-routines and create out-of-order checkpoints when there is a large number of slow target nodes.

      There are some red herrings.
      First, looking at the pprof files, we can see an extraordinary number of go-routines with this type of stack trace:

      goxdcr 6.6.3

      ==============================================================================
      GoXDCR Go routine dump
      curl -sS --proxy  -K- http://127.0.0.1:9998/debug/pprof/goroutine?debug=1
      ==============================================================================
      goroutine profile: total 31366
      26819 @ 0x4301bb 0x430263 0x440b2c 0x44085d 0x47472f 0x4755bd 0x90d8ea 0x914021 0x45d7c1
      #       0x44085c        sync.runtime_SemacquireMutex+0x3c                                                                       /home/couchbase/.cbdepscache/exploded/x86_64/go-1.11.5/go/src/runtime/sema.go:71
      #       0x47472e        sync.(*Mutex).Lock+0xfe                                                                                 /home/couchbase/.cbdepscache/exploded/x86_64/go-1.11.5/go/src/sync/mutex.go:134
      #       0x4755bc        sync.(*RWMutex).Lock+0x2c                                                                               /home/couchbase/.cbdepscache/exploded/x86_64/go-1.11.5/go/src/sync/rwmutex.go:93
      #       0x90d8e9        github.com/couchbase/goxdcr/pipeline_svc.(*CheckpointManager).getHighSeqnoAndVBUuidFromTarget+0x59      /home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/goxdcr/pipeline_svc/checkpoint_manager.go:405
      #       0x914020        github.com/couchbase/goxdcr/pipeline_svc.(*CheckpointManager).performCkpt+0x460                         /home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/goxdcr/pipeline_svc/checkpoint_manager.go:1052
       
      1516 @ 0x4301bb 0x43ff86 0x8a4449 0x45d7c1
      #       0x8a4448        github.com/couchbase/goxdcr/component.(*AsyncComponentEventListenerImpl).start+0x258    /home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/goxdcr/component/async_listener.go:69
       
      328 @ 0x4301bb 0x43ff86 0x8d1ce0 0x45d7c1
      #       0x8d1cdf        github.com/couchbase/goxdcr/parts.(*XmemNozzle).selfMonitor+0x1df       /home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/goxdcr/parts/xmem_nozzle.go:2268
       
      328 @ 0x4301bb 0x43ff86 0x8d2f2b 0x45d7c1
      #       0x8d2f2a        github.com/couchbase/goxdcr/parts.(*XmemNozzle).checkAndRepairBufferMonitor+0x26a       /home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/goxdcr/parts/xmem_nozzle.go:2338
      

      This buildup is plausible given that the checkpoint manager fires off a checkpoint operation periodically:
      https://src.couchbase.org/source/xref/6.6.3/goproj/src/github.com/couchbase/goxdcr/pipeline_svc/checkpoint_manager.go#971

      If a replication contains many target nodes, and the targets are slow, then it is possible for this operation to take a long time:
      https://src.couchbase.org/source/xref/6.6.3/goproj/src/github.com/couchbase/goxdcr/pipeline_svc/checkpoint_manager.go#404-406
      Theoretically, it could get stuck for a long time without any timeout: ExponentialBackoffExecutorWithFinishSignal does not take a timeout into consideration; it just retries.
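
      To illustrate the point, here is a minimal sketch (hypothetical names, not the goxdcr implementation) contrasting an unbounded retry loop with one bounded by a context deadline. With context.Background() it mimics the "retries forever, no overall timeout" behavior; with a deadline, control returns to the caller instead of the go-routine hanging indefinitely:

      package main

      import (
          "context"
          "errors"
          "fmt"
          "time"
      )

      // retryWithBackoff keeps retrying op with exponential backoff until it
      // succeeds or ctx is done. With ctx = context.Background() this mimics a
      // retry helper that has no overall timeout and can block its caller
      // indefinitely against an unresponsive target.
      func retryWithBackoff(ctx context.Context, op func() error) error {
          backoff := 100 * time.Millisecond
          for {
              if err := op(); err == nil {
                  return nil
              }
              select {
              case <-ctx.Done():
                  return ctx.Err() // bounded variant: give up once the deadline hits
              case <-time.After(backoff):
                  backoff *= 2
              }
          }
      }

      func main() {
          slowTarget := func() error { return errors.New("target not responding") }

          // Bounding the retries with a deadline returns control to the caller
          // after ~1s instead of letting the checkpoint go-routine hang forever.
          ctx, cancel := context.WithTimeout(context.Background(), time.Second)
          defer cancel()
          fmt.Println(retryWithBackoff(ctx, slowTarget))
      }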

      Second, there is an unequal number of "Start checkpointing" vs. "Done checkpointing" log messages:

      $ zipgrepCbCollectXdcr "Start checkpointing for replication" *743* | grep -c .
      1368
      $ zipgrepCbCollectXdcr "Done checkpointing for replication" *743* | grep -c .
      2642
      

      In either case, it indicates a bottleneck: checkpoint go-routines are (or were) stuck for longer than the periodic checkpoint interval, so multiple of them run concurrently.
      If Num(Start) > Num(Done), go-routines are building up.
      If Num(Done) > Num(Start), previously built-up go-routines are finishing.

      How incorrect checkpoints could be created
      In Go 1.9, a "starvation mode" was introduced for sync.Mutex (https://cs.opensource.google/go/go/+/refs/tags/go1.11.5:src/sync/mutex.go;l=44-66).
      Starvation mode means the lock effectively becomes a FIFO queue for callers once a caller has waited for more than 1ms.

      In the source for Go 1.11.5, which is what XDCR 6.6.3 is built with, RWMutex does not have such a starvation mode:
      https://cs.opensource.google/go/go/+/refs/tags/go1.11.5:src/sync/rwmutex.go

      However, the code in Checkpoint Manager here:
      https://src.couchbase.org/source/xref/6.6.3/goproj/src/github.com/couchbase/goxdcr/pipeline_svc/checkpoint_manager.go#404-406
      uses an RWMutex. This means the locking mechanism here behaves as a "thunder lock" (https://hidetatz.io/articles/2021/04/07/go_mutex_starvation/): when the lock is released, all waiters race for it and there is no FIFO guarantee.
      The theory below is based on this concept.
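
      A small illustrative sketch of that behavior (the acquisition order in the output is intentionally not deterministic): go-routines that start waiting on the RWMutex in a known order are not guaranteed to acquire it in that order once the long-running holder releases it.

      package main

      import (
          "fmt"
          "sync"
          "time"
      )

      func main() {
          var mu sync.RWMutex
          var wg sync.WaitGroup

          // "GoRoutine 1" holds the lock for a long time (the slow target call).
          mu.Lock()
          go func() {
              time.Sleep(500 * time.Millisecond)
              mu.Unlock()
          }()

          // "GoRoutine 2" and "GoRoutine 3" start waiting in a known order, but
          // once the holder unlocks, the waiters race for the lock and the
          // acquisition order is not guaranteed to match the arrival order.
          for i := 2; i <= 3; i++ {
              i := i
              wg.Add(1)
              go func() {
                  defer wg.Done()
                  fmt.Printf("goroutine %d waiting\n", i)
                  mu.Lock()
                  fmt.Printf("goroutine %d acquired the lock\n", i)
                  mu.Unlock()
              }()
              time.Sleep(50 * time.Millisecond) // make the arrival order deterministic
          }
          wg.Wait()
      }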

      When an operation inside the lock takes a long time, such as the call at https://src.couchbase.org/source/xref/6.6.3/goproj/src/github.com/couchbase/goxdcr/pipeline_svc/checkpoint_manager.go#404-406 mentioned above, it is theoretically possible for multiple checkpoint go-routines to be spawned within that window. This leads to the mismatch between the number of "Start checkpointing for replication" and "Done checkpointing for replication" entries in the logs, and to the growing number of go-routines in the stack trace shown above.

      Checkpoint manager, in a nutshell, gathers data in the following way for a batch of vbuckets (a simplified sketch follows the list):

      1. Gets the throughSeqno.
      2. Acquires the single lock to gain access to the target memcached client.
      3. One at a time, retrieves each target VB's VBUuid and high seqno.
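
      Below is a simplified model of those three steps (hypothetical types and names, not the actual goxdcr code). The key detail is that the throughSeqno is captured before the shared target-client lock is taken, and the new record is prepended without being compared against the current newest record:

      package main

      import (
          "fmt"
          "sync"
      )

      // ckptRecord and ckptManager are hypothetical stand-ins for the goxdcr types.
      type ckptRecord struct {
          sourceThroughSeqno uint64
          targetHighSeqno    uint64
      }

      type ckptManager struct {
          targetClientMtx sync.RWMutex
          records         []ckptRecord

          getThroughSeqno                 func() uint64 // step 1 (source side)
          getHighSeqnoAndVBUuidFromTarget func() uint64 // steps 2-3 (target side, under the lock)
      }

      // performCkpt mirrors the three steps above: the throughSeqno is captured
      // *before* the shared target-client lock is taken, so by the time the lock
      // is finally acquired the value may be arbitrarily stale.
      func (c *ckptManager) performCkpt() {
          throughSeqno := c.getThroughSeqno() // 1. capture the source throughSeqno

          c.targetClientMtx.Lock() // 2. single lock guarding the target client
          defer c.targetClientMtx.Unlock()

          targetSeqno := c.getHighSeqnoAndVBUuidFromTarget() // 3. fetch the target high seqno

          // Newest record is prepended; nothing checks it against the current head.
          c.records = append([]ckptRecord{{throughSeqno, targetSeqno}}, c.records...)
      }

      func main() {
          src, tgt := uint64(0), uint64(0)
          mgr := &ckptManager{
              getThroughSeqno:                 func() uint64 { src++; return src },
              getHighSeqnoAndVBUuidFromTarget: func() uint64 { tgt++; return tgt },
          }
          mgr.performCkpt()
          fmt.Println(mgr.records) // [{1 1}]
      }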

      When multiple go-routines build up, the following can happen. Assume the source throughSeqno increments from 1 to 2 to 3, etc., and the target seqno increments from A to B to C, etc.

      GoRoutine 1:

      1. Gets the throughSeqno of 1
      2. Acquires the single lock
      3. Due to a slow target, gets stuck inside the lock for a long time.
        (Note that time_committed does not account for the RPC call, just the local simple_store persistence time.)

      In the meantime, CkptMgr launches more go-routines:

      GoRoutine 2:

      1. Gets the throughSeqno of 10. (At this point, the target high seqno theoretically is B).
      2. Tries to acquire the lock and is blocked

      <Some time passes>

      GoRoutine 3:

      1. Gets the throughSeqno of 100 (At this point, the target high seqno is theoretically C).
      2. Tries to acquire the lock and is blocked

      At this point, GoRoutine 1 finishes. The checkpoint now consists of one record (simplified):

      CkptRecords:
      <Source, Target>
      [
         <1, A>
      ]
      

      GoRoutines 2 and 3 fight for the lock, and GoRoutine 3 gets it.
      GoRoutine 3 acquires a target highSeqno of D, where D comes after C.

      CkptRecords (More recent on top):
      <Source, Target>
      [
         <100, D>
         <1, A>
      ]
      

      GoRoutine 2 finally gets the lock and acquires a target highSeqno of F, where F comes after D.

      CkptRecords (More recent on top):
      <Source, Target>
      [
         <10, F>
         <100, D>
         <1, A>
      ]
      

      Notice how the checkpoint is now incorrect: the chronological order on the source side is broken. An "older" record further down the list (e.g. <100, D>) no longer necessarily represents an earlier point in time on the source, although it does still represent an earlier point in time on the target.
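
      As a quick way to see the broken ordering, the sketch below (letters mapped to numbers for illustration: A=1, D=4, F=6) walks the newest-first record list from the example and confirms that the source seqnos are no longer monotonically decreasing while the target seqnos still are:

      package main

      import "fmt"

      // ckptRecord mirrors the simplified <source, target> pairs above.
      type ckptRecord struct {
          sourceSeqno uint64
          targetSeqno uint64
      }

      // isMonotonic reports whether a newest-first record list never increases as
      // we walk toward older records, i.e. whether "older" really means "earlier".
      func isMonotonic(records []ckptRecord, key func(ckptRecord) uint64) bool {
          for i := 1; i < len(records); i++ {
              if key(records[i]) > key(records[i-1]) {
                  return false
              }
          }
          return true
      }

      func main() {
          // Final state from the walkthrough: <10, F>, <100, D>, <1, A>.
          records := []ckptRecord{{10, 6}, {100, 4}, {1, 1}}

          fmt.Println("source order intact:",
              isMonotonic(records, func(r ckptRecord) uint64 { return r.sourceSeqno })) // false
          fmt.Println("target order intact:",
              isMonotonic(records, func(r ckptRecord) uint64 { return r.targetSeqno })) // true
      }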

      At this point, I do not see this as a cause of data loss when a target VB fails over and gets a new VBUUID, but it may incur the extra cost of re-streaming data.

            People

              Assignee: neil.huang (Neil Huang)
              Reporter: neil.huang (Neil Huang)
