Couchbase Server / MB-42359

cbbackupmgr hangs on EOF error


    Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 6.6.0
    • Fix Version/s: 7.0.0
    • Component/s: tools
    • Labels: None
    • Triage: Untriaged
    • Story Points: 1
    • Is this a Regression?: Unknown

      Description

      While investigating a workaround for MB-42352, I came across this issue.

      Problem
      cbbackupmgr hangs when there is an EOF error:

      cbbackupmgr hangs

      /opt/couchbase/bin/cbbackupmgr restore -a backup -r zombie  -c 10.112.201.101 -u Administrator -p password
      (1/1) Restoring backup 2020-10-29T15_05_45.745791997Z '2020-10-29T15_05_45.745791997Z'
      Transferring key value data for 'test' at 101B/s (about 0s remaining)                                                      1 items / 28.20KB
      [================================================================================================================================== ] 99.81%
      

      backup.logs

      2020-10-29T18:16:57.018+00:00 WARN: (Pool) (test) Failed to send document with key '<ud>test</ud>' because of an unexpected EOF -- couchbase.(*MemcachedWorker).processOperation() at pool_worker.go:350
      2020-10-29T18:16:57.018+00:00 WARN: (Pool) (test) Failed to send document with key '<ud>expiry</ud>' because of an unexpected EOF -- couchbase.(*MemcachedWorker).processOperation() at pool_worker.go:350
      2020-10-29T18:18:56.987+00:00 WARN: (Pool) (test) Memcached has been inactive for 1m0s, last item count 1 -- couchbase.(*MemcachedWorker).monitorActivity.func1() at pool_worker.go:411
      2020-10-29T18:21:45.746+00:00 Signal `interrupt` received, exiting
      

      Steps to reproduce

      Create the backup using a 6.0.4 cluster and cbbackupmgr 6.6.0:
      1. Create a document with user xattrs and a 10-second TTL on Couchbase Server 6.0.3

       /opt/couchbase/bin/cbc-subdoc -U couchbase://localhost/test -u Administrator -P password
       subdoc> set test value -x xattr=100 -e 10
      

      2. Wait 10 seconds
      3. Take a backup

      /opt/couchbase/bin/cbbackupmgr config -a backup -r zombie
      /opt/couchbase/bin/cbbackupmgr backup -a backup -r zombie  -c 10.112.194.101 -u Administrator -p password
      

      For the restore, set up a cluster on Couchbase Server 6.5.1, still using cbbackupmgr 6.6.0:
      1. Change the allow_del_with_meta_prune_user_data config on the bucket

      /opt/couchbase/bin/cbepctl localhost:11210 -b test -u Administrator -p password set flush_param allow_del_with_meta_prune_user_data true
      

      2. Do the restore

      /opt/couchbase/bin/cbbackupmgr restore -a backup -r zombie  -c 10.112.201.101 -u Administrator -p password
      

      Expectations

      cbbackupmgr should not hang; it should surface the EOF error and exit.


            Activity

            pvarley Patrick Varley added a comment -

            Marking as Minor because this is pretty hard to trigger.

            james.lee James Lee added a comment -

            Patrick Varley,

            I've had a look into this and believe I know what's happening; I'm able to reproduce the issue under artificial circumstances (by modifying the code). If you are in fact hitting what I think you're hitting, this has already been fixed in master and should only be possible under pretty rare circumstances. Below is what I believe is happening:

            1) Start restoring a small amount of data
            1a) The worker pool is started (in your case with a single worker; this creates a buffered error channel with room for a single error)
            2) The archive begins reading data from disk
            3) We fire off all the mutations for the restore (in this case two)
            4) We check the (archive) error stream; it's empty (because we read everything from disk successfully)
            5) We begin the teardown process, informing the worker pool it should complete the rest of its work and exit cleanly
            6) The worker pool begins teardown
            6a) It handles the first mutation, we hit an error; it gets put in the error channel
            6b) It handles the second mutation, we hit another error; we block indefinitely attempting to put the error in the error channel

            To summarize, I believe this is a race condition that should only occur when the archive source has finished sending all of its mutations (without error) and has begun waiting for the worker pool to finish, at which point the final mutations hit enough errors to fill up the error channel before the pool can exit. Once we begin waiting, there aren't any threads reading from the error channel.
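
            A minimal, self-contained Go sketch of the blocking pattern described above (the names and structure are illustrative, not the actual cbbackupmgr code):

            package main

            import (
                "errors"
                "fmt"
                "time"
            )

            func main() {
                // Buffered error channel with room for a single error, as a
                // one-worker pool would create it (step 1a).
                errCh := make(chan error, 1)

                done := make(chan struct{})

                // The worker pool draining its remaining work during teardown (step 6).
                go func() {
                    defer close(done)

                    // 6a) First failed mutation: fits in the buffer.
                    errCh <- errors.New("unexpected EOF")

                    // 6b) Second failed mutation: the buffer is full and nothing is
                    // reading any more, so this send blocks forever.
                    errCh <- errors.New("unexpected EOF")
                }()

                // The coordinator has already checked the error stream (step 4) and
                // now only waits for the pool to exit; nothing reads errCh again.
                select {
                case <-done:
                    fmt.Println("pool exited cleanly")
                case <-time.After(2 * time.Second):
                    fmt.Println("deadlock: worker stuck sending its second error")
                }
            }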

            This has already been fixed in master because I did some work on the error propagation in CC where the buffered channel was changed to fit enough errors for every in-flight mutation to fail (one of these errors would then be propagated to the user once we'd finished teardown).
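
            For reference, a hedged sketch of the shape of that fix (identifiers like newErrorChannel and maxInFlight are hypothetical, not the actual change):

            package pool

            // newErrorChannel returns an error channel sized so that every in-flight
            // mutation can fail without blocking a worker during teardown.
            func newErrorChannel(maxInFlight int) chan error {
                return make(chan error, maxInFlight)
            }

            // firstError is called once the pool has exited; it returns the first
            // buffered error, if any, so it can be propagated to the user.
            func firstError(errCh chan error) error {
                select {
                case err := <-errCh:
                    return err
                default:
                    return nil
                }
            }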

            james.lee James Lee added a comment -

            Closing as Fixed, as I believe this issue was resolved by the patch for MB-41372, where I increased the size of the error channel.


              People

              Assignee:
              james.lee James Lee
              Reporter:
              pvarley Patrick Varley
              Votes:
              0
              Watchers:
              3

