Couchbase Server
MB-51772

[CBM] Backup to EFS storage fails with "disk I/O error"



    Description

      What is the issue?
      Starting from server version 6.6.1, backups to EFS storage sometimes fail with:

      (Cmd) Error backing up cluster: failed to execute cluster operations: failed to execute bucket operations: failed to transfer bucket data for bucket '<bucket_name>': failed to transfer key value data: failed to transfer key value data: failed to open vBucket: failed to open vBucket <vbucket_no>: failed to open vBucket <vbucket_no>: failed to open index: failed to set 'user_version': disk I/O error
      

      On 7.0.0 and later, this error is a bit more verbose:

      (Cmd) Error backing up cluster: failed to execute cluster operations: failed to execute bucket operations: failed to transfer bucket data for bucket '<bucket_name>': failed to transfer key value data: failed to transfer key value data: failed to open vBucket: failed to open vBucket <vbucket_no>: failed to open vBucket <vbucket_no>: failed to open index: failed to set 'user_version': disk I/O error: disk quota exceeded
      

      Known symptoms:

      • A slowdown in the backup progress bar, after which the backup fails with the error mentioned above (the logs may show it several times, for different vBuckets).
      • Closer examination of the first symptom showed that the slowdown is caused by backup files (index and rift data) being opened but never closed by the cbbackupmgr process, so it accumulates more and more file handles across vBuckets.

      What is causing the issue?
      Acquiring file handles for files that belong to more than one vBucket at a time is a known cbbackupmgr quirk (this shouldn't happen when "synchronous backfill" is used) caused by "interleaving" on the KV-Engine side of things (best explained by Dave Rigby in a comment on MB-39503). This only happens for relatively small backups, and on normal disk storage it causes no issues. EFS, however, has a per-process file lock limit, which can be reached if too many vBucket connections, and therefore files on EFS storage, are open at the same time. That is exactly what is happening in this case; you can confirm it by examining the number of files the cbbackupmgr process has open just before the error above is thrown, using:

      sudo lsof -c cbbackupmgr -r1   # -c matches the command name, -r1 repeats the listing every second
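
      As an alternative to lsof, you can count the entries under /proc/<pid>/fd directly. The following standalone Go sketch (a hypothetical helper, not part of cbbackupmgr) performs the same check; run it as root or as the user that owns the cbbackupmgr process:

      // fdcount reports how many file descriptors a process currently
      // holds by listing /proc/<pid>/fd (Linux only).
      package main

      import (
          "fmt"
          "os"
      )

      func main() {
          if len(os.Args) != 2 {
              fmt.Fprintln(os.Stderr, "usage: fdcount <pid>")
              os.Exit(1)
          }

          // Each symlink under /proc/<pid>/fd is one open descriptor.
          entries, err := os.ReadDir("/proc/" + os.Args[1] + "/fd")
          if err != nil {
              fmt.Fprintln(os.Stderr, err)
              os.Exit(1)
          }

          fmt.Printf("pid %s holds %d open file descriptors\n", os.Args[1], len(entries))
      }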
      

      What is the root cause?
      After bisecting builds to find where the issue was introduced, we established that it first appears in 6.6.1-9153. The only relevant change between builds 9152 and 9153 is an update of the gocbcore (Couchbase Go SDK) revision in the build manifest, so we are confident this is caused by https://review.couchbase.org/c/gocbcore/+/135469, a bug fix for GOCBC-984 (there is no suitable link type for this, so I am putting it as "relates to").

      Build couchbase-server-6.6.1-9153 contains gocbcore commit 00c424f with commit message:
      GOCBC-984: Fixed DCP backpressure to not block the client.
      

      This patch changes the channel through which DCP packets are sent back to the DCP client from an unbuffered channel to a buffered one, which means sends to it no longer block.
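
      To illustrate the difference, here is a minimal standalone Go sketch of buffered versus unbuffered channel semantics (this is not the gocbcore code itself; dcpPacket is a made-up stand-in type):

      package main

      import "fmt"

      // dcpPacket is a made-up stand-in for the packets gocbcore queues
      // for the DCP client.
      type dcpPacket struct{ seqno uint64 }

      func main() {
          // Post-GOCBC-984 behaviour: a buffered channel accepts packets
          // even when no receiver is ready, so the producer never blocks
          // while there is spare capacity.
          buffered := make(chan dcpPacket, 8)
          for i := uint64(0); i < 8; i++ {
              buffered <- dcpPacket{seqno: i} // returns immediately
          }
          fmt.Println("queued", len(buffered), "packets with no receiver")

          // Pre-GOCBC-984 behaviour: a send on an unbuffered channel
          // blocks until the consumer receives it, tying the producer's
          // progress to the consumer's (natural backpressure).
          unbuffered := make(chan dcpPacket)
          go func() {
              for i := uint64(0); i < 8; i++ {
                  unbuffered <- dcpPacket{seqno: i} // blocks per packet
              }
              close(unbuffered)
          }()
          for range unbuffered {
              // Each receive here unblocks exactly one send above.
          }
          fmt.Println("unbuffered sends completed only as fast as the consumer")
      }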

      What's the workaround?
      We don't think we can solve this on the cbbackupmgr side, but we are going to discuss potential solutions with the SDK team. For now, the potential workarounds are:

      1. Increasing the value supplied via the "threads" flag when performing a backup (depending on the size of the backup this may help, but it is not guaranteed to work)
      2. Backing up to regular disk storage and then copying the backup to the EFS storage

      What's the fix?
      We can disable SQLite file locking by switching out the VFS for either unix-none or win32-none, which have no-op locking operations.
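
      A minimal Go sketch of what that looks like, assuming the mattn/go-sqlite3 driver (a stand-in here; the driver cbbackupmgr actually uses may differ) and that it passes the vfs= URI parameter through to SQLite:

      package main

      import (
          "database/sql"
          "fmt"

          _ "github.com/mattn/go-sqlite3"
      )

      func main() {
          // "unix-none" is a stock SQLite VFS whose locking methods are
          // no-ops, so no POSIX file locks are taken on EFS; the Windows
          // equivalent is "win32-none". This is only safe because each
          // backup file is written by a single cbbackupmgr process.
          db, err := sql.Open("sqlite3", "file:backup.sqlite?vfs=unix-none")
          if err != nil {
              panic(err)
          }
          defer db.Close()

          // The statement that failed in the original error.
          if _, err := db.Exec("PRAGMA user_version = 1"); err != nil {
              panic(err)
          }

          fmt.Println("user_version set without taking any file locks")
      }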

      TL;DR with some background context:
      1) Backup uses "synchronous backfill" when streaming via DCP
      2) This causes each vBucket to be backfilled from disk in turn, rather than round-robin
      3) We did this for backup to S3 (but it had benefits for EFS)
      4) EFS only allows 256 file locks per process
      5) We use SQLite, which in turn uses file locks
      6) There'll be one lock per vBucket being streamed at once
      7) Synchronous backfill allowed us to avoid streaming too many vBuckets at once (and not hit this limit)
      8) Synchronous backfill has an interesting side effect (MB-39503) termed "interleaving", where at the boundary between vBuckets you can have multiple vBuckets open at once
      9) Prior to GOCBC-984, this worked very well; the fix for GOCBC-984 was to use a buffered channel for the DCP buffer queue
      10) With this change, we're more exposed to this interleaving (this appears to be due to the backpressure that the unbuffered channel previously applied to DCP traffic)
      11) As a result, we're seeing a customer hit the 256 lock limit on EFS (which wasn't an issue on 6.6.0 - confirmed by another customer)
      12) This should generally only affect relatively small backups (larger backups are less prone to interleaving)
