6.6.0, 6.6.1, 6.6.2, 6.6.3, 6.6.4, 6.6.5, 7.0.0, 7.0.1, 7.0.2, 7.0.3, 7.0.4, 7.1.0
Starting with server version 6.6.1, backups to EFS storage sometimes fail with:
(Cmd) Error backing up cluster: failed to execute cluster operations: failed to execute bucket operations: failed to transfer bucket data for bucket '<bucket_name>': failed to transfer key value data: failed to transfer key value data: failed to open vBucket: failed to open vBucket <vbucket_no>: failed to open vBucket <vbucket_no>: failed to open index: failed to set 'user_version': disk I/O error
On 7.0.0+ versions this error is a bit more verbose:
(Cmd) Error backing up cluster: failed to execute cluster operations: failed to execute bucket operations: failed to transfer bucket data for bucket '<bucket_name>': failed to transfer key value data: failed to transfer key value data: failed to open vBucket: failed to open vBucket <vbucket_no>: failed to open vBucket <vbucket_no>: failed to open index: failed to set 'user_version': disk I/O error: disk quota exceeded
- The backup progress bar slows down, after which the backup fails with the error mentioned above (or with several such errors for different vBuckets, which can be seen in the logs).
- After careful examination of the first symptom we established that the slowdown is caused by backup files (index and rift data) being opened and not closed by the cbbackupmgr process, which results in it acquiring more and more file handles for different vBuckets.
Acquiring file handles for files that correspond to more than one vBucket is a known cbbackupmgr quirk (it shouldn't happen when "synchronous backfill" is used) caused by "interleaving" on the KV-Engine side (best explained by Dave Rigby in a comment on MB-39503). It only happens for relatively small backups, and on normal disk storage it shouldn't cause any issues. However, EFS has a file lock limit, which can be reached if too many vBucket connections, and therefore files on EFS storage, are open at the same time. This is exactly what is happening in this case. You can confirm it by examining the number of files opened by the cbbackupmgr process just before the error above is thrown, using:
sudo lsof -c cbbackupmgr -r1
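The same check can be scripted if you want to log the count over time rather than watch lsof output. The following is a minimal sketch, assuming a Linux host where /proc/<pid>/fd is readable for the cbbackupmgr process; the function name, the `proc_root` parameter and any paths passed to it are hypothetical and not part of cbbackupmgr. Note that it counts open files under the backup directory, which is only a proxy for the number of file locks held.

```python
import os


def open_files_under(pid, directory, proc_root="/proc"):
    """List files under `directory` that process `pid` currently holds open.

    Assumes the Linux /proc layout, where each entry in <proc_root>/<pid>/fd
    is a symlink to the file behind that descriptor.
    """
    fd_dir = os.path.join(proc_root, str(pid), "fd")
    # Trailing separator so "/mnt/efs/backup" does not match "/mnt/efs/backup2".
    prefix = os.path.join(os.path.abspath(directory), "")
    held = []
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            # The descriptor was closed between listdir() and readlink().
            continue
        if target.startswith(prefix):
            held.append(target)
    return sorted(held)
```

Calling `len(open_files_under(pid, backup_dir))` once per second during a backup should show the count climbing shortly before the failure if the explanation above applies.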
After searching for the build that introduced the issue, we established that it first appears in 6.6.1-9153. The only relevant change between builds 9152 and 9153 is an update of the gocbcore (Couchbase Go SDK) revision in the build manifest, so we are sure that this is caused by https://review.couchbase.org/c/gocbcore/+/135469, which is a bug fix for GOCBC-984 (there is no suitable link type for this, so I am adding it as "relates to").
Build couchbase-server-6.6.1-9153 contains gocbcore commit 00c424f with commit message:
GOCBC-984: Fixed DCP backpressure to not block the client.
This patch changes the channel through which DCP packets are sent back to the DCP client from an unbuffered channel to a buffered one, so sends on it no longer block.
We don't currently think we can solve this on the cbbackupmgr side, but we are going to discuss potential solutions with the SDK team. In the meantime, the potential workarounds are:
- Increasing the value supplied for the "threads" parameter flag when performing a backup (this could help depending on the size of the backup and is not guaranteed to work)
- Backing up to regular disk storage and then copying the backup to the EFS storage
We can disable SQLite file locking by switching out the VFS for either unix-none or win32-none, both of which have no-op locking operations.
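To illustrate the SQLite mechanism only (cbbackupmgr itself is written in Go, and this is not its code), here is a minimal Python sketch that opens a database through the built-in unix-none VFS using SQLite's `vfs` URI parameter and then sets `user_version`, the operation that fails in the error above. The function names and the file name used in the usage note are hypothetical.

```python
import sqlite3
import urllib.parse


def open_without_locking(path):
    """Open (or create) the SQLite database at `path` via the unix-none VFS.

    unix-none is a VFS shipped with SQLite on Unix builds whose locking
    methods are no-ops, so no file locks are requested from the filesystem.
    """
    uri = "file:" + urllib.parse.quote(path) + "?vfs=unix-none"
    return sqlite3.connect(uri, uri=True)


def set_user_version(conn, version):
    """Write the user_version header field (PRAGMA values cannot be bound)."""
    conn.execute("PRAGMA user_version = %d" % int(version))


def get_user_version(conn):
    """Read the user_version header field back."""
    return conn.execute("PRAGMA user_version").fetchone()[0]
```

For example, `conn = open_without_locking("/tmp/index.sqlite")` followed by `set_user_version(conn, 1)` performs the write without taking a lock. With locking disabled, nothing stops two writers from corrupting the file, so this is only reasonable if each database file is accessed by a single process at a time.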
1) Backup uses "synchronous backfill" when streaming via DCP
2) This causes each vBucket to be backfilled from disk in turn, rather than round-robin
3) We did this for backup to S3 (but it had benefits for EFS)
4) EFS only allows 256 file locks per process
5) We use SQLite, which in turn uses file locks
6) There'll be one lock per vBucket being streamed at once
7) Synchronous backfill allowed us to avoid streaming too many vBuckets at once (and not hit this limit)
8) Synchronous backfill has an interesting side effect (MB-39503), termed interlacing, where at the boundary between vBuckets you could have multiple open at once
9) Prior to GOCBC-984, this worked very well; the fix for GOCBC-984 was to use a buffered channel for the DCP buffer queue
10) With this change, we're more exposed to being affected by this interlacing (this appears to be due to the way the unbuffered channel affects DCP traffic)
11) As a result, we're seeing a customer hit the 256 lock limit on EFS (which wasn't an issue on 6.6.0 - confirmed by another customer)
12) This should generally only affect relatively small backups (larger backups are less prone to interlacing)
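The arithmetic behind points 4) to 8) can be sketched with a toy model: one lock per vBucket that is open, and a peak that must stay at or below 256. This is not how cbbackupmgr or KV-Engine schedule streams; the event generators and the `overlap` parameter are hypothetical and exist only to show how overlap at vBucket boundaries raises the peak.

```python
EFS_LOCK_LIMIT = 256  # per-process limit quoted in point 4) above


def peak_open_vbuckets(events):
    """Return the largest number of vBuckets open at once.

    `events` is an iterable of ("open" | "close", vbucket_id) pairs.
    """
    open_now = set()
    peak = 0
    for action, vbucket in events:
        if action == "open":
            open_now.add(vbucket)
            peak = max(peak, len(open_now))
        elif action == "close":
            open_now.discard(vbucket)
        else:
            raise ValueError("unknown action: %r" % (action,))
    return peak


def strictly_sequential(vbuckets):
    """Ideal synchronous backfill: each vBucket closes before the next opens."""
    for vb in range(vbuckets):
        yield ("open", vb)
        yield ("close", vb)


def interlaced(vbuckets, overlap):
    """Hypothetical interlacing: a vBucket closes only after `overlap`
    later vBuckets have already been opened."""
    for vb in range(vbuckets):
        yield ("open", vb)
        if vb >= overlap:
            yield ("close", vb - overlap)
    for vb in range(max(0, vbuckets - overlap), vbuckets):
        yield ("close", vb)


def exceeds_limit(events, limit=EFS_LOCK_LIMIT):
    """True if the event stream would need more locks than `limit`."""
    return peak_open_vbuckets(events) > limit
```

In this model `exceeds_limit(strictly_sequential(1024))` is false because the peak is 1, while `exceeds_limit(interlaced(1024, 300))` is true because up to 301 vBuckets are open together. The figure of 1024 vBuckets is an assumption for the example, not something stated in this ticket.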