Details

Type: Bug
Resolution: Fixed
Priority: Major
Fix Version/s: 7.1.1
Affects Version/s: 6.6.0, 6.6.1, 6.6.2, 6.6.3, 6.6.4, 6.6.5, 7.0.0, 7.0.1, 7.0.2, 7.0.3, 7.0.4, 7.1.0
Component/s: tools
Labels:
- approved-for-7.1.1
- candidate-for-7.1.1

Triage: Untriaged
Story Points: 1
Is this a Regression?: Unknown

Description

What is the issue?
Starting from server version 6.6.1, backups to EFS storage sometimes fail with:

(Cmd) Error backing up cluster: failed to execute cluster operations: failed to execute bucket operations: failed to transfer bucket data for bucket '<bucket_name>': failed to transfer key value data: failed to transfer key value data: failed to open vBucket: failed to open vBucket <vbucket_no>: failed to open vBucket <vbucket_no>: failed to open index: failed to set 'user_version': disk I/O error

On 7.0.0+ versions this error is a bit more verbose:

(Cmd) Error backing up cluster: failed to execute cluster operations: failed to execute bucket operations: failed to transfer bucket data for bucket '<bucket_name>': failed to transfer key value data: failed to transfer key value data: failed to open vBucket: failed to open vBucket <vbucket_no>: failed to open vBucket <vbucket_no>: failed to open index: failed to set 'user_version': disk I/O error: disk quota exceeded

Known symptoms:

- A slowdown in the backup progress bar, after which the backup fails with the error above (or with several such errors for different vBuckets, which can be seen in the logs).
- On closer examination of the first symptom, we established that the slowdown is caused by the cbbackupmgr process opening backup files (index and rift data) and not closing them, so it acquires more and more file handles for different vBuckets.

What is causing the issue?
Acquiring file handles for files that correspond to more than one vBucket is a known cbbackupmgr quirk (this shouldn't happen when "synchronous backfill" is used) caused by "interleaving" on the KV-Engine side (best explained by Dave Rigby in a comment on ~~MB-39503~~). It only happens for relatively small backups, and on normal disk storage it shouldn't cause any issues. However, EFS has a file lock limit, which can be reached if too many vBucket connections, and therefore files on EFS storage, are open at the same time. That is exactly what is happening in this case. You can confirm it by examining the number of files opened by the cbbackupmgr process just before the error above is thrown, using:

sudo lsof -c cbbackupmgr -r1
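
To watch a single number rather than the full lsof listing, a minimal sketch along the following lines may help. It is Linux-only, assumes /proc is mounted, and counts open file descriptors, which is only a proxy for the number of file locks held. The helper name count_open_files is hypothetical and not part of any Couchbase tool.

```python
import os


def count_open_files(pid: int) -> int:
    """Return the number of file descriptors currently open in process `pid`.

    Linux-specific: every entry in /proc/<pid>/fd is one open descriptor.
    Inspecting another user's process usually needs root, like `sudo lsof`.
    """
    return len(os.listdir(f"/proc/{pid}/fd"))


if __name__ == "__main__":
    # Demonstrated on the current process; for a real check, pass the PID
    # of the running cbbackupmgr process instead.
    print(count_open_files(os.getpid()))
```

Polling this value once per second during a backup should show the same growth that the lsof command above reveals.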

What is the root cause?
After searching for the build that introduced the issue, we established that it was introduced in 6.6.1-9153. The only relevant change between builds 9152 and 9153 is an update of the gocbcore (Couchbase Go SDK) revision in the build manifest, so we are sure that this is caused by https://review.couchbase.org/c/gocbcore/+/135469, which is a bug fix for ~~GOCBC-984~~ (there is no suitable link type for this, so I am linking it as "relates to").

Build couchbase-server-6.6.1-9153 contains gocbcore commit 00c424f with commit message:

GOCBC-984: Fixed DCP backpressure to not block the client.

This patch changes the channel through which DCP packets are sent back to the DCP client from an unbuffered one to a buffered one, so that sending on it is no longer blocking.
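
The effect of adding a buffer can be illustrated outside Go. The sketch below is only an analogy built on Python's standard queue.Queue, not gocbcore code: a queue with capacity N lets a producer enqueue N packets with no consumer present, whereas a send on an unbuffered Go channel waits until a receiver is ready. The helper name run_ahead and the capacities used are assumptions chosen for illustration.

```python
import queue


def run_ahead(capacity: int, packets: int) -> int:
    """Return how many packets a producer can enqueue before it would block,
    given that no consumer is draining the queue.

    `capacity` must be at least 1, because queue.Queue treats 0 as unbounded.
    """
    if capacity < 1:
        raise ValueError("capacity must be >= 1")
    q = queue.Queue(maxsize=capacity)
    sent = 0
    for packet in range(packets):
        try:
            q.put_nowait(packet)  # a blocking put() would stall here when full
        except queue.Full:
            break
        sent += 1
    return sent
```

With a capacity of 1 the producer gets one packet ahead of the consumer; with a capacity of 64 it can get 64 ahead, which is the kind of extra slack a buffered channel introduces.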

What's the workaround?
We currently don't think we can solve this on the cbbackupmgr side, but we are going to discuss potential solutions with the SDK team. The potential workarounds are:

- Increasing the value supplied for the "threads" flag when performing a backup (this may help depending on the size of the backup, but is not guaranteed to work)
- Backing up to regular disk storage and then copying the backup to the EFS storage

What's the fix?
We can disable SQLite file locking by switching the VFS to either unix-none or win32-none, both of which have no-op locking operations.
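
As an illustration of the VFS switch, the sketch below opens a SQLite database with the no-locking VFS through a SQLite URI filename, using Python's standard sqlite3 module. It is not the cbbackupmgr change itself (cbbackupmgr is written in Go); the helper name open_without_locking is hypothetical. Skipping locks is only safe when a single connection accesses the file.

```python
import sqlite3
import sys
from pathlib import Path

# SQLite's "unix-none" (Unix) and "win32-none" (Windows) VFS variants
# implement the locking methods as no-ops.
NO_LOCK_VFS = "win32-none" if sys.platform == "win32" else "unix-none"


def open_without_locking(path: str) -> sqlite3.Connection:
    """Open (or create) a SQLite database through a VFS that takes no file locks."""
    # Build a file: URI and select the VFS with the standard `vfs` parameter.
    uri = f"{Path(path).resolve().as_uri()}?vfs={NO_LOCK_VFS}"
    return sqlite3.connect(uri, uri=True)
```

A connection opened this way can run the same kind of statement that failed in the error above, for example setting and reading back `PRAGMA user_version`.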

TL;DR with some background context:
1) Backup uses "synchronous backfill" when streaming via DCP
2) This causes each vBucket to be backfilled from disk in turn, rather than round-robin
3) We did this for backup to S3 (but it had benefits for EFS)
4) EFS only allows 256 file locks per process
5) We use SQLite, which in turn uses file locks
6) There'll be one lock per vBucket being streamed at once
7) Synchronous backfill allowed us to avoid streaming too many vBuckets at once (and not hit this limit)
8) Synchronous backfill has an interesting side effect (~~MB-39503~~), termed "interlacing", where at the boundary between vBuckets you could have multiple open at once
9) Prior to ~~GOCBC-984~~, this worked very well; the fix for ~~GOCBC-984~~ was to use a buffered channel for the DCP buffer queue
10) With this change, we're more open to being affected by this interlacing (this appears to be due to the way the unbuffered channel affects DCP traffic)
11) As a result, we're seeing a customer hit the 256 lock limit on EFS (which wasn't an issue on 6.6.0 - confirmed by another customer)
12) This should generally only affect relatively small backups (larger backups are less prone to interlacing)
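
The relationship between interlacing and the lock limit in the list above can be sketched as a toy model. Everything here apart from the 256 limit is an assumption made for illustration: the event format, the helper name peak_open_vbuckets, and the rule that a vBucket's file is opened on its first mutation and closed when its stream ends.

```python
EFS_LOCK_LIMIT = 256  # per-process file lock limit quoted in the list above


def peak_open_vbuckets(events) -> int:
    """Return the peak number of vBuckets open at the same time.

    `events` is an iterable of (kind, vbucket_id) pairs, where kind is
    "mutation" or "end". Assumed model: a vBucket is opened by its first
    mutation and closed by its "end" event.
    """
    open_vbs = set()
    peak = 0
    for kind, vb in events:
        if kind == "mutation":
            open_vbs.add(vb)
            peak = max(peak, len(open_vbs))
        elif kind == "end":
            open_vbs.discard(vb)
    return peak


# Strictly sequential streams: only one vBucket is ever open.
sequential = [("mutation", 0), ("end", 0), ("mutation", 1), ("end", 1)]

# Interlaced streams: vBucket 1 starts before vBucket 0 has ended.
interlaced = [("mutation", 0), ("mutation", 1), ("end", 0), ("end", 1)]
```

In this model the backup stays safe while the peak is below EFS_LOCK_LIMIT; the more vBuckets overlap at stream boundaries, the closer the peak gets to the limit.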

Attachments

Issue Links

is duplicated by

- MB-51937 [CBM] [EFS] Disable SQLite file locking (Closed)

relates to

- GOCBC-984 DCP backpressure can cause broken client (Resolved)
- MB-39503 Receiving interlaced vBucket mutations when using sequential backfill (Closed)

Activity

People

Assignee: Gilad Kalchheim

Reporter: Maksimiljans Januska

Votes: 0

Watchers: 6

Dates

Created: 08/Apr/22 2:00 AM

Updated: 20/Jun/22 10:38 PM

Resolved: 06/May/22 3:17 AM

Gerrit Reviews

There are no open Gerrit changes

There are 2 closed Gerrit changes:

- MB-51772 Disable SQLite file locking on Unix/Windows
- Merge branch 'neo' into master

[CBM] Backup to EFS storage fails with "disk I/O error"
