  Couchbase Server / MB-51772

[CBM] Backup to EFS storage fails with "disk I/O error"


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 6.6.0, 6.6.1, 6.6.2, 6.6.3, 6.6.4, 6.6.5, 7.0.0, 7.0.1, 7.0.2, 7.0.3, 7.0.4, 7.1.0
    • Fix Version/s: 7.1.1
    • Component/s: tools
    • Triage: Untriaged
    • 1
    • Unknown

    Description

      What is the issue?
      Backups to EFS storage sometimes fail on server versions 6.6.1 and later with

      (Cmd) Error backing up cluster: failed to execute cluster operations: failed to execute bucket operations: failed to transfer bucket data for bucket '<bucket_name>': failed to transfer key value data: failed to transfer key value data: failed to open vBucket: failed to open vBucket <vbucket_no>: failed to open vBucket <vbucket_no>: failed to open index: failed to set 'user_version': disk I/O error
      

      On 7.0.0 and later this error is a bit more verbose:

      (Cmd) Error backing up cluster: failed to execute cluster operations: failed to execute bucket operations: failed to transfer bucket data for bucket '<bucket_name>': failed to transfer key value data: failed to transfer key value data: failed to open vBucket: failed to open vBucket <vbucket_no>: failed to open vBucket <vbucket_no>: failed to open index: failed to set 'user_version': disk I/O error: disk quota exceeded
      

      Known symptoms:

      • The backup progress bar slows down, after which the backup fails with the error above (or with several such errors for different vBuckets, visible in the logs).
      • Closer examination of the first symptom showed that the slowdown is caused by backup files (index and rift data) being opened but not closed by the cbbackupmgr process, so it accumulates more and more file handles for different vBuckets.

      What is causing the issue?
      Acquiring file handles for files that belong to more than one vBucket is a known cbbackupmgr quirk (it should not happen when "synchronous backfill" is used) caused by "interleaving" on the KV-Engine side (best explained by Dave Rigby in a comment on MB-39503). It only happens for relatively small backups, and on normal disk storage it should not cause any issues. However, EFS has a per-process file lock limit, which can be reached if too many vBucket connections, and therefore files on EFS storage, are open at the same time. That is exactly what is happening here; it can be confirmed by watching the number of files the cbbackupmgr process has open just before the error above is thrown, using

      sudo lsof -c cbbackupmgr -r1
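
      As an alternative to eyeballing the lsof output, the count can be polled programmatically. The following standalone Go sketch (a hypothetical helper, not part of cbbackupmgr, and Linux-only since it reads /proc) reports how many file descriptors a process is holding each second:

      // fdwatch.go - polls the number of open file descriptors held by a PID.
      // Hypothetical monitoring helper, not part of cbbackupmgr; Linux-only.
      package main

      import (
          "fmt"
          "os"
          "time"
      )

      func main() {
          if len(os.Args) != 2 {
              fmt.Fprintln(os.Stderr, "usage: fdwatch <pid>")
              os.Exit(1)
          }
          pid := os.Args[1]

          for {
              // Every entry in /proc/<pid>/fd is one open file descriptor.
              entries, err := os.ReadDir("/proc/" + pid + "/fd")
              if err != nil {
                  fmt.Fprintln(os.Stderr, "failed to read fd table:", err)
                  os.Exit(1)
              }
              fmt.Printf("%s open file descriptors: %d\n",
                  time.Now().Format(time.RFC3339), len(entries))
              time.Sleep(time.Second)
          }
      }

      Running it against the backup process (for example with "go run fdwatch.go $(pgrep cbbackupmgr)") should show a steadily climbing count just before the failure, matching the symptom described above.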
      

      What is the root cause?
      Bisecting the builds showed that the issue was introduced in 6.6.1-9153. Since the only relevant change between builds 9152 and 9153 is an update of the gocbcore (Couchbase Go SDK) revision in the build manifest, we are confident that this is caused by https://review.couchbase.org/c/gocbcore/+/135469, the bug fix for GOCBC-984 (there is no suitable link type for this, so I am linking it as "relates to").

      Build couchbase-server-6.6.1-9153 contains gocbcore commit 00c424f with commit message:
      GOCBC-984: Fixed DCP backpressure to not block the client.
      

      This patch changes the channel through which DCP packets are sent back to the DCP client from unbuffered to buffered, so sends on it no longer block.
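
      To make the behavioural difference concrete, here is a minimal, self-contained Go sketch (illustrative only, not gocbcore code) of why switching from an unbuffered to a buffered channel removes the back-pressure that previously throttled the producer to the consumer's pace:

      // Minimal illustration of the back-pressure difference between an
      // unbuffered and a buffered channel. Not gocbcore code; it only mirrors
      // the shape of the change described above.
      package main

      import (
          "fmt"
          "time"
      )

      // produce sends n "packets" into ch and reports how long that took.
      func produce(name string, ch chan int, n int) {
          start := time.Now()
          for i := 0; i < n; i++ {
              ch <- i // blocks only while the channel has no free capacity
          }
          fmt.Printf("%s: producer finished after %v\n", name, time.Since(start))
      }

      // consume drains ch slowly, simulating a client that persists each packet
      // to (slow) storage before asking for the next one.
      func consume(ch chan int) {
          for range ch {
              time.Sleep(10 * time.Millisecond)
          }
      }

      func main() {
          // Unbuffered: every send waits for the consumer, so the producer is
          // throttled to the consumer's pace (the pre-GOCBC-984 behaviour).
          unbuffered := make(chan int)
          go consume(unbuffered)
          produce("unbuffered", unbuffered, 20)
          close(unbuffered)

          // Buffered: sends complete immediately while capacity remains, so the
          // producer can run far ahead of the consumer (the post-fix behaviour
          // that makes interleaving across vBuckets more likely).
          buffered := make(chan int, 20)
          go consume(buffered)
          produce("buffered", buffered, 20)
          close(buffered)
      }

      With the unbuffered channel the producer takes roughly as long as the consumer; with the buffered one it finishes almost immediately, which is the loss of throttling that exposes the interleaving described above.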

      What's the workaround?
      We do not currently think this can be solved on the cbbackupmgr side, but we are going to discuss potential solutions with the SDK team. In the meantime, the potential workarounds are:

      1. Increasing the value supplied to the "threads" flag when performing a backup (this may help depending on the size of the backup, but is not guaranteed to work)
      2. Backing up to regular disk storage and then copying the backup to EFS storage

      What's the fix?
      We can disable SQLite file locking by switching the VFS to either unix-none or win32-none, which have no-op locking operations.
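
      As a rough sketch of what this looks like from Go, assuming a SQLite driver that passes SQLite URI parameters through (mattn/go-sqlite3 does for "file:" DSNs; cbbackupmgr's actual storage layer may wire this up differently), the no-locking VFS can be selected when the index is opened:

      // Sketch: open a SQLite database with the "unix-none" VFS, whose locking
      // operations are no-ops, so no POSIX file locks are taken on EFS.
      // Assumes the mattn/go-sqlite3 driver and SQLite URI filenames; this is
      // not cbbackupmgr's real storage code.
      package main

      import (
          "database/sql"
          "fmt"
          "log"

          _ "github.com/mattn/go-sqlite3"
      )

      func main() {
          // With vfs=unix-none SQLite skips file locking entirely, so the
          // process itself must guarantee there are no concurrent writers.
          db, err := sql.Open("sqlite3", "file:vbucket.sqlite?vfs=unix-none")
          if err != nil {
              log.Fatalf("failed to open database: %v", err)
          }
          defer db.Close()

          // The same operation that fails in the error above.
          if _, err := db.Exec("PRAGMA user_version = 1"); err != nil {
              log.Fatalf("failed to set 'user_version': %v", err)
          }

          fmt.Println("opened index and set user_version without taking file locks")
      }

      The trade-off of a no-op locking VFS is that the application becomes solely responsible for ensuring nothing else writes the file concurrently, which is effectively already the case for an archive written by a single cbbackupmgr process.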

      TL;DR with some background context:
      1) Backup uses "synchronous backfill" when streaming via DCP
      2) This causes each vBucket to be backfilled from disk in turn, rather than round-robin
      3) We did this for backup to S3 (but it had benefits for EFS)
      4) EFS only allows 256 file locks per process
      5) We use SQLite, which in turn uses file locks
      6) There will be one lock per vBucket being streamed at once
      7) Synchronous backfill allowed us to avoid streaming too many vBuckets at once, and therefore to avoid hitting this limit (sketched below)
      8) Synchronous backfill has an interesting side effect (MB-39503) termed interlacing, where at the boundary between vBuckets you can have multiple open at once
      9) Prior to GOCBC-984 this worked very well; the fix for GOCBC-984 was to use a buffered channel for the DCP buffer queue
      10) With this change we are more exposed to this interlacing (this appears to be due to the way the unbuffered channel affected DCP traffic)
      11) As a result, we are seeing a customer hit the 256 lock limit on EFS (which was not an issue on 6.6.0, as confirmed by another customer)
      12) This should generally only affect relatively small backups (larger backups are less prone to interlacing)
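
      The lock arithmetic in points 4-7 can be pictured with a small Go sketch (purely illustrative; this is not how cbbackupmgr is structured internally). If each in-flight vBucket holds one SQLite file lock while its files are open, then bounding the number of concurrent streams with a counting semaphore also bounds the number of EFS locks held at any instant:

      // Illustration of points 4-7: if each vBucket stream holds one file lock
      // while it is open, bounding concurrent streams also bounds the number of
      // locks, keeping the process under the EFS per-process limit.
      // Purely a sketch, not cbbackupmgr's internal structure.
      package main

      import (
          "fmt"
          "sync"
      )

      const (
          numVBuckets   = 1024 // typical Couchbase vBucket count
          efsLockLimit  = 256  // EFS per-process file lock limit
          maxConcurrent = 64   // vBucket streams (and therefore locks) allowed at once
      )

      func main() {
          sem := make(chan struct{}, maxConcurrent) // counting semaphore

          var (
              mu       sync.Mutex
              inFlight int
              peak     int
              wg       sync.WaitGroup
          )

          for vb := 0; vb < numVBuckets; vb++ {
              wg.Add(1)
              sem <- struct{}{} // take a slot before "opening" the vBucket's files
              go func() {
                  defer wg.Done()
                  defer func() { <-sem }() // give the slot back once the vBucket is done

                  mu.Lock()
                  inFlight++
                  if inFlight > peak {
                      peak = inFlight
                  }
                  mu.Unlock()

                  // ... stream and persist this vBucket's data here ...

                  mu.Lock()
                  inFlight--
                  mu.Unlock()
              }()
          }

          wg.Wait()
          fmt.Printf("peak concurrent vBucket streams (file locks): %d, EFS limit: %d\n",
              peak, efsLockLimit)
      }

      Synchronous backfill plays the role of the semaphore in practice; interlacing at vBucket boundaries is what lets the effective concurrency, and therefore the lock count, creep above the intended bound.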

      Attachments

        Issue Links


          Activity

            maks.januska Maksimiljans Januska created issue -
            maks.januska Maksimiljans Januska made changes -
            Fix Version/s 7.0.0 [ 17233 ]
            Fix Version/s Morpheus [ 17651 ]
            maks.januska Maksimiljans Januska made changes -
            Affects Version/s 7.1 [ 18333 ]
            Affects Version/s 7.0.3 [ 18033 ]
            Affects Version/s 7.0.2 [ 18012 ]
            Affects Version/s 7.0.1 [ 17104 ]
            maks.januska Maksimiljans Januska made changes -
            Link This issue causes CBSE-11692 [ CBSE-11692 ]
            maks.januska Maksimiljans Januska made changes -
            Link This issue is caused by GOCBC-984 [ GOCBC-984 ]
            maks.januska Maksimiljans Januska made changes -
            Link This issue relates to MB-39503 [ MB-39503 ]
            maks.januska Maksimiljans Januska made changes -
            Link This issue causes CBSE-11692 [ CBSE-11692 ]
            maks.januska Maksimiljans Januska made changes -
            Link This issue blocks CBSE-11692 [ CBSE-11692 ]
            maks.januska Maksimiljans Januska made changes -
            Link This issue is caused by GOCBC-984 [ GOCBC-984 ]
            maks.januska Maksimiljans Januska made changes -
            Link This issue is triggered by GOCBC-984 [ GOCBC-984 ]
            maks.januska Maksimiljans Januska made changes -
            Link This issue is triggered by GOCBC-984 [ GOCBC-984 ]
            maks.januska Maksimiljans Januska made changes -
            Link This issue is caused by GOCBC-984 [ GOCBC-984 ]
            maks.januska Maksimiljans Januska made changes -
            Link This issue is caused by GOCBC-984 [ GOCBC-984 ]
            maks.januska Maksimiljans Januska made changes -
            Link This issue relates to GOCBC-984 [ GOCBC-984 ]
            maks.januska Maksimiljans Januska made changes -
            Description
            james.lee James Lee made changes -
            Description +What is the issue?+
            Starting from server version 6.6.1, backups to EFS storage sometimes fail with:
            {noformat}(Cmd) Error backing up cluster: failed to execute cluster operations: failed to execute bucket operations: failed to transfer bucket data for bucket '<bucket_name>': failed to transfer key value data: failed to transfer key value data: failed to open vBucket: failed to open vBucket <vbucket_no>: failed to open vBucket <vucket_no>: failed to open index: failed to set 'user_version': disk I/O error
            {noformat}
            On 7.0.0+ versions this error is a bit more verbose:
            {noformat}(Cmd) Error backing up cluster: failed to execute cluster operations: failed to execute bucket operations: failed to transfer bucket data for bucket '<bucket_name>': failed to transfer key value data: failed to transfer key value data: failed to open vBucket: failed to open vBucket <vbucket_no>: failed to open vBucket <vbucket_no>: failed to open index: failed to set 'user_version': disk I/O error: disk quota exceeded
            {noformat}
            +Known symptoms:+
             * A slowdown in the backup progress bar, after which the backup fails with the error mentioned above (or with several such errors for different vBuckets, visible in the logs).
             * Closer examination of the first symptom showed that the slowdown is caused by backup files (index and rift data) being opened but not closed by the cbbackupmgr process, which results in it acquiring more and more file handles for different vBuckets.

            +What is causing the issue?+
            Acquiring file handles for files that correspond to more than one vBucket is a known cbbackupmgr quirk (it shouldn't happen when "synchronous backfill" is used) caused by "interleaving" on the KV-Engine side of things (best explained by Dave Rigby in a comment on MB-39503). It only happens for relatively small backups, and on normal disk storage it shouldn't cause any issues. However, EFS has a file lock limit, which can be reached if too many vBucket connections, and therefore files on EFS storage, are open at the same time. That is exactly what is happening in this case; you can confirm it by examining the number of files opened by the cbbackupmgr process just before the error above is thrown, using
            {noformat}sudo lsof -c cbbackupmgr -r1
            {noformat}
            +What is the root cause?+
            After searching for the build that introduced the issue, we established that it was introduced in 6.6.1-9153. Since the only relevant change between builds 9152 and 9153 is an update of the gocbcore (Couchbase Go SDK) revision in the build manifest, we are confident that this is caused by [https://review.couchbase.org/c/gocbcore/+/135469], which is a bug fix for GOCBC-984 (there is no suitable link type for this so I am putting this as "relates to").
            {noformat}Build couchbase-server-6.6.1-9153 contains gocbcore commit 00c424f with commit message:
            GOCBC-984: Fixed DCP backpressure to not block the client.
            {noformat}
            This patch changes the channel through which DCP packets are sent back to the DCP client from unbuffered to buffered, which means sends on it no longer block.
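            The behavioural difference is easiest to see with a small, self-contained Go sketch. This is not gocbcore's code: the packet type, the produce/consume helpers and the timings are invented purely to illustrate why removing the blocking send lets the producer run ahead of a slow consumer, which in turn means more vBucket streams (and their backup files) can be in flight at once.
            {code:go}
package main

import (
    "fmt"
    "time"
)

// packet stands in for a DCP message; it is not a real gocbcore type.
type packet struct{ vbID int }

// produce sends n packets for a single vBucket down ch. With an unbuffered
// channel every send blocks until the consumer is ready (backpressure); with
// a buffered channel the producer can queue packets and move straight on.
func produce(ch chan packet, vbID, n int) {
    for i := 0; i < n; i++ {
        ch <- packet{vbID: vbID}
    }
}

// consume drains ch slowly, standing in for a client writing to EFS-backed
// SQLite files.
func consume(ch chan packet, done chan struct{}) {
    for range ch {
        time.Sleep(10 * time.Millisecond)
    }
    close(done)
}

// run reports how long the producer takes to hand off one vBucket's packets.
func run(buffer int) time.Duration {
    ch := make(chan packet, buffer)
    done := make(chan struct{})
    go consume(ch, done)

    start := time.Now()
    produce(ch, 0, 8)
    elapsed := time.Since(start)

    close(ch)
    <-done
    return elapsed
}

func main() {
    fmt.Println("unbuffered producer time:", run(0)) // held back by the consumer
    fmt.Println("buffered producer time:", run(8))   // finishes almost immediately
}
{code}
            With the unbuffered channel the producer is effectively rate-limited to the consumer's pace, so the client tends to finish one vBucket's files before the next gets going; with the buffered channel the producer can move on early, which can widen the interleaving window described above.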

            +What's the workaround?+
            Currently we don't think we can solve this on the cbbackupmgr side, but we are going to discuss potential solutions with the SDK team. For now the potential workarounds are:
             # Increasing the value supplied via the --threads flag when performing a backup (this could help depending on the size of the backup and is not guaranteed to work); see the example after this list
             # Backing up to regular disk storage and then copying the backup to the EFS storage
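            For reference, a backup invocation with a raised thread count could look like the following; the archive path, repository name, cluster address and credentials are placeholders, and the right --threads value depends on the environment:
            {noformat}cbbackupmgr backup --archive /mnt/efs/backup_archive --repo example_repo \
                --cluster couchbase://10.0.0.1 --username Administrator --password password \
                --threads 16
            {noformat}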

            +What's the fix?+
            We can disable SQLite file locking by switching out the VFS for either unix-none or win32-none, both of which have no-op locking operations.
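            This is not the actual patch, just a minimal sketch of the idea, assuming the mattn/go-sqlite3 driver and a placeholder database path (cbbackupmgr's real SQLite bindings may differ). SQLite's URI filename syntax lets a connection pick a VFS via the vfs query parameter, and the built-in unix-none VFS (win32-none on Windows) implements its locking methods as no-ops, so no POSIX locks are taken on the EFS-backed file:
            {code:go}
package main

import (
    "database/sql"
    "fmt"
    "log"

    _ "github.com/mattn/go-sqlite3" // driver choice is an assumption for this sketch
)

func main() {
    // Placeholder path; the "vfs=unix-none" URI parameter selects the stock
    // unix VFS variant whose locking callbacks do nothing, so the EFS
    // per-process file lock limit can no longer be hit by these files.
    dsn := "file:/mnt/efs/backup_archive/example.sqlite?vfs=unix-none"

    db, err := sql.Open("sqlite3", dsn)
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()

    // With locking disabled, the application must itself guarantee that only
    // one process writes to the file at a time.
    if _, err := db.Exec("PRAGMA user_version = 1"); err != nil {
        log.Fatal(err)
    }

    var version int
    if err := db.QueryRow("PRAGMA user_version").Scan(&version); err != nil {
        log.Fatal(err)
    }
    fmt.Println("user_version:", version)
}
{code}
            Setting user_version here mirrors the step that fails in the error above; the trade-off is that cross-process protection now has to come from the application rather than from SQLite's file locks.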

            +*TL;DR* with some background context:+
            1) Backup uses "synchronous backfill" when streaming via DCP
            2) This causes each vBucket to be backfilled from disk in turn, rather than round-robin
            3) We did this for backup to S3 (but it had benefits for EFS)
            4) EFS only allows 256 file locks per-process
            5) We use SQLite, which in turn uses file locks
            6) There'll be one lock per vBucket being streamed at once
            7) Synchronous backfill allowed us to avoid streaming too many vBuckets at once (and not hit this limit)
            8) Synchronous backfill has an interesting side effect (MB-39503) termed "interlacing", where at the boundary between vBuckets you could have multiple open at once
            9) Prior to GOCBC-984, this worked very well; the fix for GOCBC-984 was to use a buffered channel for the DCP buffer queue
            10) With this change, we're more open to being affected by this interlacing (this appears to be due to the way the unbuffered channel previously throttled DCP traffic)
            11) As a result, we're seeing a customer hit the 256 lock limit on EFS (which wasn't an issue on 6.6.0 - confirmed by another customer)
            12) This should generally only affect relatively small backups (larger backups are less prone to interlacing)
            james.lee James Lee made changes -
            Link This issue is duplicated by MB-51937 [ MB-51937 ]
            james.lee James Lee made changes -
            Affects Version/s 7.1 [ 18333 ]
            Affects Version/s 7.1.0 [ 18356 ]
            Affects Version/s 7.0.4 [ 18322 ]
            Affects Version/s 6.6.0 [ 16787 ]
            Affects Version/s 7.0.0 [ 17233 ]
            james.lee James Lee made changes -
            Fix Version/s Morpheus [ 17651 ]
            Fix Version/s 7.1.1 [ 18320 ]
            james.lee James Lee made changes -
            Labels candidate-for-7.1.1
            james.lee James Lee made changes -
            Assignee Maksimiljans Januska [ JIRAUSER26064 ] James Lee [ james.lee ]
            owend Daniel Owen made changes -
            Priority Major [ 3 ] Critical [ 2 ]
            owend Daniel Owen made changes -
            Priority Critical [ 2 ] Major [ 3 ]
            owend Daniel Owen made changes -
            Labels candidate-for-7.1.1 approved-for-7.1.1 candidate-for-7.1.1
            owend Daniel Owen made changes -
            Link This issue blocks MB-51648 [ MB-51648 ]
            james.lee James Lee made changes -
            Assignee James Lee [ james.lee ] Maksimiljans Januska [ JIRAUSER26064 ]
            Resolution Fixed [ 1 ]
            Status Open [ 1 ] Resolved [ 5 ]
            lynn.straus Lynn Straus made changes -
            Affects Version/s 7.1.0.x [ 18356 ]
            Affects Version/s 7.1.0 [ 17615 ]
            james.lee James Lee made changes -
            Link This issue relates to CBSE-11969 [ CBSE-11969 ]
            wayne Wayne Siu made changes -
            Link This issue blocks MB-52510 [ MB-52510 ]
            lynn.straus Lynn Straus made changes -
            Link This issue blocks MB-51648 [ MB-51648 ]
            chanabasappa.ghali Chanabasappa Ghali made changes -
            Assignee Maksimiljans Januska [ JIRAUSER26064 ] Gilad Kalchheim [ JIRAUSER30694 ]
            gilad.kalchheim Gilad Kalchheim made changes -
            Status Resolved [ 5 ] Closed [ 6 ]

            People

              gilad.kalchheim Gilad Kalchheim
              maks.januska Maksimiljans Januska