  1. Couchbase Kubernetes
  2. K8S-3574

[Backup] Remove `staging-lock.lk` and `lock-<uuid>.lk`


Details

    • Bug
    • Resolution: Unresolved
    • Major
    • None
    • None
    • operator-backup
    • None
    • 0

    Description

      What's the issue?

      2024-07-15T17:11:41 ERROR Stdout: b"Backup repository creation failed: failed to lock staging directory: timed out after 1m0s waiting for an exclusive lock to populate staging directory, please try again or use '--log-level debug' for more information"
      

      We recently hit this error when backing up our Prod DB; it was resolved by re-creating the physical volume used for the staging directory.

      This error is caused by a '/data/staging/staging-lock.lk' file being orphaned by a cancelled, killed or crashed pod.

      The pods subsequently created to run the backup fail with this error because they can't remove the lockfile, due to the monotonic/ephemeral nature of the hostnames.

      What's the fix?
      In Capella, where we suffer from the same problem, we strictly maintain the invariant that only one instance of 'cbbackupmgr' runs at a time, so these files can safely be removed at startup.

      The operator backup should do the same, as long as it can guarantee that only one instance of 'cbbackupmgr' is interacting with the staging directory/archive.
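
      A minimal sketch of the startup cleanup proposed above, under stated assumptions: the helper name `remove_stale_locks` is hypothetical, the file name patterns come from the issue title, and the location of the `lock-<uuid>.lk` files is not stated here, so the caller passes in the directories to scan. It is only safe if the single-instance guarantee described above actually holds.

```python
import glob
import os

# File name patterns taken from the issue title. Where `lock-<uuid>.lk`
# lives is an assumption, so the caller supplies the directories to scan.
LOCK_PATTERNS = ("staging-lock.lk", "lock-*.lk")


def remove_stale_locks(directories):
    """Delete leftover lock files and return the sorted paths removed.

    Hypothetical helper: only safe when the caller guarantees that no
    other cbbackupmgr instance is using these directories.
    """
    removed = []
    for directory in directories:
        for pattern in LOCK_PATTERNS:
            # Escape the directory so special characters in it are literal.
            for path in glob.glob(os.path.join(glob.escape(directory), pattern)):
                if os.path.isfile(path):
                    os.remove(path)
                    removed.append(path)
    return sorted(removed)
```

      For example, calling `remove_stale_locks(["/data/staging"])` before starting the backup would clear an orphaned '/data/staging/staging-lock.lk' and leave every other file in place.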


            People

              justin.ashworth Justin Ashworth
              james.lee James Lee