Details
Bug
Resolution: Unresolved
Major
Description
What's the issue?
2024-07-15T17:11:41 ERROR Stdout: b"Backup repository creation failed: failed to lock staging directory: timed out after 1m0s waiting for an exclusive lock to populate staging directory, please try again or use '--log-level debug' for more information"
We recently hit this error when backing up our Prod DB; it was resolved by re-creating the physical volume used for the staging directory.
This error is caused by a '/data/staging/staging-lock.lk' file being orphaned by a cancelled, killed or crashed pod.
Pods subsequently created to run the backup fail with this error because they can't remove the lockfile, due to the monotonic/ephemeral nature of the pod hostnames.
What's the fix?
In Capella, where we suffer the same problem, we strictly maintain the invariant that only one instance of 'cbbackupmgr' is running at any time, meaning these files can safely be removed at startup.
The operator backup should do the same, as long as it can guarantee that only one instance of 'cbbackupmgr' is interacting with the staging directory/archive.
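As a rough illustration of the proposed fix, here is a minimal sketch of a startup step that clears an orphaned lockfile before 'cbbackupmgr' is invoked. It is not the operator's actual code: the function name 'remove_stale_staging_lock' is hypothetical, the lockfile name is taken from the path quoted above, and the sketch is only safe under the assumption that the caller already guarantees a single 'cbbackupmgr' instance per staging directory.

```python
from pathlib import Path

# Lockfile name taken from the path quoted in this ticket
# ('/data/staging/staging-lock.lk').
LOCK_FILENAME = "staging-lock.lk"


def remove_stale_staging_lock(staging_dir: str) -> bool:
    """Delete an orphaned staging lockfile, if one exists.

    ASSUMPTION: the caller guarantees that no other 'cbbackupmgr'
    instance is using this staging directory. Without that guarantee,
    removing the lock could corrupt a backup that is still running.

    Returns True if a lockfile was removed, False if none was present.
    """
    lock_path = Path(staging_dir) / LOCK_FILENAME
    try:
        lock_path.unlink()
    except FileNotFoundError:
        # No previous pod left a lock behind; nothing to clean up.
        return False
    return True
```

The backup script would call this once, before launching 'cbbackupmgr', for example 'remove_stale_staging_lock("/data/staging")'. Attempting the removal directly and tolerating a missing file avoids a separate existence check.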