Details

Type: Bug
Resolution: Unresolved
Priority: Major
Fix Version/s: None
Affects Version/s: None
Component/s: operator-backup
Labels: None
Story Points: 0

Description

What's the issue?

2024-07-15T17:11:41 ERROR Stdout: b"Backup repository creation failed: failed to lock staging directory: timed out after 1m0s waiting for an exclusive lock to populate staging directory, please try again or use '--log-level debug' for more information"

We recently hit this error when backing up our Prod DB. It was resolved by re-creating the physical volume used for the staging directory.

This error is caused by a '/data/staging/staging-lock.lk' file being orphaned by a cancelled, killed or crashed pod.

Pods subsequently created to run the backup fail with this error because they can't remove the lockfile, due to the monotonic/ephemeral nature of the hostnames.

What's the fix?
In Capella, where we suffer the same problems, we strictly maintain the invariant that only one instance of 'cbbackupmgr' will be running at once, meaning these lock files can be removed at startup.

The operator backup should do the same, as long as it can guarantee that only one instance of 'cbbackupmgr' is interacting with the staging directory/archive.
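
A minimal sketch of the startup cleanup described above, not the operator's actual implementation. The function name `remove_stale_locks` and the `archive_dir` parameter are hypothetical. The sketch also assumes that the `lock-<uuid>.lk` files sit at the top level of the archive directory, which this ticket does not confirm. It is only safe when the caller already guarantees that no other 'cbbackupmgr' instance is using the same staging directory/archive.

```python
import glob
import os


def remove_stale_locks(staging_dir, archive_dir):
    """Delete lock files left behind by an earlier backup pod.

    Hypothetical helper: only call this when nothing else is running
    'cbbackupmgr' against the same staging directory/archive.
    Returns the list of paths that were actually removed.
    """
    # The staging lock named in the error, e.g. /data/staging/staging-lock.lk.
    candidates = [os.path.join(staging_dir, "staging-lock.lk")]

    # Assumption: 'lock-<uuid>.lk' files live at the top level of the archive.
    candidates += sorted(glob.glob(os.path.join(archive_dir, "lock-*.lk")))

    removed = []
    for path in candidates:
        try:
            os.remove(path)
        except FileNotFoundError:
            # No orphaned lock at this path; nothing to clean up.
            continue
        removed.append(path)
    return removed
```

Under those assumptions, the backup script would call something like `remove_stale_locks("/data/staging", archive_dir)` once, before invoking 'cbbackupmgr'. Files that do not match the two lock-file patterns are left untouched.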

Attachments

Issue Links

depends on

K8S-3498 Gracefully Handle Overlapping Backups

Open

relates to: AV-82142

Activity

People

Assignee: Justin Ashworth

Reporter: James Lee

Votes: 0

Watchers: 2

Dates

Created: 15/Jul/24 10:28 AM

Updated: 07/Aug/24 2:01 PM

Gerrit Reviews

There are no open Gerrit changes

[Backup] Remove `staging-lock.lk` and `lock-<uuid>.lk`
