Details
- Improvement
- Resolution: Unresolved
- Major
- 6.0.0, 6.0.1, 6.0.2, 6.0.3, 6.0.4, 6.0.5, 6.5.1, 6.6.0, 6.6.1, 6.6.2, 6.5.2, 6.5.0, 6.6.3, 7.0.0, 7.0.1, 7.1.0
- None
- 1
Description
What's the issue?
Although we do have logic to handle unexpected failures such as power outages and a 'kill -9', it is currently built on unsafe assumptions:
- That SQLite will be able to recover from a power outage (this should be the case by default; however, we disable journals and syncing)
- That the 'RiftBufferedWriter' will be able to recover from a power outage (we use the 'sync_file_range' syscall as a performance optimization; this isn't safe on some filesystems because file metadata is not written out as it would be with 'fdatasync').
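To make the first assumption concrete, here is a minimal sketch of what "disabling journals and syncing" looks like at the SQLite level, next to the durable settings. It assumes the behaviour is controlled through the standard 'journal_mode' and 'synchronous' pragmas; the exact settings 'cbbackupmgr' applies are not shown in this ticket, and the 'set_durability' helper and file name are hypothetical.

```python
import os
import sqlite3
import tempfile


def set_durability(conn: sqlite3.Connection, durable: bool) -> tuple:
    """Apply journal/sync settings and return what SQLite reports back."""
    journal = "DELETE" if durable else "OFF"
    sync = "FULL" if durable else "OFF"
    # PRAGMA values cannot be bound as parameters, so they are formatted in
    # from the fixed strings above (never from user input).
    conn.execute(f"PRAGMA journal_mode = {journal}")
    conn.execute(f"PRAGMA synchronous = {sync}")
    mode = conn.execute("PRAGMA journal_mode").fetchone()[0]
    level = conn.execute("PRAGMA synchronous").fetchone()[0]
    return mode, level


def demo() -> dict:
    """Show the reported settings for the fast and the durable configuration."""
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical file name, used only for this illustration.
        conn = sqlite3.connect(os.path.join(tmp, "index.sqlite"))
        try:
            return {
                "fast": set_durability(conn, durable=False),
                "durable": set_durability(conn, durable=True),
            }
        finally:
            conn.close()
```

With the journal off there is no rollback information to recover from after an interrupted write, and with 'synchronous' off SQLite hands data to the operating system without waiting for it to reach stable storage.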
Example
Some of our testing will prematurely 'kill -9' 'cbbackupmgr' in an effort to test resume support; this has led to situations where an invalid/corrupt SQLite file is detected. See this case where one of our tests failed due to a 'database disk image malformed' error.
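For illustration, a damaged file of this kind can be detected up front with SQLite's built-in consistency check. This is a sketch only: the 'is_intact' helper is hypothetical and is not how 'cbbackupmgr' detects corruption.

```python
import sqlite3


def is_intact(path: str) -> bool:
    """Return True if SQLite's own consistency check passes for the file at path."""
    try:
        # Note: connect() creates the file if it does not exist yet.
        conn = sqlite3.connect(path)
        try:
            rows = conn.execute("PRAGMA quick_check").fetchall()
        finally:
            conn.close()
    except sqlite3.DatabaseError:
        # Raised, for example, when the file is malformed or is not a database.
        return False
    # A healthy database reports a single row containing "ok".
    return rows == [("ok",)]
```

'PRAGMA quick_check' is cheaper than 'PRAGMA integrity_check' but skips some index verification, so a resume path that needs stronger guarantees may prefer the latter.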
What's the fix?
Ideally, we should better handle these situations where possible:
- Move away from using the truncate-overwrite pattern (MB-46878)
- Enable syncing for SQLite (potentially enable journals, although we need to consider that not all journal types are supported on NFS)
- Periodically sync using 'fsync' or 'fdatasync' in the 'RiftBufferedWriter'
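Two of the points above can be sketched in a few lines. The example below assumes nothing about the real 'RiftBufferedWriter'; 'PeriodicSyncWriter' and 'atomic_replace' are hypothetical stand-ins that show a writer syncing every N bytes and a write-then-rename alternative to truncate-overwrite.

```python
import os

# os.fdatasync is not available on every platform; fall back to os.fsync.
_sync = getattr(os, "fdatasync", os.fsync)


class PeriodicSyncWriter:
    """Hypothetical buffered writer that syncs every `interval` bytes."""

    def __init__(self, path: str, interval: int = 4 * 1024 * 1024) -> None:
        self._file = open(path, "ab")
        self._interval = interval
        self._unsynced = 0
        self.sync_count = 0  # exposed only so the behaviour is observable

    def write(self, data: bytes) -> None:
        self._file.write(data)
        self._unsynced += len(data)
        if self._unsynced >= self._interval:
            self.sync()

    def sync(self) -> None:
        self._file.flush()          # user-space buffer -> kernel page cache
        _sync(self._file.fileno())  # page cache -> stable storage
        self._unsynced = 0
        self.sync_count += 1

    def close(self) -> None:
        self.sync()
        self._file.close()


def atomic_replace(path: str, data: bytes) -> None:
    """Write data to a temporary sibling, sync it, then rename it over path."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)  # atomic on POSIX when both paths share a filesystem
    if os.name == "posix":
        # Sync the directory so the rename itself survives a power loss.
        dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
        try:
            os.fsync(dir_fd)
        finally:
            os.close(dir_fd)
```

The sync interval trades throughput for the amount of data that can be lost. With the rename approach a reader sees either the old file or the new one, never a truncated file.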
Issue Links
- is duplicated by MB-48485 [CBM] Failed to resume backup with error ' (bucket0) (vb 867) Received an unexpected error from the sink callback, beginning teardown' (Closed)