What's the issue?
Although we have logic to handle unexpected failures such as power outages and a 'kill -9', it currently rests on unsafe assumptions:
- That SQLite will be able to recover from a power outage (this is the case by default; however, we disable journals and syncing)
- That the 'RiftBufferedWriter' will be able to recover from a power outage (we use the 'sync_file_range' syscall as a performance optimization; this isn't safe on some filesystems because file metadata will not be written out as it would be with 'fdatasync').
Some of our tests prematurely 'kill -9' 'cbbackupmgr' to exercise resume support; this has led to situations where an invalid/corrupt SQLite file is detected. See this case where one of our tests failed due to a 'database disk image malformed' error.
What's the fix?
Ideally, we should better handle these situations where possible:
- Move away from using the truncate-overwrite pattern (
- Enable syncing for SQLite (potentially enable journals, although we need to consider that not all journal types are supported on NFS)
- Periodically sync using 'fsync' or 'fdatasync' in the 'RiftBufferedWriter'