[CBM] Improve resilience to power outage/kill -9 scenarios
Activity
James Lee April 13, 2021 at 9:42 AM
I'm not too worried about this issue at the moment. I've had a chance to look at the code, and it looks like 'examinador' is using 'SIGKILL' to terminate the 'cbbackupmgr' process. With this in mind, the following information from the man page for the 'write' syscall is interesting.
write man page
With that in mind, I suspect that we killed the backup process while it was creating an SQLite file, so it's quite possible that a partial SQLite file was left on disk (which would explain the error when attempting to query the 'user_version'). Unfortunately, we don't have the file, so we can't determine this for certain; at the moment, this is just a hypothesis.
If this was the case, I don't believe there's much we could do to rectify this issue, since 'SIGKILL' can't be "caught" and handled.
What's the issue?
Although we do have logic to handle unexpected failures such as power outages and a 'kill -9', it's currently built on unsafe assumptions. These assumptions are:
That SQLite will be able to recover from a power outage (by default, this should be the case; however, we disable journals and syncing)
That the 'RiftBufferedWriter' will be able to recover from a power outage (we use the 'sync_file_range' syscall as a performance optimization; this isn't safe on some filesystems, as file metadata will not be written out as it would be with 'fdatasync').
Example
Some of our testing will prematurely 'kill -9' 'cbbackupmgr' in an effort to test resume support; this has led to situations where an invalid/corrupt SQLite file is detected. See this case where one of our tests has failed due to a 'database disk image malformed' error.
What's the fix?
Ideally, we should better handle these situations where possible:
Move away from using the truncate-overwrite pattern
Enable syncing for SQLite (potentially enable journals, although we need to consider that not all journal types are supported on NFS)
Periodically sync using 'fsync' or 'fdatasync' in the 'RiftBufferedWriter'