[CBM] Improve resilience to power outage/kill -9 scenarios

Description

What's the issue?
Although we do have logic to handle unexpected failures such as power outages and a 'kill -9', it's currently built on unsafe assumptions. These assumptions are:

That SQLite will be able to recover from a power outage (by default this should be the case; however, we disable journals and syncing).
That the 'RiftBufferedWriter' will be able to recover from a power outage (we use the 'sync_file_range' syscall as a performance optimization; this isn't safe on some filesystems because file metadata will not be written out as it would be with 'fdatasync').
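
To make the first assumption concrete, here is a minimal sketch of what "disable journals and syncing" looks like at the SQLite level, and how the effective settings can be read back. It uses Python's standard 'sqlite3' module purely for illustration; the function names are hypothetical and this is not the code 'cbbackupmgr' uses.

```python
import sqlite3


def durability_settings(conn):
    """Return (journal_mode, synchronous) as SQLite reports them."""
    journal_mode = conn.execute("PRAGMA journal_mode").fetchone()[0]
    synchronous = conn.execute("PRAGMA synchronous").fetchone()[0]
    return journal_mode, synchronous


def open_fast_but_unsafe(path):
    """Hypothetical helper: journaling and syncing off (fast, not crash safe)."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA journal_mode=OFF")   # no rollback journal
    conn.execute("PRAGMA synchronous=OFF")    # no fsync after writes
    return conn


def open_durable(path):
    """Hypothetical helper: rollback journal on, full syncing."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA journal_mode=DELETE")
    conn.execute("PRAGMA synchronous=FULL")
    return conn
```

Both pragmas here are per-connection settings, so they have to be applied each time the file is opened. With 'synchronous=OFF', SQLite hands writes to the operating system without waiting for them to reach disk, which is why a power outage can leave the file corrupt.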

Example
Some of our testing will prematurely 'kill -9' 'cbbackupmgr' in an effort to test resume support; this has led to situations where an invalid/corrupt SQLite file is detected. See this case where one of our tests has failed due to a 'database disk image malformed' error.

What's the fix?
Ideally, we should better handle these situations where possible:

Move away from using the truncate-overwrite pattern
Enable syncing for SQLite (potentially enable journals, although we need to consider that not all journal types are supported on NFS)
Periodically sync using 'fsync' or 'fdatasync' in the 'RiftBufferedWriter'
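
As a rough illustration of the first and third points, the sketch below writes to a temporary file, syncs periodically while writing, and only then renames it over the destination, so the existing file is never truncated in place. It is written in Python for brevity and assumes a POSIX filesystem; 'write_durably', the '.tmp' suffix and the 'sync_every' threshold are hypothetical names and values, not anything taken from the 'RiftBufferedWriter'.

```python
import os

# os.fdatasync is not available on every platform; fall back to fsync.
_data_sync = getattr(os, "fdatasync", os.fsync)


def write_durably(path, chunks, sync_every=8 * 1024 * 1024):
    """Write chunks to path via a temp file, syncing as we go."""
    tmp_path = path + ".tmp"  # hypothetical naming scheme
    unsynced = 0
    fd = os.open(tmp_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        for chunk in chunks:
            view = memoryview(chunk)
            while view:  # os.write may write fewer bytes than requested
                written = os.write(fd, view)
                view = view[written:]
                unsynced += written
            if unsynced >= sync_every:
                _data_sync(fd)  # bound how much data can be lost
                unsynced = 0
        os.fsync(fd)  # flush remaining data and metadata
    finally:
        os.close(fd)
    # Atomic on POSIX: readers see either the old file or the new one.
    os.replace(tmp_path, path)
    # Sync the parent directory so the rename itself is persisted.
    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

If the process is killed part-way through, the destination still holds its previous contents and only the temporary file is incomplete, so a resume can discard it. The cost is extra sync calls, which is the performance trade-off the 'sync_file_range' optimization was trying to avoid.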

Components

Affects versions

Fix versions

feature-backlog

Labels

None

Environment

None

Release Notes Description

None

Attachments

Linked issues

is duplicated by

MB-48485

[CBM] Failed to resume backup with error ' (bucket0) (vb 867) Received an unexpected error from the sink callback, beginning teardown'

relates to

MB-45685

[Backup Service] [Investigate] Database disk image malformed

MB-46878

Where possible 'cbbackupmgr' should avoid the truncate/overwrite pattern

Activity

James Lee April 13, 2021 at 9:42 AM

I'm not too worried about this issue at the moment; I've had a chance to look at the code and it looks like 'examinador' is using 'SIGKILL' to terminate the 'cbbackupmgr' process. With this in mind, the following information from the 'write' syscall's man page is interesting.

write man page

With that in mind, I suspect what's happened here is that we've killed the backup process whilst it was actually creating an SQLite file; it's quite possible that we've got a partial SQLite file on disk (which would explain the error when attempting to query the 'user_version'). Unfortunately, we don't have the file, so we can't determine this for certain; at the moment, this is just a hypothesis.

If this was the case, I don't believe there's much we could do to rectify this issue since 'SIGKILL' can't be "caught" and handled.
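
For reference, a partial file of the kind hypothesised above can at least be detected when a backup is resumed, even though the 'SIGKILL' itself can't be handled. The following is a minimal sketch in Python, assuming the check runs before the file is used; the function name is hypothetical and this is not how 'cbbackupmgr' currently behaves.

```python
import sqlite3

# Every SQLite 3 database file starts with this 16-byte header string.
SQLITE_MAGIC = b"SQLite format 3\x00"


def looks_like_complete_sqlite_file(path):
    """Best-effort check that path is a readable SQLite database."""
    with open(path, "rb") as f:
        if f.read(len(SQLITE_MAGIC)) != SQLITE_MAGIC:
            return False  # empty, truncated before the header, or not SQLite
    try:
        conn = sqlite3.connect(path)
        try:
            # Reading 'user_version' forces SQLite to parse the file header.
            conn.execute("PRAGMA user_version").fetchone()
            result = conn.execute("PRAGMA quick_check").fetchone()[0]
        finally:
            conn.close()
    except sqlite3.DatabaseError:
        return False  # e.g. "database disk image is malformed"
    return result == "ok"
```

A check like this only tells us that the file is unusable; it can't recover the data, so the caller would still need to decide whether to recreate the file or fail the resume.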

Details
Assignee
James Lee
Reporter
James Lee
Story Points
1
Priority
Major
Created April 13, 2021 at 8:51 AM

Updated May 23, 2022 at 8:21 AM
