Details

Type: Improvement
Resolution: Unresolved
Priority: Major
Fix Version/s: feature-backlog
Affects Version/s: 6.0.0, 6.0.1, 6.0.2, 6.0.3, 6.0.4, 6.0.5, 6.5.1, 6.6.0, 6.6.1, 6.6.2, 6.5.2, 6.5.0, 6.6.3, 7.0.0, 7.0.1, 7.1.0
Component/s: tools
Labels:
None

Story Points:
1

Description

What's the issue?
Although we do have logic to handle unexpected failures such as power outages and a 'kill -9', it currently rests on two unsafe assumptions:

- That SQLite will be able to recover from a power outage (by default this should be the case; however, we disable journals and syncing).
- That the 'RiftBufferedWriter' will be able to recover from a power outage (we use the 'sync_file_range' syscall as a performance optimization; this isn't safe on some filesystems because file metadata will not be written out as it would be with 'fdatasync').

Example
Some of our tests prematurely 'kill -9' 'cbbackupmgr' to exercise resume support; this has led to situations where an invalid/corrupt SQLite file is detected. See this case where one of our tests failed due to a 'database disk image malformed'.

What's the fix?
Ideally, we should better handle these situations where possible:

- Move away from using the truncate-overwrite pattern (~~MB-46878~~)
- Enable syncing for SQLite (potentially enable journals, although we need to consider that not all journal types are supported on NFS)
- Periodically sync using 'fsync' or 'fdatasync' in the 'RiftBufferedWriter'

Attachments


- cbbackupmgr-collectinfo-backups-2021-04-13T082352.zip (142 kB, 13/Apr/21 1:57 AM)
- collectinfo-2021-04-13T083943-n_0@cb.local.zip (9.96 MB, 13/Apr/21 1:58 AM)
- collectinfo-2021-04-13T083120-n_0@cb.local.zip (24.73 MB, 13/Apr/21 1:58 AM)
- report.html (245 kB, 13/Apr/21 1:58 AM)
- log.html (4.11 MB, 13/Apr/21 1:58 AM)
- output.xml (24.57 MB, 13/Apr/21 1:59 AM)

Issue Links

is duplicated by

- MB-48485 [CBM] Failed to resume backup with error ' (bucket0) (vb 867) Received an unexpected error from the sink callback, beginning teardown' (Closed)

relates to

- MB-46878 Where possible 'cbbackupmgr' should avoid the truncate/overwrite pattern (Closed)
- MB-45685 [Backup Service] [Investigate] Database disk image malformed (Closed)


People

Assignee: James Lee

Reporter: James Lee

Votes: 0

Watchers: 2

Dates

Created: 13/Apr/21 1:51 AM

Updated: 23/May/22 1:21 AM


[CBM] Improve resilience to power outage/kill -9 scenarios
