Details
-
Task
-
Resolution: Unresolved
-
Major
-
7.1.0
-
None
Description
What's the issue?
When performing a merge with cbbackupmgr we move sequentially through the backups in the repository in chronological order aggregating the backups into a single merged backup. Doing so in this fashion means that we may be writing mutations for keys multiple times before we complete the merge. We could improve the efficiency of the process by going through the backups in reverse order only adding keys that do not exist in the merged backup to the merged backup (determining if a key has already been merge may well be expensive and might mean this is not an effective way of merging data).
What problems does this solve?
- It will mean Rift data stores are automatically compacted (the data will be written sequentially leaving no dead space).
- We might be able to extend similar logic to restore so we don't need to send every mutation in order to get to a consistent state.
Edge cases
- When restoring we need to be careful with deletions. If you imagine the case where you have a mutation, another mutation and a deletion then the deletion will be sent first, then the mutation which will trump the deletion. This state is clearly invalid, so we need to stop it happening. This could be done by keeping track of what deletions are in a backup, and then rejecting a mutation if a later backup had a deletion for that key.
- --force-updates won't work as it will always take the earliest mutation. Fixing this and keeping reverse restore would require knowing whether we've already transferred a mutation for a given key. This is likely to be too expensive so we may need to go back to forward restore for this case.
Optimisations
- When restoring we could just rely on conflict resolution to reject the mutations for keys we've already transferred. This may already be quicker than forward restore as kv wouldn't need to persist the mutation. It would be nicer if we didn't transfer it all however, so we could employ an LRU cache of keys we've sent. Note that something similar also applies to merging - if we're happy to allow some duplication we can trade disk usage off for a quicker merge.
- Given edge case #1 we need to know what backups have what deletions in them. We could do this by looking them up in the index files but it might be quicker to build a bloom filter for each backup before starting the restore.