Description
What is the problem?
cbbackupmgr has a merge command which is intended to reduce disk usage by deduplicating the mutations for each key across a series of backups. Currently, however, the merge command does not significantly reduce the disk space used, because it does not perform this deduplication in the data file.
In SQLite/ForestDB we got the deduplication for "free" because the document value was stored in the index, and each key had only one entry in the index. In Rift the index and the data are split, and the data file is append-only. This means the same document can be appended to the data file multiple times, even if it is only in the index once.
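To make the issue concrete, here is a toy model of a split index plus append-only data file. This is only an illustrative sketch: the `store` type and its `put` method are hypothetical names and do not reflect the actual Rift code or on-disk format.

```go
package main

import "fmt"

// store is a hypothetical, simplified stand-in for a split index/data layout.
// It is NOT the real Rift implementation.
type store struct {
	data  []string       // append-only "data file": one record per write
	index map[string]int // key -> offset of the latest record in data
}

func newStore() *store {
	return &store{index: make(map[string]int)}
}

// put appends the value to the data file and points the index at it.
// Older records for the same key stay in the data file.
func (s *store) put(key, value string) {
	s.index[key] = len(s.data)
	s.data = append(s.data, value)
}

func main() {
	s := newStore()
	s.put("doc1", "v1")
	s.put("doc1", "v2")
	s.put("doc1", "v3")

	// One index entry, but three records taking up space in the data file.
	fmt.Println(len(s.index), len(s.data))
}
```

In this model the index only ever holds one entry for `doc1`, while the data file keeps growing with every write to it.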
What is the solution?
A couple of ideas (both from James Lee):
- Merge backwards (newest backup first). If we do this, then whenever we see a key for a second time we can simply ignore it, because the first occurrence is already the most recent one. This isn't true when merging forwards, because we always want to keep the last mutation/deletion associated with a key.
- Do the merge as normal, but deduplicate the data file afterwards.
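
A rough sketch of the first idea (merge backwards), assuming each backup can be read as an ordered list of mutations. The `Mutation` type and `mergeBackwards` function are hypothetical names used for illustration, not existing cbbackupmgr code, and keeping deletions in the output is also an assumption.

```go
package main

import "fmt"

// Mutation is a hypothetical, simplified record used only for this sketch.
type Mutation struct {
	Key     string
	Value   string
	Deleted bool
}

// mergeBackwards walks the backups from newest to oldest (and each backup
// from its last mutation to its first), keeping only the first mutation
// seen for each key. Because we iterate backwards, the first one seen is
// the most recent, so any later sighting of the same key can be skipped.
// Deletions are kept in the output; this is an assumption of the sketch.
func mergeBackwards(backups [][]Mutation) []Mutation {
	seen := make(map[string]struct{})
	var merged []Mutation

	for i := len(backups) - 1; i >= 0; i-- {
		backup := backups[i]
		for j := len(backup) - 1; j >= 0; j-- {
			m := backup[j]
			if _, ok := seen[m.Key]; ok {
				continue // an equal or newer mutation was already written
			}
			seen[m.Key] = struct{}{}
			merged = append(merged, m)
		}
	}
	return merged
}

func main() {
	// Backups are ordered oldest to newest.
	backups := [][]Mutation{
		{{Key: "a", Value: "1"}, {Key: "b", Value: "1"}},
		{{Key: "a", Value: "2"}},
		{{Key: "b", Deleted: true}, {Key: "c", Value: "1"}},
	}
	for _, m := range mergeBackwards(backups) {
		fmt.Printf("key=%s deleted=%t value=%q\n", m.Key, m.Deleted, m.Value)
	}
}
```

With this approach each key is appended to the merged data file at most once, at the cost of tracking the set of keys already seen during the merge.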