Details
Bug | Resolution: Unresolved | Major | 4.5.0 | None | Untriaged | Unknown
Description
The backup client stores data from backups very similarly to how I think we will be storing data with ep-engine. I have one or more files with the vbuckets partitioned between them, and when there are multiple files, a single writer accesses each file. When I store data, I store the key and value and encode the metadata into binary to keep it small. My metadata, for example, is always 32 bytes.
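As a rough illustration of fixed-size binary metadata, here is a minimal sketch in Python. The field names and layout (cas, seqno, rev_seqno, flags, expiry) are assumptions chosen only so that the record adds up to 32 bytes; they are not the backup client's actual format.

```python
import struct

# Hypothetical layout (not the backup client's real format):
# three unsigned 64-bit fields and two unsigned 32-bit fields,
# big-endian with no padding: 8 + 8 + 8 + 4 + 4 = 32 bytes.
META_FORMAT = ">QQQII"
META_SIZE = struct.calcsize(META_FORMAT)


def encode_meta(cas, seqno, rev_seqno, flags, expiry):
    """Pack the metadata fields into a fixed-size binary record."""
    return struct.pack(META_FORMAT, cas, seqno, rev_seqno, flags, expiry)


def decode_meta(blob):
    """Unpack a fixed-size binary record back into a tuple of fields."""
    return struct.unpack(META_FORMAT, blob)


# Example values are made up for illustration.
packed = encode_meta(cas=1467156201108829854, seqno=3912, rev_seqno=1, flags=0, expiry=0)
print(META_SIZE, len(packed))
```

Because every record has the same size, the per-document metadata overhead is constant, so it cannot by itself explain a file size that grows faster than the data.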
I loaded 1M items into the default bucket and also installed the travel-sample bucket. The travel-sample bucket ends up smaller when backed up (with compression): 24MB backed up vs. 109MB on the server. The default bucket, on the other hand, is 213MB backed up vs. 139MB on the server.
After doing some investigation with Sundar, it appears that ForestDB is copying some data into the index (I'm presuming the key), whereas Couchstore refers to the key from the index. I suspect this possible data duplication is the cause of the difference in file sizes. See below for why I think this is the case.
Mikes-MacBook-Pro:default mikewied$ ~/couchbase/spock/install/bin/couch_dbinfo 0.couch.3
DB Info (0.couch.3) - header at 147456
file format version: 12
update_seq: 3912
purge_seq: 0
crc: CRC-32C
doc count: 978
deleted doc count: 0
data size: 140.1 kB
B-tree size: 60.6 kB
total disk size: 144.1 kB
forestdb_dump /tmp/backup/comp/2016-06-28T16_23_21.108829854-07_00/default-2c3f46599d3d4906ca20baff5a9e7adc/data/shard_0.fdb --header-only
- live index nodes: 16419 (67252224 bytes)
Total document size: 156032078 bytes, (index: 156032078 bytes, WAL: 0 bytes)
KV store name: partition0
- documents in the main index: 978, 0deleted / in WAL: 0 (insert), 0 (remove)
- live index nodes: 17 (69632 bytes)
Total document size: 152468 bytes
Last sequence number: 3912
The total size of the ForestDB file is around 213MB, and from forestdb_dump we can see that the data and index take up most of this space (156032078 bytes data + 67252224 bytes index = 223284302 bytes, or about 213MB). If we look at the couchstore file, we can see that for one vbucket the sizes reported for the data and the index do not add up to the file size (140.1 kB data + 60.6 kB index = 200.7 kB). Notice, however, that the total size of the file is only 144.1 kB. There is clearly some overlap in the counting, and I suspect that it is because the couchstore index has references to the keys instead of copies.
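The arithmetic above can be checked with a short Python sketch. It uses only the figures from the couch_dbinfo and forestdb_dump output; the variable names are illustrative, and the script only verifies the sums, not the suspected cause.

```python
# Figures copied from the forestdb_dump and couch_dbinfo output above.
FDB_DOC_BYTES = 156_032_078    # "Total document size" for the whole file
FDB_INDEX_BYTES = 67_252_224   # "live index nodes" for the whole file
MIB = 1024 * 1024

fdb_total_bytes = FDB_DOC_BYTES + FDB_INDEX_BYTES
fdb_total_mib = fdb_total_bytes / MIB

COUCH_DATA_KB = 140.1          # "data size"
COUCH_BTREE_KB = 60.6          # "B-tree size"
COUCH_DISK_KB = 144.1          # "total disk size"

couch_sum_kb = COUCH_DATA_KB + COUCH_BTREE_KB
# Amount by which data + B-tree exceeds the size of the file on disk.
couch_overlap_kb = couch_sum_kb - COUCH_DISK_KB

print(f"ForestDB data + index: {fdb_total_bytes} bytes ({fdb_total_mib:.1f} MiB)")
print(f"Couchstore data + B-tree: {couch_sum_kb:.1f} kB, on disk: {COUCH_DISK_KB} kB")
print(f"Couchstore counted twice: {couch_overlap_kb:.1f} kB")
```

For this vbucket, the couchstore data and B-tree figures exceed the file size by roughly 56.6 kB, which is the overlap referred to above.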
I have a data loader that can be used to reproduce this setup. If needed, please let me know and I can push it somewhere, but I think the same thing will happen no matter what the dataset is, as long as it contains 1M keys.