Summary
The couchstore fragmentation calculation does not take completed Prepares into account. As a result, auto-compaction is not run when expected, and hence the completed (no longer needed) Prepares are not purged.
Details
Using cbc-pillowfight to load 4M documents into two buckets: one using level=none, one with level=majority (same eviction policy, same compaction threshold of 10%):
level=none
|
cbc-pillowfight -U 127.0.0.1:9000/default -u Administrator -P asdasd --num-items=4000000 -m 4096 -M 4096 --random-body --populate-only --num-threads=10
|
level=majority
|
cbc-pillowfight -U 127.0.0.1:9000/majority -u Administrator -P asdasd --num-items=4000000 -m 4096 -M 4096 --random-body --populate-only --num-threads=20 --durability=majority
|
(Note: I also ran with a reduced vBucket count of 4, to make it easier to load a large number of documents (1M) per vBucket.)
This results in the level=majority load consuming 2x the disk space:
- level=none
$ couch_dbinfo --local default/0.couch.1
|
DB Info (default/0.couch.1) - total disk size: 4.007 GB
   crc: CRC-32C

Header at file offset 4302372864
   file format version: 13
   update_seq: 1000000
   purge_seq: 0
   timestamp: 1970-01-19T14:30:05.060290+01:00
   doc count: 1000000
   deleted doc count: 0
   data size: 3.885 GB
   B-tree size: 58.38 MB
   └── by-id tree: 27.38 MB
   └── by-seqno tree: 31.00 MB
   └── local size: 429 bytes
|
- level=majority
$ couch_dbinfo --local majority/0.couch.2
|
DB Info (majority/0.couch.2) - total disk size: 8.017 GB
   crc: CRC-32C

Header at file offset 8608538624
   file format version: 13
   update_seq: 2000000
   purge_seq: 0
   timestamp: 1970-01-19T14:30:05.854591+01:00
   doc count: 2000000
   deleted doc count: 0
   data size: 7.784 GB
   B-tree size: 131.16 MB
   └── by-id tree: 62.12 MB
   └── by-seqno tree: 69.04 MB
   └── local size: 436 bytes
|
Interestingly, the fragmentation percentage (measured as (couch_docs_actual_data_size - couch_docs_data_size) / couch_docs_actual_data_size) is around 3%. However, if compaction is run manually on the "majority" bucket (via the UI), the disk usage shrinks to almost half:
$ couch_dbinfo --local majority/0.couch.3
|
DB Info (majority/0.couch.3) - total disk size: 3.894 GB
   crc: CRC-32C

Header at file offset 4181381120
   file format version: 13
   update_seq: 2000000
   purge_seq: 0
   timestamp: 1970-01-19T14:30:05.854591+01:00
   doc count: 1000000
   deleted doc count: 0
   data size: 3.894 GB
   B-tree size: 68.07 MB
   └── by-id tree: 33.96 MB
   └── by-seqno tree: 34.11 MB
   └── local size: 434 bytes
|
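To make the mismatch concrete, here is a quick worked calculation plugging the vBucket 0 numbers above into the fragmentation formula quoted earlier. This is a minimal illustrative sketch (values taken from the couch_dbinfo output; the bucket-level stats behave the same way), not actual product code:
|
#include <cstdio>

int main() {
    // Stat values taken from the couch_dbinfo output for majority/0.couch.2
    // (pre-compaction), expressed in GB.
    const double actualDataSize = 8.017; // total disk (file) size
    const double dataSize = 7.784;       // couchstore's "valid" data size,
                                         // which still includes the 1M
                                         // completed prepares

    // Fragmentation as currently measured (formula quoted above):
    const double reported = (actualDataSize - dataSize) / actualDataSize;
    std::printf("reported fragmentation: %.1f%%\n", reported * 100); // ~2.9%

    // If the completed prepares were instead treated as garbage, the valid
    // data would be ~3.894GB (the post-compaction data size above), giving:
    const double trueValid = 3.894;
    const double actual = (actualDataSize - trueValid) / actualDataSize;
    std::printf("reclaimable space:      %.1f%%\n", actual * 100); // ~51%

    // ~2.9% never crosses the 10% auto-compaction threshold, even though
    // roughly half the file could be reclaimed.
    return 0;
}
|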
What appears to be happening here is that the fragmentation calculation is incorrect - the on-disk Prepares (which have all been committed) are not counted as "overhead", and are instead treated as "valid" documents. This means auto-compaction hasn't run when it would be expected to. When it does run, however, these prepares can all be discarded, and hence the file size after compaction is similar to the level=none case.
The reason why this happens is that couch_disk_data_size (size of "valid" data on disk) is calculated directly from couchstore's own count of how much data is in the current B-Tree root.
However, completed Prepares are still conceptually "valid" data from couchstore's POV - they are just documents with a different key prefix which happen to have a seqno below the high_completed_seqno. As such, couch_disk_data_size includes all prepares, outstanding and completed.
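For reference, a minimal sketch of where that B-Tree-root-derived figure is visible via couchstore's public API. This assumes couchstore_db_info() / DbInfo (space_used, file_size) is the relevant interface and is illustrative only - it is not the actual KV-Engine stat code:
|
#include <libcouchstore/couch_db.h>
#include <cstdio>

int main(int argc, char** argv) {
    if (argc < 2) {
        std::fprintf(stderr, "usage: %s <vbucket file>\n", argv[0]);
        return 1;
    }
    Db* db = nullptr;
    if (couchstore_open_db(argv[1], COUCHSTORE_OPEN_FLAG_RDONLY, &db) !=
        COUCHSTORE_SUCCESS) {
        return 1;
    }

    DbInfo info{};
    couchstore_db_info(db, &info);

    // space_used is couchstore's own notion of "live" data (derived from the
    // B-Tree root). Completed prepares are still live documents from
    // couchstore's POV, so they are counted here rather than as reclaimable
    // overhead - which is why the fragmentation estimate looks healthy.
    std::printf("file_size:  %llu\n", (unsigned long long)info.file_size);
    std::printf("space_used: %llu\n", (unsigned long long)info.space_used);

    couchstore_close_file(db);
    couchstore_free_db(db);
    return 0;
}
|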
Addressing this with the current file format is likely to be difficult - the obvious (but expensive) method to accurately measure the size of completed prepares with the current couchstore schema would be to perform a B-Tree seqno scan from 0 to the highCompletedSeqno, accumulating the size of all prepares found (see the sketch below). However, that is an O(N) operation, where N is the number of completed prepares, so it is not really suitable for the ~1s polling which ns_server performs.
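For illustration, a rough sketch of what such a scan could look like using couchstore's changes-by-seqno iteration (couchstore_changes_since). The prepare-namespace check, the HCS input and the DocInfo size field name are placeholders/assumptions rather than KV-Engine code:
|
#include <libcouchstore/couch_db.h>
#include <cstdint>
#include <cstdio>
#include <cstdlib>

// Placeholder: in KV-Engine, prepares live in a separate key namespace
// (a distinct key prefix). The real check would decode the DocKey; the
// namespace byte used here is an assumed value for illustration only.
static bool isPrepareKey(const sized_buf& key) {
    constexpr char assumedPrepareNamespace = 0x02; // placeholder, not the real constant
    return key.size > 0 && key.buf[0] == assumedPrepareNamespace;
}

struct ScanCtx {
    uint64_t highCompletedSeqno = 0; // would come from the vbucket state
    uint64_t completedPrepareBytes = 0;
};

// Invoked for every item in the by-seqno B-Tree, starting from seqno 0.
// (Per couchstore's callback contract, returning 0 frees the DocInfo and
// continues the walk; a negative return could cancel it once we pass the HCS.)
static int countCompletedPrepares(Db*, DocInfo* info, void* ctx) {
    auto* scan = static_cast<ScanCtx*>(ctx);
    if (info->db_seq <= scan->highCompletedSeqno && isPrepareKey(info->id)) {
        // 'physical_size' field name is an assumption; older couchstore
        // versions name this field differently.
        scan->completedPrepareBytes += info->physical_size;
    }
    return 0;
}

int main(int argc, char** argv) {
    if (argc < 3) {
        std::fprintf(stderr, "usage: %s <vbucket file> <highCompletedSeqno>\n",
                     argv[0]);
        return 1;
    }
    Db* db = nullptr;
    if (couchstore_open_db(argv[1], COUCHSTORE_OPEN_FLAG_RDONLY, &db) !=
        COUCHSTORE_SUCCESS) {
        return 1;
    }

    ScanCtx scan;
    scan.highCompletedSeqno = std::strtoull(argv[2], nullptr, 10);

    // O(N) walk of the seqno index - exactly the cost that makes this
    // unsuitable for ~1s polling.
    couchstore_changes_since(db, 0, /*options*/ 0, countCompletedPrepares, &scan);

    std::printf("completed prepare bytes: %llu\n",
                (unsigned long long)scan.completedPrepareBytes);

    couchstore_close_file(db);
    couchstore_free_db(db);
    return 0;
}
|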