Details
- Type: Improvement
- Resolution: Fixed
- Priority: Major
- Fix Version: 4.5.0

Description
Couchstore currently performs IO buffering on reads and writes to the filesystem (see http://src.couchbase.org/source/xref/trunk/couchstore/src/iobuffer.cc).
This has two benefits - it coalesces reads/writes into buffer-sized chunks (reducing the number of syscalls), and it caches a (limited) number of file blocks, so subsequent accesses to the same block do not need to go to the OS at all.
The current read buffer size is 8192 bytes. I'm not sure of the exact reason for this; it may be to match the underlying couchstore file format block size.
One observed issue with an 8K buffer is that it will typically result in reading 2x 4kB pages from the OS buffer cache. This is unlikely to be an issue if there is ample buffer cache - say if the couchstore file completely resides in RAM - but in DGM scenarios this may not be the case. In such instances we are potentially doubling the number of disk reads (reading two adjacent 4kB pages) when the requested data could be retrieved from a single page.
Given that in DGM scenarios the buffer cache will be a precious resource, we should aim to maximise the value we get from it (i.e. try to keep as much of the B-Tree in the buffer cache as we can) - and by always reading 8K we are potentially "polluting" the buffer cache with pages which aren't relevant to the B-Tree.
We should create some couchstore benchmarks to see what effect changing to 4K buffers has (say, changing from 8x 8kB buffers to 16x 4kB buffers).
Benchmarking Notes
These benchmarks should attempt to model DGM scenarios such that the buffer cache cannot fulfil all reads. This could either be done by creating couchstore files which (significantly) exceed available RAM, or by instructing the OS not to cache our test files - see nocache for an external tool which can do this (and also the relevant libc calls to control this on Linux).
We should also look at access patterns:
- Sequential access may be of some interest, although it is unlikely to represent real-world usage. (Note that compaction, while sequential to a degree, uses a totally different code path, and given it's a background task it is probably not a major concern.)
- Random access is perhaps more interesting; however, few accesses are truly random - especially given that the hottest data will be held in the HashTable.
- A Zipfian (https://en.wikipedia.org/wiki/Zipf%27s_law) distribution is probably more representative of real-world data access, and so would be interesting to look at.