Details
-
Bug
-
Status: Closed
-
Critical
-
Resolution: Fixed
-
3.1.6, 4.1.2, 4.5.1, 4.6.5, 5.0.1, 6.5.0, 5.1.3, 6.0.2, 6.0.3, 5.5.5
-
Triaged
-
Yes
-
KV-Engine Mad-Hatter GA
Description
Summary
When couchstore writes out modified node elements during document saves, it it supposed to limit the number of bytes written in a single node to a _chunk_threshold - by default 1279 bytes. If the node is larger than the limit it should be split into multiple sibling nodes.
However, this limit is not respected, resulting in overly-large nodes being written out. In the case of the by-seqno B-Tree (which always writes values to the rightmost leaf as seqnos are increasing), it results in all leaf elements residing in a single leaf node. Moreover, this means that adding another element to the B-Tree effectively re-writes the entire tree, resulting in massive Write Amplification.
Steps to Reproduce
- Start 2-node cluster run (single replica):
./cluster_run --nodes=2
- Start a SyncWrite workload; single threaded client updating the same 100,000 items (each key 10 times) with level=persistMajority:
./engines/ep/management/sync_repl.py localhost:12000 Administrator asdasd default loop_bulk_setD key value 100000 1000000 3
- Observe the op/s and Write Amplification
Expected Results
- Op/s should be broadly constant (given both key and seqno B-Tree should have a constant number of items in them).
- Write amplification should also be broadly constant.
Actual Results
- The op/s quickly drops from a peak of ~600 down to 150:
- The Write Amplification increase (corresponding with the top in op/s) from 5.6x, up to 743x
.
- Both op/s and Write Amplification temporarily recover when compaction occurs, but same pattern is observed over time:
Problem is in the handling of the node quota in flush_mr_partial() :-
{
...
subtreesize += i->pointer->subtreesize;
}
final_key = i->key;
i = i->next;
res->count--;
itmcount++;
}
Note how one of the while loop terminating conditions is if the mr_quota goes below zero (i.e. quota has been consumed). mr_quota is provided as an input parameter, and then decremented inside the while loop.
However, the type of mr_quota is an unsigned size_t - as such when it is decremented in the while loop after writing the key/value pair it goes to a number close to SIZE_T_MAX, not a negative value. As a result the while loop is not exited.
This bug has been present since July 2012 - was introduced via http://review.couchbase.org/#/c/18008/
To illustrate the problem, here's couch_dbdump after a simple test program to load 16 keys into a couchstore file. couchstore was modified to artificially limit maximum node size to 50B (down from 1.2K) to allow the problem to be shown with just a few docs.
First is the byid B-Tree - which has a sensible tree structure:
Dumping "test.couch":
+ (843) {00000000 10000000 00000000 000000d0 }
+ (361) {00000000 09000000 00000000 00000075 }
+ (43) {00000000 01000000 00000000 0000000d }
* (13) id:(collection:0x6b) ey_0
+ (43) {00000000 01000000 00000000 0000000d }
* (13) id:(collection:0x6b) ey_1
+ (89) {00000000 04000000 00000000 00000034 }
* (13) id:(collection:0x6b) ey_10
* (13) id:(collection:0x6b) ey_11
* (13) id:(collection:0x6b) ey_12
* (13) id:(collection:0x6b) ey_13
+ (78) {00000000 03000000 00000000 00000027 }
* (13) id:(collection:0x6b) ey_14
* (13) id:(collection:0x6b) ey_15
* (13) id:(collection:0x6b) ey_2
+ (92) {00000000 01000000 00000000 0000000d }
+ (43) {00000000 01000000 00000000 0000000d }
* (13) id:(collection:0x6b) ey_3
+ (293) {00000000 06000000 00000000 0000004e }
+ (43) {00000000 01000000 00000000 0000000d }
* (13) id:(collection:0x6b) ey_4
+ (43) {00000000 01000000 00000000 0000000d }
* (13) id:(collection:0x6b) ey_5
+ (43) {00000000 01000000 00000000 0000000d }
* (13) id:(collection:0x6b) ey_6
+ (73) {00000000 03000000 00000000 00000027 }
* (13) id:(collection:0x6b) ey_7
* (13) id:(collection:0x6b) ey_8
* (13) id:(collection:0x6b) ey_9
Total docs: 16
Secondly the seqno B-Tree (showing the bug) - note how all elements are contained in a single leaf node. Any additon / removal from the tree requires re-writing the entire leaf node:
Dumping "test.couch":
+ (254) {00000000 10}
* (13) #1 id:(collection:0x6b) ey_0
* (13) #2 id:(collection:0x6b) ey_1
* (13) #3 id:(collection:0x6b) ey_2
* (13) #4 id:(collection:0x6b) ey_3
* (13) #5 id:(collection:0x6b) ey_4
* (13) #6 id:(collection:0x6b) ey_5
* (13) #7 id:(collection:0x6b) ey_6
* (13) #8 id:(collection:0x6b) ey_7
* (13) #9 id:(collection:0x6b) ey_8
* (13) #10 id:(collection:0x6b) ey_9
* (13) #11 id:(collection:0x6b) ey_10
* (13) #12 id:(collection:0x6b) ey_11
* (13) #13 id:(collection:0x6b) ey_12
* (13) #14 id:(collection:0x6b) ey_13
* (13) #15 id:(collection:0x6b) ey_14
* (13) #16 id:(collection:0x6b) ey_15
Total docs: 16