Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-36424

couchstore node flushing doesn't respect node quota size

    XMLWordPrintable

Details

    • Triaged
    • Yes
    • KV-Engine Mad-Hatter GA

    Description

      Summary

      When couchstore writes out modified node elements during document saves, it it supposed to limit the number of bytes written in a single node to a _chunk_threshold - by default 1279 bytes. If the node is larger than the limit it should be split into multiple sibling nodes.

      However, this limit is not respected, resulting in overly-large nodes being written out. In the case of the by-seqno B-Tree (which always writes values to the rightmost leaf as seqnos are increasing), it results in all leaf elements residing in a single leaf node. Moreover, this means that adding another element to the B-Tree effectively re-writes the entire tree, resulting in massive Write Amplification.

      Steps to Reproduce

      1. Start 2-node cluster run (single replica):

        ./cluster_run --nodes=2
        

      2. Start a SyncWrite workload; single threaded client updating the same 100,000 items (each key 10 times) with level=persistMajority:

        ./engines/ep/management/sync_repl.py localhost:12000 Administrator asdasd default loop_bulk_setD key value 100000 1000000 3
        

      3. Observe the op/s and Write Amplification

      Expected Results

      • Op/s should be broadly constant (given both key and seqno B-Tree should have a constant number of items in them).
      • Write amplification should also be broadly constant.

      Actual Results

      • The op/s quickly drops from a peak of ~600 down to 150:
      • The Write Amplification increase (corresponding with the top in op/s) from 5.6x, up to 743x .
      • Both op/s and Write Amplification temporarily recover when compaction occurs, but same pattern is observed over time:

      Attachments

        Issue Links

          No reviews matched the request. Check your Options in the drop-down menu of this sections header.

          Activity

            drigby Dave Rigby added a comment - - edited

            Problem is in the handling of the node quota in flush_mr_partial() :-

            static couchstore_error_t flush_mr_partial(couchfile_modify_result *res, size_t mr_quota)
            {
                ...
                //We don't care that we've reached mr_quota if we haven't written out
                //at least two items and we're not writing a leaf node.
                while (i != NULL && (mr_quota > 0 || (itmcount < 2 && res->node_type == KP_NODE))) {
                    dst = static_cast<char*>(write_kv(dst, i->key, i->data));
                    if (i->pointer) {
                        subtreesize += i->pointer->subtreesize;
                    }
                    mr_quota -= i->key.size + i->data.size + sizeof(raw_kv_length);
                    final_key = i->key;
                    i = i->next;
                    res->count--;
                    itmcount++;
                }
            

            Note how one of the while loop terminating conditions is if the mr_quota goes below zero (i.e. quota has been consumed). mr_quota is provided as an input parameter, and then decremented inside the while loop.

            However, the type of mr_quota is an unsigned size_t - as such when it is decremented in the while loop after writing the key/value pair it goes to a number close to SIZE_T_MAX, not a negative value. As a result the while loop is not exited.

            This bug has been present since July 2012 - was introduced via http://review.couchbase.org/#/c/18008/

            To illustrate the problem, here's couch_dbdump after a simple test program to load 16 keys into a couchstore file. couchstore was modified to artificially limit maximum node size to 50B (down from 1.2K) to allow the problem to be shown with just a few docs.

            First is the byid B-Tree - which has a sensible tree structure:

            Dumping "test.couch":
            + (843) {00000000 10000000 00000000 000000d0 }
              + (361) {00000000 09000000 00000000 00000075 }
                + (43) {00000000 01000000 00000000 0000000d }
                  * (13)  id:(collection:0x6b) ey_0
                + (43) {00000000 01000000 00000000 0000000d }
                  * (13)  id:(collection:0x6b) ey_1
                + (89) {00000000 04000000 00000000 00000034 }
                  * (13)  id:(collection:0x6b) ey_10
                  * (13)  id:(collection:0x6b) ey_11
                  * (13)  id:(collection:0x6b) ey_12
                  * (13)  id:(collection:0x6b) ey_13
                + (78) {00000000 03000000 00000000 00000027 }
                  * (13)  id:(collection:0x6b) ey_14
                  * (13)  id:(collection:0x6b) ey_15
                  * (13)  id:(collection:0x6b) ey_2
              + (92) {00000000 01000000 00000000 0000000d }
                + (43) {00000000 01000000 00000000 0000000d }
                  * (13)  id:(collection:0x6b) ey_3
              + (293) {00000000 06000000 00000000 0000004e }
                + (43) {00000000 01000000 00000000 0000000d }
                  * (13)  id:(collection:0x6b) ey_4
                + (43) {00000000 01000000 00000000 0000000d }
                  * (13)  id:(collection:0x6b) ey_5
                + (43) {00000000 01000000 00000000 0000000d }
                  * (13)  id:(collection:0x6b) ey_6
                + (73) {00000000 03000000 00000000 00000027 }
                  * (13)  id:(collection:0x6b) ey_7
                  * (13)  id:(collection:0x6b) ey_8
                  * (13)  id:(collection:0x6b) ey_9
             
            Total docs: 16
            

            Secondly the seqno B-Tree (showing the bug) - note how all elements are contained in a single leaf node. Any additon / removal from the tree requires re-writing the entire leaf node:

            Dumping "test.couch":
            + (254) {00000000 10}
              * (13) #1  id:(collection:0x6b) ey_0
              * (13) #2  id:(collection:0x6b) ey_1
              * (13) #3  id:(collection:0x6b) ey_2
              * (13) #4  id:(collection:0x6b) ey_3
              * (13) #5  id:(collection:0x6b) ey_4
              * (13) #6  id:(collection:0x6b) ey_5
              * (13) #7  id:(collection:0x6b) ey_6
              * (13) #8  id:(collection:0x6b) ey_7
              * (13) #9  id:(collection:0x6b) ey_8
              * (13) #10  id:(collection:0x6b) ey_9
              * (13) #11  id:(collection:0x6b) ey_10
              * (13) #12  id:(collection:0x6b) ey_11
              * (13) #13  id:(collection:0x6b) ey_12
              * (13) #14  id:(collection:0x6b) ey_13
              * (13) #15  id:(collection:0x6b) ey_14
              * (13) #16  id:(collection:0x6b) ey_15
             
            Total docs: 16
            

            drigby Dave Rigby added a comment - - edited Problem is in the handling of the node quota in flush_mr_partial() :- static couchstore_error_t flush_mr_partial(couchfile_modify_result *res, size_t mr_quota) { ... //We don't care that we've reached mr_quota if we haven't written out //at least two items and we're not writing a leaf node. while (i != NULL && (mr_quota > 0 || (itmcount < 2 && res->node_type == KP_NODE))) { dst = static_cast < char *>(write_kv(dst, i->key, i->data)); if (i->pointer) { subtreesize += i->pointer->subtreesize; } mr_quota -= i->key.size + i->data.size + sizeof (raw_kv_length); final_key = i->key; i = i->next; res->count--; itmcount++; } Note how one of the while loop terminating conditions is if the mr_quota goes below zero (i.e. quota has been consumed). mr_quota is provided as an input parameter, and then decremented inside the while loop. However, the type of mr_quota is an unsigned size_t - as such when it is decremented in the while loop after writing the key/value pair it goes to a number close to SIZE_T_MAX, not a negative value. As a result the while loop is not exited. This bug has been present since July 2012 - was introduced via http://review.couchbase.org/#/c/18008/ To illustrate the problem, here's couch_dbdump after a simple test program to load 16 keys into a couchstore file. couchstore was modified to artificially limit maximum node size to 50B (down from 1.2K) to allow the problem to be shown with just a few docs. First is the byid B-Tree - which has a sensible tree structure: Dumping "test.couch": + (843) {00000000 10000000 00000000 000000d0 } + (361) {00000000 09000000 00000000 00000075 } + (43) {00000000 01000000 00000000 0000000d } * (13) id:(collection:0x6b) ey_0 + (43) {00000000 01000000 00000000 0000000d } * (13) id:(collection:0x6b) ey_1 + (89) {00000000 04000000 00000000 00000034 } * (13) id:(collection:0x6b) ey_10 * (13) id:(collection:0x6b) ey_11 * (13) id:(collection:0x6b) ey_12 * (13) id:(collection:0x6b) ey_13 + (78) {00000000 03000000 00000000 00000027 } * (13) id:(collection:0x6b) ey_14 * (13) id:(collection:0x6b) ey_15 * (13) id:(collection:0x6b) ey_2 + (92) {00000000 01000000 00000000 0000000d } + (43) {00000000 01000000 00000000 0000000d } * (13) id:(collection:0x6b) ey_3 + (293) {00000000 06000000 00000000 0000004e } + (43) {00000000 01000000 00000000 0000000d } * (13) id:(collection:0x6b) ey_4 + (43) {00000000 01000000 00000000 0000000d } * (13) id:(collection:0x6b) ey_5 + (43) {00000000 01000000 00000000 0000000d } * (13) id:(collection:0x6b) ey_6 + (73) {00000000 03000000 00000000 00000027 } * (13) id:(collection:0x6b) ey_7 * (13) id:(collection:0x6b) ey_8 * (13) id:(collection:0x6b) ey_9   Total docs: 16 Secondly the seqno B-Tree (showing the bug) - note how all elements are contained in a single leaf node. Any additon / removal from the tree requires re-writing the entire leaf node: Dumping "test.couch": + (254) {00000000 10} * (13) #1 id:(collection:0x6b) ey_0 * (13) #2 id:(collection:0x6b) ey_1 * (13) #3 id:(collection:0x6b) ey_2 * (13) #4 id:(collection:0x6b) ey_3 * (13) #5 id:(collection:0x6b) ey_4 * (13) #6 id:(collection:0x6b) ey_5 * (13) #7 id:(collection:0x6b) ey_6 * (13) #8 id:(collection:0x6b) ey_7 * (13) #9 id:(collection:0x6b) ey_8 * (13) #10 id:(collection:0x6b) ey_9 * (13) #11 id:(collection:0x6b) ey_10 * (13) #12 id:(collection:0x6b) ey_11 * (13) #13 id:(collection:0x6b) ey_12 * (13) #14 id:(collection:0x6b) ey_13 * (13) #15 id:(collection:0x6b) ey_14 * (13) #16 id:(collection:0x6b) ey_15   Total docs: 16
            drigby Dave Rigby added a comment - - edited

            Interestingly, enabling the UBSan switch -fsanitize=unsigned-integer-overflow flags this bug:

            $ /Users/dave/repos/couchbase/server/source/build-asan-ubsan/couchstore/couchstore_gtest
            [==========] Running 40 tests from 4 test cases.
            [----------] Global test environment set-up.
            [----------] 24 tests from CouchstoreTest
            ...
            [ RUN      ] CouchstoreTest.changes_no_dups
            ../couchstore/src/btree_modify.cc:244:18: runtime error: unsigned integer overflow: 46 - 51 cannot be represented in type 'unsigned long'
            SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior ../couchstore/src/btree_modify.cc:244:18 in 
            Abort trap: 6
            

            (btree_modify.cc line 244 is the decrement of mr_quota:

            244
                    mr_quota -= i->key.size + i->data.size + sizeof(raw_kv_length);
            

            drigby Dave Rigby added a comment - - edited Interestingly, enabling the UBSan switch -fsanitize=unsigned-integer-overflow flags this bug: $ /Users/dave/repos/couchbase/server/source/build-asan-ubsan/couchstore/couchstore_gtest [==========] Running 40 tests from 4 test cases. [----------] Global test environment set-up. [----------] 24 tests from CouchstoreTest ... [ RUN ] CouchstoreTest.changes_no_dups ../couchstore/src/btree_modify.cc:244:18: runtime error: unsigned integer overflow: 46 - 51 cannot be represented in type 'unsigned long' SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior ../couchstore/src/btree_modify.cc:244:18 in Abort trap: 6 (btree_modify.cc line 244 is the decrement of mr_quota : 244 mr_quota -= i->key.size + i->data.size + sizeof (raw_kv_length);
            drigby Dave Rigby added a comment -

            Reverting the type of mr_quota back to int gives the following op/s for the same test:

            • The op/s is broadly constant at ~800 op/s - the drop to zero is during compaction.
            • The Write Amplification is much more constant, ranging from 5.2x to 11.3x
            drigby Dave Rigby added a comment - Reverting the type of mr_quota back to int gives the following op/s for the same test: The op/s is broadly constant at ~800 op/s - the drop to zero is during compaction. The Write Amplification is much more constant, ranging from 5.2x to 11.3x

            Build couchbase-server-6.5.0-4549 contains couchstore commit c879f0d with commit message:
            MB-36424: Ensure flush_mr_partial() obeys node size quota

            build-team Couchbase Build Team added a comment - Build couchbase-server-6.5.0-4549 contains couchstore commit c879f0d with commit message: MB-36424 : Ensure flush_mr_partial() obeys node size quota

            Build couchbase-server-7.0.0-1009 contains couchstore commit c879f0d with commit message:
            MB-36424: Ensure flush_mr_partial() obeys node size quota

            build-team Couchbase Build Team added a comment - Build couchbase-server-7.0.0-1009 contains couchstore commit c879f0d with commit message: MB-36424 : Ensure flush_mr_partial() obeys node size quota

            The tests done by development as part of the fix verified the fix.

            srinath.duvuru Srinath Duvuru added a comment - The tests done by development as part of the fix verified the fix.

            People

              drigby Dave Rigby
              drigby Dave Rigby
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Gerrit Reviews

                  There are no open Gerrit changes

                  PagerDuty