Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-36934

Investigate optimising HashTable::findInner when prepared StoredValues not present

    XMLWordPrintable

Details

    Description

      (Separated out from MB-36827)

      The hottest function in the profile for workers in this benchmark is HashTable::findInner - the actual function which searches the HashTable for a given key. This is used during both set() and get() to locate a StoredValue to operate on:

      Note this function has been changed as part of Sync Replication - it now returns pointers to committed and prepared StoredValues in the HashTable. Due to the way this is implemented, it must now scan the entire HashTable chain looking for committed and prepared StoredValues - this was not the case pre-SyncWrites (i.e. 6.0) where it would return immediately as soon as a match was found (only one match possible).

      Given this particular benchmark doesn't use SyncWrites, this extra walking of the HashTable chain is potentially expensive, as it basically involves walking a tiny linked lists which are likely not in CPU cache.

      I performed an experiment to see what happens if the function is essentially reverted to the Alice behaviour - return as soon as a Committed item is found, avoiding unnecessary linked list traversal (see http://review.couchbase.org/117972):

      @@ -327,6 +322,7 @@ HashTable::FindInnerResult HashTable::findInner(const DocKey& key) {
                   } else {
                       Expects(!foundCmt);
                       foundCmt = v;
      +                break;
                   }
               }
           }
      

      This was put into a toy build (based off build 4788) and run as: http://perf.jenkins.couchbase.com/job/triton/28607/

      This gives the following performance compared to builds 4722 and 4788:

      (Full comparison: http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=triton_650-4772_access_bfcb&snapshot=triton_650-4788_access_9395&snapshot=triton_650-42002_access_6285 )

      Build Max Throughput (ops/s)
      6.5.0-4788 3,111,731
      6.5.0-42002 (toy: HashTable early return) 3,265,769

      i.e. 5% greater Max Throughput as measured by perfrunner, and overall higher throughput as seen in the graph.

      Note This change by itself is not valid; we obviously do need to keep looking for prepared values in case they do exist. However, it does highlight that there's significant performance to be gained if we can avoid unnecessarily walking the HashTable chain when we know there is no prepare present.

      The current implementation of that code dates to (http://review.couchbase.org/#/c/109320/4/engines/ep/src/hash_table.cc) - prior to that only the first StoredValue (prepared or committed) which existed in the HashTable chain was returned. The chain was arranged such that a prepared StoredValue (if one existed for a given key) was before any committed StoredValue - hence as soon as a committed StoredValue was found the search knew there was no prepared StoredValue for this key. The aforementioned patch removed that optimisation when expanding the ways the HashTable could be searched, however I believe conceptually we could restore this layout and hence in the case of a non-SyncWrite workload stop searching once a committed StoredValue was found (i.e. what the toy build does).

      Attachments

        Issue Links

          For Gerrit Dashboard: MB-36934
          # Subject Branch Project Status CR V

          Activity

            wayne Wayne Siu added a comment -

            Reviewed in the Maintenance Meeting 01/20/2020, moving it out of 6.5.1.

            wayne Wayne Siu added a comment - Reviewed in the Maintenance Meeting 01/20/2020, moving it out of 6.5.1.
            drigby Dave Rigby added a comment -

            Adding into the "KV: Sync Rep v2" Epic given this is a performance improvement for Sync replication.

            drigby Dave Rigby added a comment - Adding into the "KV: Sync Rep v2" Epic given this is a performance improvement for Sync replication.

            Ephemeral/warmup makes this a little bit more complicated as we either keep completed pending StoredValues around (and do need to return them) or insert committed StoredValues after pending ones. I think we can get the same performance though if we do the following.

            If we are inserting any StoredValue and none of the others exist then we can place it at the head of the chain. If we are inserting a prepare and a commit exists that we can place it at the head of the chain, we don't need to place it directly before the committed StoredValue. If we are inserting a committed StoredValue when a pending StoredValue already exists (ephemeral/warmup) then we can pass the pending StoredValue (from the original lookup) to the HashTable and use that to chain from to prevent us from having to do another lookup.

            ben.huddleston Ben Huddleston added a comment - Ephemeral/warmup makes this a little bit more complicated as we either keep completed pending StoredValues around (and do need to return them) or insert committed StoredValues after pending ones. I think we can get the same performance though if we do the following. If we are inserting any StoredValue and none of the others exist then we can place it at the head of the chain. If we are inserting a prepare and a commit exists that we can place it at the head of the chain, we don't need to place it directly before the committed StoredValue. If we are inserting a committed StoredValue when a pending StoredValue already exists (ephemeral/warmup) then we can pass the pending StoredValue (from the original lookup) to the HashTable and use that to chain from to prevent us from having to do another lookup.
            ben.huddleston Ben Huddleston added a comment - - edited

            Benchmark results of the above change. Running a toy build now

            Before:

            EMEAUKMAC01:kv_engine benhuddleston$ ./ep_engine_benchmarks --benchmark_filter=HashTable
            2021-01-21 15:46:07
            Running ./ep_engine_benchmarks
            Run on (8 X 2900 MHz CPU s)
            CPU Caches:
              L1 Data 32 KiB (x4)
              L1 Instruction 32 KiB (x4)
              L2 Unified 256 KiB (x4)
              L3 Unified 8192 KiB (x1)
            Load Average: 2.09, 9.95, 11.06
            --------------------------------------------------------------------------------------------------------------------------------
            Benchmark                                                                      Time             CPU   Iterations UserCounters...
            --------------------------------------------------------------------------------------------------------------------------------
            HashTableBench/FindForRead/iterations:100000/threads:8                      66.8 ns          431 ns       800000 items_per_second=2.31992M/s
            HashTableBench/FindForWrite/iterations:100000/threads:8                     50.8 ns          332 ns       800000 items_per_second=3.0077M/s
            HashTableBench/Insert/iterations:100000/threads:8                            256 ns         1247 ns       800000 items_per_second=801.673k/s
            HashTableBench/Replace/iterations:100000/threads:8                           201 ns         1344 ns       800000 items_per_second=744.039k/s
            HashTableBench/Delete/iterations:100000/threads:8                           49.2 ns          369 ns       800000 items_per_second=2.71034M/s
            HashTableBench/MultiCollectionInsert/1/iterations:100000/threads:8           278 ns         1236 ns       800000 items_per_second=808.862k/s
            HashTableBench/MultiCollectionInsert/8/iterations:100000/threads:8           442 ns         1519 ns       800000 items_per_second=658.423k/s
            HashTableBench/MultiCollectionInsert/64/iterations:100000/threads:8          471 ns         1321 ns       800000 items_per_second=756.997k/s
            ^[HashTableBench/MultiCollectionInsert/512/iterations:100000/threads:8         495 ns         1644 ns       800000 items_per_second=608.388k/s
            HashTableBench/MultiCollectionInsert/1000/iterations:100000/threads:8        511 ns         1579 ns       800000 items_per_second=633.177k/s
            HashTableBench/HTStatsEpilogue/1/iterations:100000/threads:8                31.6 ns          233 ns       800000 items_per_second=4.28752M/s
            HashTableBench/HTStatsEpilogue/8/iterations:100000/threads:8                42.5 ns          294 ns       800000 items_per_second=3.40502M/s
            HashTableBench/HTStatsEpilogue/64/iterations:100000/threads:8               39.2 ns          279 ns       800000 items_per_second=3.58396M/s
            HashTableBench/HTStatsEpilogue/512/iterations:100000/threads:8              37.2 ns          275 ns       800000 items_per_second=3.63441M/s
            HashTableBench/HTStatsEpilogue/1000/iterations:100000/threads:8             39.0 ns          286 ns       800000 items_per_second=3.49357M/s
            

            After:

            EMEAUKMAC01:kv_engine benhuddleston$ ./ep_engine_benchmarks --benchmark_filter=HashTable
            2021-01-21 21:54:09
            Running ./ep_engine_benchmarks
            Run on (8 X 2900 MHz CPU s)
            CPU Caches:
              L1 Data 32 KiB (x4)
              L1 Instruction 32 KiB (x4)
              L2 Unified 256 KiB (x4)
              L3 Unified 8192 KiB (x1)
            Load Average: 9.93, 14.31, 10.02
            --------------------------------------------------------------------------------------------------------------------------------
            Benchmark                                                                      Time             CPU   Iterations UserCounters...
            --------------------------------------------------------------------------------------------------------------------------------
            HashTableBench/FindForRead/iterations:100000/threads:8                      64.3 ns          408 ns       800000 items_per_second=2.45284M/s
            HashTableBench/FindForWrite/iterations:100000/threads:8                     46.0 ns          303 ns       800000 items_per_second=3.29789M/s
            HashTableBench/Insert/iterations:100000/threads:8                            250 ns         1203 ns       800000 items_per_second=830.937k/s
            HashTableBench/Replace/iterations:100000/threads:8                           156 ns         1010 ns       800000 items_per_second=990.451k/s
            HashTableBench/Delete/iterations:100000/threads:8                           47.6 ns          355 ns       800000 items_per_second=2.81455M/s
            HashTableBench/MultiCollectionInsert/1/iterations:100000/threads:8           227 ns         1127 ns       800000 items_per_second=887.12k/s
            HashTableBench/MultiCollectionInsert/8/iterations:100000/threads:8           466 ns         1643 ns       800000 items_per_second=608.525k/s
            HashTableBench/MultiCollectionInsert/64/iterations:100000/threads:8          477 ns         1572 ns       800000 items_per_second=636.131k/s
            HashTableBench/MultiCollectionInsert/512/iterations:100000/threads:8         475 ns         1602 ns       800000 items_per_second=624.097k/s
            HashTableBench/MultiCollectionInsert/1000/iterations:100000/threads:8        486 ns         1529 ns       800000 items_per_second=654.168k/s
            HashTableBench/HTStatsEpilogue/1/iterations:100000/threads:8                31.6 ns          244 ns       800000 items_per_second=4.09968M/s
            HashTableBench/HTStatsEpilogue/8/iterations:100000/threads:8                45.8 ns          330 ns       800000 items_per_second=3.02843M/s
            HashTableBench/HTStatsEpilogue/64/iterations:100000/threads:8               34.4 ns          242 ns       800000 items_per_second=4.13734M/s
            HashTableBench/HTStatsEpilogue/512/iterations:100000/threads:8              34.4 ns          252 ns       800000 items_per_second=3.96126M/s
            HashTableBench/HTStatsEpilogue/1000/iterations:100000/threads:8             31.7 ns          222 ns       800000 items_per_second=4.50722M/s
            

            ben.huddleston Ben Huddleston added a comment - - edited Benchmark results of the above change. Running a toy build now Before: EMEAUKMAC01:kv_engine benhuddleston$ ./ep_engine_benchmarks --benchmark_filter=HashTable 2021-01-21 15:46:07 Running ./ep_engine_benchmarks Run on (8 X 2900 MHz CPU s) CPU Caches: L1 Data 32 KiB (x4) L1 Instruction 32 KiB (x4) L2 Unified 256 KiB (x4) L3 Unified 8192 KiB (x1) Load Average: 2.09, 9.95, 11.06 -------------------------------------------------------------------------------------------------------------------------------- Benchmark Time CPU Iterations UserCounters... -------------------------------------------------------------------------------------------------------------------------------- HashTableBench/FindForRead/iterations:100000/threads:8 66.8 ns 431 ns 800000 items_per_second=2.31992M/s HashTableBench/FindForWrite/iterations:100000/threads:8 50.8 ns 332 ns 800000 items_per_second=3.0077M/s HashTableBench/Insert/iterations:100000/threads:8 256 ns 1247 ns 800000 items_per_second=801.673k/s HashTableBench/Replace/iterations:100000/threads:8 201 ns 1344 ns 800000 items_per_second=744.039k/s HashTableBench/Delete/iterations:100000/threads:8 49.2 ns 369 ns 800000 items_per_second=2.71034M/s HashTableBench/MultiCollectionInsert/1/iterations:100000/threads:8 278 ns 1236 ns 800000 items_per_second=808.862k/s HashTableBench/MultiCollectionInsert/8/iterations:100000/threads:8 442 ns 1519 ns 800000 items_per_second=658.423k/s HashTableBench/MultiCollectionInsert/64/iterations:100000/threads:8 471 ns 1321 ns 800000 items_per_second=756.997k/s ^[HashTableBench/MultiCollectionInsert/512/iterations:100000/threads:8 495 ns 1644 ns 800000 items_per_second=608.388k/s HashTableBench/MultiCollectionInsert/1000/iterations:100000/threads:8 511 ns 1579 ns 800000 items_per_second=633.177k/s HashTableBench/HTStatsEpilogue/1/iterations:100000/threads:8 31.6 ns 233 ns 800000 items_per_second=4.28752M/s HashTableBench/HTStatsEpilogue/8/iterations:100000/threads:8 42.5 ns 294 ns 800000 items_per_second=3.40502M/s HashTableBench/HTStatsEpilogue/64/iterations:100000/threads:8 39.2 ns 279 ns 800000 items_per_second=3.58396M/s HashTableBench/HTStatsEpilogue/512/iterations:100000/threads:8 37.2 ns 275 ns 800000 items_per_second=3.63441M/s HashTableBench/HTStatsEpilogue/1000/iterations:100000/threads:8 39.0 ns 286 ns 800000 items_per_second=3.49357M/s After: EMEAUKMAC01:kv_engine benhuddleston$ ./ep_engine_benchmarks --benchmark_filter=HashTable 2021-01-21 21:54:09 Running ./ep_engine_benchmarks Run on (8 X 2900 MHz CPU s) CPU Caches: L1 Data 32 KiB (x4) L1 Instruction 32 KiB (x4) L2 Unified 256 KiB (x4) L3 Unified 8192 KiB (x1) Load Average: 9.93, 14.31, 10.02 -------------------------------------------------------------------------------------------------------------------------------- Benchmark Time CPU Iterations UserCounters... -------------------------------------------------------------------------------------------------------------------------------- HashTableBench/FindForRead/iterations:100000/threads:8 64.3 ns 408 ns 800000 items_per_second=2.45284M/s HashTableBench/FindForWrite/iterations:100000/threads:8 46.0 ns 303 ns 800000 items_per_second=3.29789M/s HashTableBench/Insert/iterations:100000/threads:8 250 ns 1203 ns 800000 items_per_second=830.937k/s HashTableBench/Replace/iterations:100000/threads:8 156 ns 1010 ns 800000 items_per_second=990.451k/s HashTableBench/Delete/iterations:100000/threads:8 47.6 ns 355 ns 800000 items_per_second=2.81455M/s HashTableBench/MultiCollectionInsert/1/iterations:100000/threads:8 227 ns 1127 ns 800000 items_per_second=887.12k/s HashTableBench/MultiCollectionInsert/8/iterations:100000/threads:8 466 ns 1643 ns 800000 items_per_second=608.525k/s HashTableBench/MultiCollectionInsert/64/iterations:100000/threads:8 477 ns 1572 ns 800000 items_per_second=636.131k/s HashTableBench/MultiCollectionInsert/512/iterations:100000/threads:8 475 ns 1602 ns 800000 items_per_second=624.097k/s HashTableBench/MultiCollectionInsert/1000/iterations:100000/threads:8 486 ns 1529 ns 800000 items_per_second=654.168k/s HashTableBench/HTStatsEpilogue/1/iterations:100000/threads:8 31.6 ns 244 ns 800000 items_per_second=4.09968M/s HashTableBench/HTStatsEpilogue/8/iterations:100000/threads:8 45.8 ns 330 ns 800000 items_per_second=3.02843M/s HashTableBench/HTStatsEpilogue/64/iterations:100000/threads:8 34.4 ns 242 ns 800000 items_per_second=4.13734M/s HashTableBench/HTStatsEpilogue/512/iterations:100000/threads:8 34.4 ns 252 ns 800000 items_per_second=3.96126M/s HashTableBench/HTStatsEpilogue/1000/iterations:100000/threads:8 31.7 ns 222 ns 800000 items_per_second=4.50722M/s

            Whilst the benchmarks show some performance gain the perf runs of this toy build against the triton daily tests show no real difference (all within the typical margin of error)

            Test 7.0.0-4225 Toy Build
            KV : Pillowfight, 20/80 R/W, 256B binary items 3,273,871 3,215,584
            KV : Pillowfight, 80/20 R/W, 256B binary items 4,756,763 4,822,130
            KV : Pillowfight, 50/50 R/W, 256B binary items, Ephemeral 4,057,900 4,060,273
            ben.huddleston Ben Huddleston added a comment - Whilst the benchmarks show some performance gain the perf runs of this toy build against the triton daily tests show no real difference (all within the typical margin of error) Test 7.0.0-4225 Toy Build KV : Pillowfight, 20/80 R/W, 256B binary items 3,273,871 3,215,584 KV : Pillowfight, 80/20 R/W, 256B binary items 4,756,763 4,822,130 KV : Pillowfight, 50/50 R/W, 256B binary items, Ephemeral 4,057,900 4,060,273
            ben.huddleston Ben Huddleston added a comment - The java-dcp-client test may show an improvement with this change. See comment here - https://issues.couchbase.com/browse/MB-45071?focusedCommentId=487132&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-487132 .

            A toy build and run of the java-dcp throughput test didn't result in a higher throughput value (175,786). http://perf.jenkins.couchbase.com/job/athena/1559. Suspect that the worker is the bottleneck in the test at the moment, rather than the backfill thread. Couldn't get a perf profile as we didn't appear to install the debug symbols for the toy build. Raised CBPS-910 for that issue.

            ben.huddleston Ben Huddleston added a comment - A toy build and run of the java-dcp throughput test didn't result in a higher throughput value (175,786). http://perf.jenkins.couchbase.com/job/athena/1559 . Suspect that the worker is the bottleneck in the test at the moment, rather than the backfill thread. Couldn't get a perf profile as we didn't appear to install the debug symbols for the toy build. Raised CBPS-910 for that issue.

            People

              ben.huddleston Ben Huddleston
              drigby Dave Rigby
              Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

                Created:
                Updated:

                PagerDuty