Couchbase Server / MB-32107

4% throughput regression in KV : Pillowfight, 20/80 R/W, 256B binary items


Details

    • Triage: Untriaged
    • Is this a Regression?: Yes

    Description

      As reported by http://showfast.sc.couchbase.com/daily/#/history/KV%7CPillowfight,%2020/80%20R/W,%20256B%20binary%20items%7CMax%20Throughput%20(ops/sec) :

      The regression appears to have occurred between builds 6.5.0-1545 and 6.5.0-1561:

      Build       Max Throughput (ops/sec)
      6.5.0-1561  1,916,444
      6.5.0-1545  2,009,289

      Attachments

        1. connhandlernew.txt
          4 kB
        2. connhandlerold.txt
          10 kB
        3. dcpconsumernew.txt
          27 kB
        4. dcpconsumerold.txt
          30 kB
        5. dcpproducernew.txt
          25 kB
        6. dcpproducerold.txt
          28 kB
        7. Screen Shot 2018-11-21 at 15.55.48.png
          130 kB


          Activity

            ben.huddleston Ben Huddleston added a comment -

            Tested using build 2384. This cache line is contentious.

              -------------------------------------------------------------
                  1     1209      839     1863        0            0x7cec80
              -------------------------------------------------------------
                      96.77%   97.74%    0.00%    0.00%                0x30            0x48fcbd       296       145       161     2625        40  [.] update_topkeys   memcached  unique_ptr.h:147    0{40  56.3%    n/a}  1{40  43.7%    n/a}
                       3.23%    2.26%  100.00%    0.00%                0x38            0x48e1c0       402       182       225     5269        40  [.] mcbp_add_header  memcached  atomic_base.h:514   0{40  63.8%  57.8%}  1{40  36.2%  42.2%}
            

            The issue is in Bucket.

            (gdb) pahole Bucket
            /* 179352     */ struct Bucket {
            /*   0   40 */    std::mutex mutex
            /*  40   48 */    std::condition_variable cond
            /*  88    4 */    unsigned int clients
            /*  92    1 */    std::atomic<Bucket::State> state
            /*  93    1 */    Bucket::Type type
            /*  94  101 */    char [101] name
            /* XXX 40 bit hole, try to pack */
            /* 200   96 */    std::array<std::vector<engine_event_handler, std::allocator<engine_event_handler> >, 4> engine_event_handlers
            /* 296   24 */    std::vector<thread_stats, std::allocator<thread_stats> > stats
            /* 320 176504 */    Timings timings
            /* 176824  672 */    TimingHistogram subjson_operation_times
            /* 177496    8 */    std::unique_ptr<TopKeys, std::default_delete<TopKeys> > topkeys
            /* 177504 1704 */    std::array<Couchbase::RelaxedAtomic<unsigned long>, 213> responseCounters
            /* 179208   64 */    ClusterConfiguration clusterConfiguration
            /* 179272    8 */    unsigned long max_document_size
            /* 179280   56 */    std::unordered_set<cb::engine::Feature, std::hash<cb::engine::Feature>, std::equal_to<cb::engine::Feature>, std::allocator<cb::engine::Feature> > supportedFeatures
            /* 179336    8 */    EngineIface * engine
            /* 179344    8 */    DcpIface * bucketDcp
                          }
            

            Not printing cache lines because timings is ludicrously big, but the topkeys unique_ptr shares a cache line with the responseCounters array. responseCounters is full of atomics used to track the number of responses of each status type; the first (and almost certainly the hottest) is SUCCESS. That counter shares a cache line with topkeys, and since topkeys is only ever read in this sample, this is false sharing. Will move topkeys to just after name, where it will instead share a cache line with the end of name and the start of engine_event_handlers, neither of which is written on the hot path.

            (gdb) pahole Bucket
            /* 179352     */ struct Bucket {
            /*   0   40 */    std::mutex mutex
            /*  40   48 */    std::condition_variable cond
            /*  88    4 */    unsigned int clients
            /*  92    1 */    std::atomic<Bucket::State> state
            /*  93    1 */    Bucket::Type type
            /*  94  101 */    char [101] name
            /* XXX 40 bit hole, try to pack */
            /* 200    8 */    std::unique_ptr<TopKeys, std::default_delete<TopKeys> > topkeys
            /* 208   96 */    std::array<std::vector<engine_event_handler, std::allocator<engine_event_handler> >, 4> engine_event_handlers
            /* 304   24 */    std::vector<thread_stats, std::allocator<thread_stats> > stats
            /* 328 176504 */    Timings timings
            /* 176832  672 */    TimingHistogram subjson_operation_times
            /* 177504 1704 */    std::array<Couchbase::RelaxedAtomic<unsigned long>, 213> responseCounters
            /* 179208   64 */    ClusterConfiguration clusterConfiguration
            /* 179272    8 */    unsigned long max_document_size
            /* 179280   56 */    std::unordered_set<cb::engine::Feature, std::hash<cb::engine::Feature>, std::equal_to<cb::engine::Feature>, std::allocator<cb::engine::Feature> > supportedFeatures
            /* 179336    8 */    EngineIface * engine
            /* 179344    8 */    DcpIface * bucketDcp
                          }
            


            ben.huddleston Ben Huddleston added a comment -

            I think the rwlock cache line is normal contention when trying to acquire the rwlock guarding the streams map in the DcpProducer. The only other option would be contention on one of the locks in a StreamContainer; however, in that case we should see a contentious cache line for each lock (one per StreamContainer, i.e. per active vb), and we don't. As a possible solution, we could "fix" this by removing the lock guarding the streams map. Each StreamContainer has its own rwlock that we could rely on instead, splitting the contention over a number of cache lines equal to the number of active vbs. This would require a bit of rework to allow a StreamContainer to exist empty, and we would have to pre-populate the streams map with an empty StreamContainer for every vb.

            drigby Dave Rigby added a comment -

            Ben Huddleston This MB is getting pretty long (and has diverged from the original problem). I suggest you resolve this one (assuming the DcpProducer layout change did fix the initial regression and it's just that we have new issues now) and open new MB(s) for each specific new issue.


            ben.huddleston Ben Huddleston added a comment -

            Created a few separate issues to track performance improvements I've identified while looking into the perf c2c data and the DcpProducer in general (MB-33151, MB-33153, MB-33157).

            The fix for this regression did bring performance back to the old value; however, by the time it was merged the "long pole" in the DcpProducer had shifted, so the fix no longer improves performance. Will fix the other issues mentioned and see if performance improves to previous values. MB-32389 is a bucket ticket for a general perf regression in kv_engine which may be of interest. Will resolve this issue as the regression has been fixed (the contentious cache line that was introduced is no longer present as of build 2202).


            ben.huddleston Ben Huddleston added a comment -

            Technically this regression is fixed: the symptom found in the perf c2c profile is no longer present. However, the long pole for perf in the write-heavy test has shifted since this regression was fixed, so the perf numbers have not improved.


            People

              Assignee: ben.huddleston Ben Huddleston
              Reporter: drigby Dave Rigby