  Couchbase Server
  MB-32107

4% throughput regression in KV : Pillowfight, 20/80 R/W, 256B binary items



    Description

      As reported by http://showfast.sc.couchbase.com/daily/#/history/KV%7CPillowfight,%2020/80%20R/W,%20256B%20binary%20items%7CMax%20Throughput%20(ops/sec) :

      Appears to have occurred between builds 6.5.0-1545 and 6.5.0-1561

      Build Throughput
      6.5.0-1561 1,916,444
      6.5.0-1545 2,009,289

      Attachments

        1. connhandlernew.txt
          4 kB
        2. connhandlerold.txt
          10 kB
        3. dcpconsumernew.txt
          27 kB
        4. dcpconsumerold.txt
          30 kB
        5. dcpproducernew.txt
          25 kB
        6. dcpproducerold.txt
          28 kB
        7. Screen Shot 2018-11-21 at 15.55.48.png
          130 kB

        Issue Links


          Activity

            drigby Dave Rigby added a comment -

            Changelog between 1545 and 1561 is pretty long - will need binary chopping to isolate: http://172.23.123.43:8000/getchangelog?product=couchbase-server&fromb=6.5.0-1545&tob=6.5.0-1561

            drigby Dave Rigby added a comment - - edited

            Ben Huddleston Could you take a look? Note you can trigger intermediate builds (assuming they exist) by re-running the "Logs" Jenkins link and modifying the build number (e.g. http://perf.jenkins.couchbase.com/job/triton/20192/)

            ben.huddleston Ben Huddleston added a comment - - edited

            I ran the test against most of the intermediate builds; a few of them failed the test, but we have more information. Unfortunately the stats are a bit noisy, with variations of at least 40k ops/s possible.

            From the latest set of runs:

            Build ops/s
            6.5.0-1561 1,954,656
            6.5.0-1560 1,946,356
            6.5.0-1559 1,951,419
            6.5.0-1558 1,938,901
            6.5.0-1553 1,995,609
            6.5.0-1552 1,997,253
            6.5.0-1551 1,979,985
            6.5.0-1549 2,012,679
            6.5.0-1545 2,009,289

            This would indicate the regression was likely introduced between builds 6.5.0-1553 and 6.5.0-1558. Unfortunately the builds in between fail the test. Changelog between the two builds - http://172.23.123.43:8000/getchangelog?product=couchbase-server&fromb=6.5.0-1553&tob=6.5.0-1558.

            Nothing stands out as an obvious performance regression here; many of the changes between these builds are for infrequently run commands that would not affect this test. I ran through the list with Dave Rigby and http://review.couchbase.org/#/c/101191/ looks like a potential candidate for such a regression, the theory being that this could be a cache alignment issue due to the change in size of the ConnHandler class. The ConnHandler class is abstract, so we should also look at the DcpConsumer and DcpProducer classes which inherit from it. See the attached files for the full class layouts: connhandlernew.txt, connhandlerold.txt, dcpconsumernew.txt, dcpconsumerold.txt, dcpproducernew.txt, dcpproducerold.txt.

            These layouts were dumped using the following command, after inserting a global variable that takes the size of the desired class. Not all of the includes may be necessary. They were dumped on macOS, not Linux.

            clang++ -Xclang -fdump-record-layouts -I ../../../../cmake-build-debug/kv_engine/engines/ep/src/ -I ../src -I ../../../utilities/ -I ../../../../cmake-build-debug/tlm/deps/boost.exploded/include/ -I ../../../../cmake-build-debug/tlm/deps/json.exploded/include/ -I ../../../../third_party/HdrHistogram_c/src/ -I ../../../include/ -I ../../../../cmake-build-debug/platform/include/ -I ../../../../third_party/gsl-lite/include/ -I ../../../../platform/include/ -I ../../../../cmake-build-debug/kv_engine/include/ -I ../../../../third_party/spdlognew/include/ -I ../../../ -I ../../../../phosphor/include/ -std=c++14 dcp/producer.h
            

            const static size_t size = sizeof(DcpProducer);
            

             

            Class        Old size (bytes)  New size (bytes)
            ConnHandler  176               88
            DcpConsumer  904               816
            DcpProducer  1280              1152

            Interestingly, I found that DcpProducer decreased in size by a different amount than ConnHandler and DcpConsumer (128 bytes vs 88 bytes). After some digging, I found that the cb::CachelinePadded<struct DcpProducer::CheckpointCreator> was aligned differently, which led me to MB-30488. Will try to investigate further using the tools used in MB-30488.
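
            For reference, a minimal sketch of what a cache-line-padded wrapper along the lines of cb::CachelinePadded typically looks like (an illustrative approximation, not the actual platform implementation; the 64-byte line size is an assumption). Over-aligning the wrapped object forces the compiler to insert padding in front of it up to the next 64-byte boundary, so changing the size of earlier members (here, ConnHandler) can change the amount of pre-padding and hence shrink the enclosing class by a different amount.

            // Illustrative approximation of a cache-line padded wrapper; not the
            // actual cb::CachelinePadded implementation. Assumes 64-byte cache lines.
            template <typename T>
            struct alignas(64) CachelinePaddedSketch {
                T value;
                // sizeof(CachelinePaddedSketch) is rounded up to a multiple of 64,
                // so 'value' never shares a cache line with neighbouring members.
                T* operator->() { return &value; }
                T& operator*() { return value; }
            };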

            ben.huddleston Ben Huddleston added a comment - - edited

            Spent a chunk of time looking at this without updating here. Let's start with some numbers.

             1553 ops/s 1558 ops/s
            1962776 1965001
            1956093 1885716
            1952747 1938901
            1979286 1935830
            2017269 1932842
            1975075 1896938
            1995609 1930222
            1952298 1906424
            Averages ========
            1,973,894.125 1,923,984.25

            So we have a drop of ~50k ops/s between 1553 and 1558 over the 8 runs performed. A refactor of ConnHandler was the suspect commit (http://review.couchbase.org/#/c/101191/). I created a toy build to test this out.

             1553 ops/s 1558 ops/s Suspect Commit ops/s
            1962776 1965001 1931517
            1956093 1885716 1931540
            1952747 1938901 1933941
            1979286 1935830 1959403
            2017269 1932842 1934353
            1975075 1896938
            1995609 1930222
            1952298 1906424
            Averages ======== ========
            1,973,894.125 1,923,984.25 1,938,150.8

            Only did 5 runs of this, but it looks like the regression is here. I found a fix after playing about with the layout of ConnHandler. I spotted that previously in the DcpProducer the enable_shared_from_this pointer shared a cache line with the paused-reason string but not the mutex; that state has since been rolled into an atomic which now sits in the same cache line as the start of the DcpProducer. I decided to start by separating these. There are a couple of ways of doing this: you can dump the layout of a class using clang or gcc if you modify the class a little, but you then have to work out how it falls within cache lines. I looked into pahole, which doesn't work easily on mancouch with our binaries. pahole-gdb works out of the box and is easy to add to gdb. It's also just a Python script, so I forked it and modified it to print the cache line boundaries for me. You can find it at https://github.com/BenHuddleston/pahole-gdb/blob/master/pahole.py.

            ConnHandler before

            (gdb) pahole ConnHandler -c
            Cache lines are printed every 64 bytes and are a guidance only that are subject to natural alignment of the object.
            /*   96     */ struct ConnHandler {
            /*   0    8 */    int (**)(void) _vptr.ConnHandler
            /*   8    8 */    EventuallyPersistentEngine & engine_
            /*  16    8 */    EPStats & stats
            /*  24   16 */    std::shared_ptr<BucketLogger> logger
            /*  40   32 */    std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > name
            ========================
            /*  64   32 */    std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > name
            /*  72    8 */    std::atomic<void*> cookie
            /*  80    1 */    std::atomic<bool> reserved
            /* XXX 24 bit hole, try to pack */
            /*  84    4 */    std::atomic<unsigned int> created
            /*  88    1 */    std::atomic<bool> disconnect
            /*  89    1 */    std::atomic<bool> supportAck
            /*  90    1 */    std::atomic<bool> paused
            /* XXX 8 bit hole, try to pack */
            /*  92    4 */    std::atomic<ConnHandler::PausedReason> reason
                          }
            

            DcpProducer before

            (gdb) pahole DcpProducer -c
            Cache lines are printed every 64 bytes and are a guidance only that are subject to natural alignment of the object.
            /*  896     */ struct DcpProducer {
            /*   0   96 */    ConnHandler ConnHandler
            ========================
            /*  64   96 */    ConnHandler ConnHandler
            /*  96   16 */    std::enable_shared_from_this<DcpProducer> std::enable_shared_from_this<DcpProducer>
            /* 112   24 */    struct {...} noopCtx
            ========================
            /* 128   24 */    struct {...} noopCtx
            /* 136    4 */    Couchbase::RelaxedAtomic<unsigned int> lastReceiveTime
            /* XXX 32 bit hole, try to pack */
            /* 144    8 */    std::unique_ptr<DcpResponse, std::default_delete<DcpResponse> > rejectResp
            /* 152    1 */    bool notifyOnly
            /* 153    1 */    Couchbase::RelaxedAtomic<bool> enableExtMetaData
            /* 154    1 */    Couchbase::RelaxedAtomic<bool> forceValueCompression
            /* 155    1 */    Couchbase::RelaxedAtomic<bool> supportsCursorDropping
            /* 156    1 */    Couchbase::RelaxedAtomic<bool> sendStreamEndOnClientStreamClose
            /* 157    1 */    Couchbase::RelaxedAtomic<bool> supportsHifiMFU
            /* 158    1 */    Couchbase::RelaxedAtomic<bool> enableExpiryOpcode
            /* XXX 8 bit hole, try to pack */
            /* 160    4 */    Couchbase::RelaxedAtomic<unsigned int> lastSendTime
            /* XXX 32 bit hole, try to pack */
            /* 168   88 */    DcpProducer::BufferLog log
            ========================
            /* 192   88 */    DcpProducer::BufferLog log
            ========================
            /* 256   16 */    std::shared_ptr<BackfillManager> backfillMgr
            /* 272  176 */    DcpReadyQueue ready
            ========================
            /* 320  176 */    DcpReadyQueue ready
            ========================
            /* 384  176 */    DcpReadyQueue ready
            ========================
            /* 448  112 */    AtomicUnorderedMap<Vbid, std::shared_ptr<Stream>, std::hash<Vbid>, std::equal_to<Vbid>, std::allocator<std::pair<Vbid const, std::shared_ptr<Stream> > > > streams
            ========================
            /* 512  112 */    AtomicUnorderedMap<Vbid, std::shared_ptr<Stream>, std::hash<Vbid>, std::equal_to<Vbid>, std::allocator<std::pair<Vbid const, std::shared_ptr<Stream> > > > streams
            /* 560    8 */    std::atomic<unsigned long> itemsSent
            /* 568    8 */    std::atomic<unsigned long> totalBytesSent
            ========================
            /* 576    8 */    std::atomic<unsigned long> totalUncompressedDataSize
            /* XXX 448 bit hole, try to pack */
            /* 640  128 */    cb::CachelinePadded<DcpProducer::CheckpointCreator> checkpointCreator
            ========================
            /* 704  128 */    cb::CachelinePadded<DcpProducer::CheckpointCreator> checkpointCreator
            ========================
            /* 768    1 */    IncludeValue includeValue
            /* 769    1 */    IncludeXattrs includeXattrs
            /* 770    1 */    IncludeDeleteTime includeDeleteTime
            /* 771    1 */    const bool createChkPtProcessorTsk
                          }
            

            ConnHandler is a different size here than on macOS. I'm going to pad ConnHandler out to the end of the cache line it sits in; see the sketch below. Note the hole in ConnHandler - this can be filled by moving the reserved member.
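
            A rough sketch of the intended layout change (member list heavily simplified and names hypothetical; the offsets in the comments refer to the full class in the pahole dump above, not to this fragment):

            #include <atomic>
            #include <cstdint>

            class ConnHandlerPaddingSketch {
                // ... vptr, engine/stats references, logger and name occupy bytes 0-71 ...
                std::atomic<void*> cookie{nullptr};   // offset 72 in the real class
                std::atomic<uint32_t> created{0};     // offset 80
                std::atomic<bool> reserved{false};    // offset 84: moved into the former hole
                std::atomic<bool> disconnect{false};
                std::atomic<bool> supportAck{false};
                std::atomic<bool> paused{false};
                std::atomic<int> reason{0};           // offset 88
                uint8_t padding[36];                  // bytes 92-127: pad out to the cache-line end
            };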
            ConnHandler after

            (gdb) pahole ConnHandler -c
            Cache lines are printed every 64 bytes and are a guidance only that are subject to natural alignment of the object.
            /*  128     */ struct ConnHandler {
            /*   0    8 */    int (**)(void) _vptr.ConnHandler
            /*   8    8 */    EventuallyPersistentEngine & engine_
            /*  16    8 */    EPStats & stats
            /*  24   16 */    std::shared_ptr<BucketLogger> logger
            /*  40   32 */    std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > name
            ========================
            /*  64   32 */    std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > name
            /*  72    8 */    std::atomic<void*> cookie
            /*  80    4 */    std::atomic<unsigned int> created
            /*  84    1 */    std::atomic<bool> reserved
            /*  85    1 */    std::atomic<bool> disconnect
            /*  86    1 */    std::atomic<bool> supportAck
            /*  87    1 */    std::atomic<bool> paused
            /*  88    4 */    std::atomic<ConnHandler::PausedReason> reason
            /*  92   36 */    uint8_t [36] padding
            ========================
                          }
            

            DcpProducer after

            (gdb) pahole DcpProducer -c
            Cache lines are printed every 64 bytes and are a guidance only that are subject to natural alignment of the object.
            /*  896     */ struct DcpProducer {
            /*   0  128 */    ConnHandler ConnHandler
            ========================
            /*  64  128 */    ConnHandler ConnHandler
            ========================
            /* 128   16 */    std::enable_shared_from_this<DcpProducer> std::enable_shared_from_this<DcpProducer>
            /* 144   24 */    struct {...} noopCtx
            /* 168    4 */    Couchbase::RelaxedAtomic<unsigned int> lastReceiveTime
            /* XXX 32 bit hole, try to pack */
            /* 176    8 */    std::unique_ptr<DcpResponse, std::default_delete<DcpResponse> > rejectResp
            /* 184    1 */    bool notifyOnly
            /* 185    1 */    Couchbase::RelaxedAtomic<bool> enableExtMetaData
            /* 186    1 */    Couchbase::RelaxedAtomic<bool> forceValueCompression
            /* 187    1 */    Couchbase::RelaxedAtomic<bool> supportsCursorDropping
            /* 188    1 */    Couchbase::RelaxedAtomic<bool> sendStreamEndOnClientStreamClose
            /* 189    1 */    Couchbase::RelaxedAtomic<bool> supportsHifiMFU
            /* 190    1 */    Couchbase::RelaxedAtomic<bool> enableExpiryOpcode
            /* XXX 8 bit hole, try to pack */
            /* 192    4 */    Couchbase::RelaxedAtomic<unsigned int> lastSendTime
            /* XXX 32 bit hole, try to pack */
            /* 200   88 */    DcpProducer::BufferLog log
            ========================
            /* 256   88 */    DcpProducer::BufferLog log
            /* 288   16 */    std::shared_ptr<BackfillManager> backfillMgr
            /* 304  176 */    DcpReadyQueue ready
            ========================
            /* 320  176 */    DcpReadyQueue ready
            ========================
            /* 384  176 */    DcpReadyQueue ready
            ========================
            /* 448  176 */    DcpReadyQueue ready
            /* 480  112 */    AtomicUnorderedMap<Vbid, std::shared_ptr<Stream>, std::hash<Vbid>, std::equal_to<Vbid>, std::allocator<std::pair<Vbid const, std::shared_ptr<Stream> > > > streams
            ========================
            /* 512  112 */    AtomicUnorderedMap<Vbid, std::shared_ptr<Stream>, std::hash<Vbid>, std::equal_to<Vbid>, std::allocator<std::pair<Vbid const, std::shared_ptr<Stream> > > > streams
            ========================
            /* 576  112 */    AtomicUnorderedMap<Vbid, std::shared_ptr<Stream>, std::hash<Vbid>, std::equal_to<Vbid>, std::allocator<std::pair<Vbid const, std::shared_ptr<Stream> > > > streams
            /* 592    8 */    std::atomic<unsigned long> itemsSent
            /* 600    8 */    std::atomic<unsigned long> totalBytesSent
            /* 608    8 */    std::atomic<unsigned long> totalUncompressedDataSize
            /* XXX 192 bit hole, try to pack */
            /* 640  128 */    cb::CachelinePadded<DcpProducer::CheckpointCreator> checkpointCreator
            ========================
            /* 704  128 */    cb::CachelinePadded<DcpProducer::CheckpointCreator> checkpointCreator
            ========================
            /* 768    1 */    IncludeValue includeValue
            /* 769    1 */    IncludeXattrs includeXattrs
            /* 770    1 */    IncludeDeleteTime includeDeleteTime
            /* 771    1 */    const bool createChkPtProcessorTsk
                          }
            

            Results from testing this (http://review.couchbase.org/#/c/103899/)

             1553 ops/s 1558 ops/s Suspect Commit ops/s Padded ops/s
            1962776 1965001 1931517 1967325
            1956093 1885716 1931540 1961733
            1952747 1938901 1933941 1982247
            1979286 1935830 1959403 1996264
            2017269 1932842 1934353 1991822
            1975075 1896938   1960995
            1995609 1930222   2005140
            1952298 1906424   1998894
            Averages ======== ======== ========
            1,973,894.125 1,923,984.25 1,938,150.8 1,983,052.5

            I also tried padding holes in the DcpProducer and separating the DcpReadyQueue and StreamsMap (http://review.couchbase.org/#/c/103971/3/engines/ep/src/dcp/producer.h), but this regressed performance.

            paolo.cocchi Paolo Cocchi added a comment - - edited

            Ben Huddleston Nice!

            I was looking for the hot-path that could cause the regression, and this could be the one:

            void DcpProducer::scheduleNotify() {
                engine_.getDcpConnMap().addConnectionToPending(shared_from_this());
            }
            

            At regression, when we write into 'std::atomic<ConnHandler::PausedReason> reason' the cache-line is invalidated. Then the call to shared_from_this() goes to memory.
            I believe that here the invalidation happens even when we just read 'reason', given that it is std::atomic. So it happens quite often.

            As you said in your comment above:

            Spotted that previously in the DcpProducer the shared from this ptr was sharing a cache line with the paused reason string but not the mutex

            Now I see that with your change you have even improved compared to pre-regression. That's probably because now there's no sharing at all.

            drigby Dave Rigby added a comment -

            At regression, when we write into 'std::atomic<ConnHandler::PausedReason> reason' the cache-line is invalidated. Then the call to shared_from_this() goes to memory.
            I believe that here the invalidation happens even when we just read 'reason', given that it is std::atomic. So it happens quite often.

            That's not the case - a cacheline is only marked dirty when it is modified - even for atomic ops. Note it's also perfectly possible for the same cacheline to reside in the clean state in different CPU caches.
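
            To illustrate (a standalone toy benchmark, not from the codebase, and assuming typical MESI-style coherence): concurrent loads of a std::atomic keep the line in the shared state and stay cheap; it is stores that invalidate other cores' copies and produce the HITM events perf c2c reports.

            #include <atomic>
            #include <chrono>
            #include <cstdio>
            #include <thread>
            #include <vector>

            int main() {
                std::atomic<int> value{0};
                constexpr int iterations = 50000000;

                auto worker = [&](bool writer) {
                    for (int i = 0; i < iterations; ++i) {
                        if (writer) {
                            value.store(i, std::memory_order_relaxed);   // dirties the cache line
                        } else {
                            (void)value.load(std::memory_order_relaxed); // line stays clean/shared
                        }
                    }
                };

                for (bool withWriter : {false, true}) {
                    auto start = std::chrono::steady_clock::now();
                    std::vector<std::thread> threads;
                    for (int t = 0; t < 4; ++t) {
                        threads.emplace_back(worker, withWriter && t == 0);
                    }
                    for (auto& t : threads) {
                        t.join();
                    }
                    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                                      std::chrono::steady_clock::now() - start).count();
                    std::printf("%s: %lld ms\n",
                                withWriter ? "3 readers + 1 writer" : "4 readers",
                                static_cast<long long>(ms));
                }
                return 0;
            }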


            ben.huddleston Ben Huddleston added a comment -

            Looks like the cause of this one is the moving of the StreamsMap in the DcpProducer.

            From build 1553 + the suspect commit we have the following perf c2c data, re-ordered so that it follows the in-memory layout (these two cache lines are contiguous).

            #
            #        ----- HITM -----  -- Store Refs --        Data address                      ---------- cycles ----------    Total       cpu                                                                                   Shared
            #   Num      Rmt      Lcl   L1 Hit  L1 Miss              Offset        Code address  rmt hitm  lcl hitm      load  records       cnt                                                                        Symbol     Object            Source:Line  Node{cpus %hitms %stores}
            # .....  .......  .......  .......  .......  ..................  ..................  ........  ........  ........  .......  ........  ............................................................................  .........  .....................  ....
            #
              -------------------------------------------------------------
                  4      186      208     4098       54      0x7f3f9b378a40
              -------------------------------------------------------------
                       4.30%    8.65%    0.00%    0.00%                 0x0      0x7f3f702095c5       291       174       165      621        40  [.] DcpProducer::findStream  ep.so               hashtable.h:1541   0{40  53.8%    n/a}  1{40  46.2%    n/a}
                      37.10%   43.27%    0.00%    0.00%                 0x8      0x7f3f702095b1       295       187       172      928        40  [.] DcpProducer::findStream  ep.so               rwlock.h:66        0{40  66.7%    n/a}  1{40  33.3%    n/a}
                      13.98%   12.02%   41.95%   77.78%                0x38      0x7f3f9d5fb5a7       975       672       461     2860        40  [.] pthread_rwlock_unlock    libpthread-2.17.so  .:0                0{40  58.8%  62.6%}  1{40  41.2%  37.4%}
                      12.90%    8.65%   10.32%    0.00%                0x38      0x7f3f9d5fb600       953       423       371     1343        40  [.] pthread_rwlock_unlock    libpthread-2.17.so  .:0                0{40  47.6%  62.6%}  1{40  52.4%  37.4%}
                       9.14%    4.81%    9.44%    0.00%                0x38      0x7f3f9d5fb0b1       781       443       260     1540        40  [.] pthread_rwlock_rdlock    libpthread-2.17.so  .:0                0{40  77.8%  66.4%}  1{40  22.2%  33.6%}
                       3.76%    4.81%   33.19%    1.85%                0x38      0x7f3f9d5fb05b       632       389       243     2597        40  [.] pthread_rwlock_rdlock    libpthread-2.17.so  .:0                0{40  52.9%  60.8%}  1{40  47.1%  39.2%}
                       1.61%    2.40%    3.07%    0.00%                0x38      0x7f3f9d5fe42f       906       526       233      401        40  [.] __lll_lock_wait          libpthread-2.17.so  .:0                0{40  37.5%  57.1%}  1{40  62.5%  42.9%}
                       0.00%    0.00%    0.02%    0.00%                0x38      0x7f3f9d5fe4b2         0         0         0        1         1  [.] __lll_unlock_wake        libpthread-2.17.so  .:0                                     1{ 1    n/a 100.0%}
                       9.14%    8.17%    1.10%   16.67%                0x3c      0x7f3f9d5fb5b3       436       174       187      273        40  [.] pthread_rwlock_unlock    libpthread-2.17.so  .:0                0{40  61.8%  63.0%}  1{40  38.2%  37.0%}
                       8.06%    7.21%    0.90%    3.70%                0x3c      0x7f3f9d5fb0ac       377       137       194     1077        40  [.] pthread_rwlock_rdlock    libpthread-2.17.so  .:0                0{40  73.3%  61.5%}  1{40  26.7%  38.5%}
            -------------------------------------------------------------
                  2      396      259       71       14      0x7f3f9b378a80
              -------------------------------------------------------------
                       0.25%    0.00%    0.00%    0.00%                 0x8      0x7f3f9d5fb5ce       238         0       185       32        22  [.] pthread_rwlock_unlock  libpthread-2.17.so  .:0                 0{22 100.0%    n/a}  1{22   0.0%    n/a}
                       6.31%    3.47%    0.00%    0.00%                 0xc      0x7f3f9d5fb068       304       128       179      968        40  [.] pthread_rwlock_rdlock  libpthread-2.17.so  .:0                 0{40  61.8%    n/a}  1{40  38.2%    n/a}
                      88.13%   93.82%    0.00%    0.00%                0x10      0x7f3f9d5fb061       284       121       119      884        40  [.] pthread_rwlock_rdlock  libpthread-2.17.so  .:0                 0{40  55.2%    n/a}  1{40  44.8%    n/a}
                       3.03%    1.54%    0.00%    0.00%                0x10      0x7f3f9d5fb5ad       277       144       128      117        39  [.] pthread_rwlock_unlock  libpthread-2.17.so  .:0                 0{39  43.8%    n/a}  1{39  56.2%    n/a}
                       0.00%    0.00%    2.82%  100.00%                0x10      0x7f3f9d5fb5b8         0         0         0       16        14  [.] pthread_rwlock_unlock  libpthread-2.17.so  .:0                 0{14    n/a  56.2%}  1{14    n/a  43.8%}
                       1.26%    1.16%    0.00%    0.00%                0x14      0x7f3f9d5fb0ba       416       370       179       29        18  [.] pthread_rwlock_rdlock  libpthread-2.17.so  .:0                 0{18  50.0%    n/a}  1{18  50.0%    n/a}
                       0.76%    0.00%    0.00%    0.00%                0x14      0x7f3f9d5fb612       240         0       246       20        16  [.] pthread_rwlock_unlock  libpthread-2.17.so  .:0                 0{16  66.7%    n/a}  1{16  33.3%    n/a}
                       0.00%    0.00%   43.66%    0.00%                0x30      0x7f3f7020d4ae         0         0       337       73        27  [.] DcpProducer::step      ep.so               atomic_base.h:514   0{27    n/a  45.2%}  1{27    n/a  54.8%}
                       0.25%    0.00%   53.52%    0.00%                0x38      0x7f3f7020d4c3      1003         0       299       86        27  [.] DcpProducer::step      ep.so               atomic_base.h:514   0{27 100.0%  78.9%}  1{27   0.0%  21.1%}
            

            From the first cache line we can see that the hash table in the unordered map of the StreamsMap is sharing with the start of the mutex that locks access to it. All the reads from the hash table (first two columns) are contending with reads and writes to the mutex; this is false sharing. In the second cache line, the mutex members (mainly read) contend with writes to the atomic variables that follow (itemsSent and totalBytesSent) - again, false sharing. To fix this, we can wrap the mutex in the streams map with some padding (sketched below). This gave an average over 9 runs of 1,972,524 ops/s, within 2k of build 1553 (the last good build), i.e. back to pre-regression values.
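
            A minimal sketch of the shape of that fix (hypothetical names, not the actual kv_engine change; assumes 64-byte cache lines as in the pahole output above): give the lock its own cache line so hash-table reads and the neighbouring atomic counters no longer false-share with the lock words.

            #include <atomic>
            #include <cstdint>
            #include <shared_mutex>
            #include <unordered_map>

            struct alignas(64) PaddedSharedMutex {
                std::shared_timed_mutex mutex;
                // alignas(64) rounds sizeof() up to a multiple of 64, so the trailing
                // bytes act as padding and nothing else can share this cache line.
            };

            struct StreamsMapSketch {
                std::unordered_map<uint16_t, int> map; // stands in for the vbid->stream table
                PaddedSharedMutex lock;                // no longer shares a line with the buckets
            };

            struct ProducerCountersSketch {
                std::atomic<uint64_t> itemsSent{0};
                std::atomic<uint64_t> totalBytesSent{0};
                // With the padded lock above, writes to these counters no longer
                // invalidate the line holding the lock's internal state.
            };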

            In looking at this regression, the following cache line was of interest.

            #
            #        ----- HITM -----  -- Store Refs --        Data address                      ---------- cycles ----------    Total       cpu                                                                                   Shared
            #   Num      Rmt      Lcl   L1 Hit  L1 Miss              Offset        Code address  rmt hitm  lcl hitm      load  records       cnt                                                                        Symbol     Object            Source:Line  Node{cpus %hitms %stores}
            # .....  .......  .......  .......  .......  ..................  ..................  ........  ........  ........  .......  ........  ............................................................................  .........  .....................  ....
            #
              -------------------------------------------------------------
                  0      467      473     1377        0      0x7f3f9b378800
              -------------------------------------------------------------
                      95.07%   91.54%    0.00%    0.00%                 0x8      0x7f3f701d9a02       274       127       170      934        40  [.] ActiveStream::notifyStreamReady                                           ep.so      shared_ptr_base.h:202   0{40  57.2%    n/a}  1{40  42.8%    n/a}
                       1.50%    2.96%   47.57%    0.00%                 0x8      0x7f3f701d9a10       707       362       277     1672        40  [.] ActiveStream::notifyStreamReady                                           ep.so      shared_ptr_base.h:294   0{40  61.9%  60.8%}  1{40  38.1%  39.2%}
                       1.71%    1.90%   49.60%    0.00%                 0x8            0x4282be       972       570       286     1689        40  [.] std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release             memcached  shared_ptr_base.h:151   0{40  70.6%  62.1%}  1{40  29.4%  37.9%}
                       1.28%    1.48%    0.00%    0.00%                 0x8      0x7f3f701d95a1       266       111       206       29        19  [.] std::__shared_ptr<DcpProducer, (__gnu_cxx::_Lock_policy)2>::__shared_ptr  ep.so      shared_ptr_base.h:202   0{19  69.2%    n/a}  1{19  30.8%    n/a}
                       0.43%    1.48%    0.00%    0.00%                 0x8      0x7f3f701d9a16       382       205       285     1092        40  [.] ActiveStream::notifyStreamReady                                           ep.so      shared_ptr_base.h:202   0{40  66.7%    n/a}  1{40  33.3%    n/a}
                       0.00%    0.42%    0.00%    0.00%                 0x8      0x7f3f701d95ba         0       487       217       44        24  [.] std::__shared_ptr<DcpProducer, (__gnu_cxx::_Lock_policy)2>::__shared_ptr  ep.so      shared_ptr_base.h:202   0{24 100.0%    n/a}  1{24   0.0%    n/a}
                       0.00%    0.21%    2.83%    0.00%                 0x8      0x7f3f701d95ab         0       180       200       77        29  [.] std::__shared_ptr<DcpProducer, (__gnu_cxx::_Lock_policy)2>::__shared_ptr  ep.so      shared_ptr_base.h:294   0{29 100.0%  79.5%}  1{29   0.0%  20.5%}
            

            This cache line features heavily in every perf c2c profile (even in the good builds). The contention on this cache line is coming from the shared pointer control block of a DcpProducer. Every ActiveStream will attempt to call notifyStreamReady when it has more items to send, which will attempt to lock its weak pointer to the DcpProducer, which increments the shared ref count in the shared pointer control block. The weak pointer in ActiveStream was converted from a shared pointer to prevent a memory leak if the ActiveStream was not released by a DcpProducer (they own references to each other). Will be looking into reverting this change.


            ben.huddleston Ben Huddleston added a comment -

            Jim Walker spotted that we make a check in ActiveStream::notifyStreamReady after grabbing the shared pointer to the DcpProducer. We check whether there are any items ready to send and, if not, do nothing, so grabbing a shared pointer to the DcpProducer is pointless in this case. Moving the locking of the ActiveStream's weak pointer inside that condition (http://review.couchbase.org/#/c/104255/1/engines/ep/src/dcp/active_stream.cc), I ran another 8 perf tests. Results are as follows.

            1553 | 1553 + Regression + Regression Fix | 1553 + Regression + Regression Fix + Swapping condition order
            1962776 1978871 1991917
            1956093 1977653 1978745
            1952747 1945110 2016141
            1979286 1997571 1965412
            2017269 1951145 1967976
            1975075 1964446 2010578
            1995609 1988631 1985555
            1952298 1969411 1956033
              1979882  
            Averages ======== =======
            1973894.125 1972524.444 1984044.625

            The number of runs is relatively low, so I wouldn't trust these averages with 100% certainty, but this change does appear to increase performance very slightly. I took some more perf c2c samples; this cache line previously featured in the top 3 most contended in every sample, and it now does not feature at all, so there is no longer any contention on this cache line.
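
            In rough pseudo-form, the change looks like the following (hypothetical member names; the real ActiveStream code differs). The point is that the cheap items-ready check happens before the weak_ptr::lock(), so the idle path never touches the shared_ptr control block's atomic refcount.

            #include <memory>

            struct ProducerSketch {
                void notifyStreamReady() {}
            };

            struct ActiveStreamSketch {
                std::weak_ptr<ProducerSketch> producerPtr;
                bool itemsReady = false;

                void notifyStreamReady() {
                    // Before: lock first, then check - every call bumps the shared
                    // refcount, even when there is nothing to send.
                    //   auto producer = producerPtr.lock();
                    //   if (producer && itemsReady) { producer->notifyStreamReady(); }

                    // After: check the cheap flag first and only acquire the
                    // shared_ptr when there really are items to notify about.
                    if (itemsReady) {
                        if (auto producer = producerPtr.lock()) {
                            producer->notifyStreamReady();
                        }
                    }
                }
            };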

            Some other potential inefficiencies were spotted in the hot path in ActiveStream, so I'll investigate a few more of those while this is fresh in the mind.


            build-team Couchbase Build Team added a comment -

            Build couchbase-server-6.5.0-2202 contains kv_engine commit 30b4908 with commit message:
            MB-32107: Reduce false sharing of cache lines in DcpProducer


            build-team Couchbase Build Team added a comment -

            Build couchbase-server-6.5.0-2207 contains kv_engine commit 154d35e with commit message:
            MB-32107: Reduce cache line contention by moving DcpProducer acquisition



            ben.huddleston Ben Huddleston added a comment -

            One other possible simplification spotted was to remove the recursive call to ActiveStream::processItems via ActiveStream::nextCheckpointItemTask (http://review.couchbase.org/#/c/104256/). The recursive call is only made if the task can't find any items on the current iteration. It seems wasteful to make multiple function calls just to check whether there are any more items to process, instead of waiting for the next run of the task.
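
            Roughly, the shape of the change is the following (hypothetical structure and names, not the actual ActiveStream code): when the current batch is empty, just return and let the task be scheduled again when more items arrive, instead of recursing back through nextCheckpointItemTask.

            #include <deque>

            struct CheckpointTaskSketch {
                std::deque<int> items;

                void nextCheckpointItemTask() {
                    auto batch = getOutstandingItems();
                    processItems(batch);
                }

                std::deque<int> getOutstandingItems() {
                    auto batch = std::move(items);
                    items.clear();
                    return batch;
                }

                void processItems(std::deque<int>& batch) {
                    if (batch.empty()) {
                        // Before: recurse via nextCheckpointItemTask() to immediately
                        // re-check for new items.
                        //   return nextCheckpointItemTask();
                        // After: just return; the task runs again when it is next
                        // scheduled, saving the extra round of calls.
                        return;
                    }
                    // ... send the batch to the stream's ready queue ...
                }
            };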

            1553 | 1553 + Regression + Regression Fix | 1553 + Regression + Regression Fix + Removing recursive call
            1962776 1978871 1970412
            1956093 1977653 1996668
            1952747 1945110 1994747
            1979286 1997571 1956121
            2017269 1951145 1983625
            1975075 1964446 1966830
            1995609 1988631 1994766
            1952298 1969411 1969566
              1979882  
            Averages ======== ========
            1973894.125 1972524.444 1979091.88

            Given this is just an average over a relatively low number of runs, I wouldn't expect this to give any noticeable increase in ops/s; however, it doesn't look to have regressed performance, and it makes the code marginally simpler.

            ben.huddleston Ben Huddleston added a comment - One other possible simplification spotted was to remove the recursive call to ActiveStream::processItems via ActiveStream::nextCheckpointItemTask ( http://review.couchbase.org/#/c/104256/ ). This is only called if it can't find any items on this current iteration of the task. It seems wasteful to make multiple function calls to just check if there are any more items to process instead of waiting for the next task to run. 1553 1553 + Regression + Regression Fix 1553 + Regression + Regression Fix + Removing recursive call 1962776 1978871 1970412 1956093 1977653 1996668 1952747 1945110 1994747 1979286 1997571 1956121 2017269 1951145 1983625 1975075 1964446 1966830 1995609 1988631 1994766 1952298 1969411 1969566   1979882   Averages ======== ======== 1973894.125 1972524.444 1979091.88 Given this is just an average for a relatively low number of runs I wouldn't expect this to actually given any noticeable increase in ops/s, however it doesn't look to have regressed performance and it makes the code marginally simpler.

            Build couchbase-server-6.5.0-2229 contains kv_engine commit 13716fa with commit message:
            MB-32107: Remove recursive nextCheckpointItemTask call in ActiveStream

            build-team Couchbase Build Team added a comment - Build couchbase-server-6.5.0-2229 contains kv_engine commit 13716fa with commit message: MB-32107 : Remove recursive nextCheckpointItemTask call in ActiveStream

            Bit of a brain dump.

            Would have expected build 2202 to improve perf; however, after running it repeatedly it does not. The fix actually seems to make things worse. Unfortunately the previous ~20 builds were broken, so getting perf numbers is becoming a bit of a pain.

            2202 | 2202 + a revert of the regression fix
              1840475
              1830449
            1820631 1803297
            1811079 1833983
            1832017 1837549
            1822782 1839338
            1825298 1834332
            1807678 1814464
            1842862 1851247
            1814444 1843619
            1829581 1839287
            Averages ========
            1822930.222 1833458.182

            Profiled build 2202 + the revert using perf c2c again. This cache line is now pretty hot; it's another producer shared_ptr acquisition.

              -------------------------------------------------------------
                  0      392      408     1173        0      0x7fde45841d00
              -------------------------------------------------------------
                      94.39%   88.24%    0.00%    0.00%                 0x8      0x7fde49b93e22       264       115       151      801        40  [.] ActiveStream::isCompressionEnabled                                                           ep.so
                       1.02%    6.62%   45.52%    0.00%                 0x8      0x7fde49b93e30       971       395       271     1434        40  [.] ActiveStream::isCompressionEnabled                                                           ep.so
                       2.55%    2.21%    0.00%    0.00%                 0x8      0x7fde49b93981       274       128       180       26        20  [.] fmt::v5::internal::format_decimal<unsigned long, char, fmt::v5::internal::no_thousands_sep>  ep.so
                       2.04%    2.21%   51.49%    0.00%                 0x8            0x4299fe       733       486       326     1576        40  [.] 0x00000000000299fe                                                                           memcached
                       0.00%    0.49%    0.00%    0.00%                 0x8      0x7fde49b93e36         0       442       281      929        40  [.] ActiveStream::isCompressionEnabled                                                           ep.so
                       0.00%    0.25%    2.98%    0.00%                 0x8      0x7fde49b9398b         0       282       159       64        29  [.] fmt::v5::internal::format_decimal<unsigned long, char, fmt::v5::internal::no_thousands_sep>  ep.so
            

            I don't think I've seen this cache line in a profile before, so it may be of note.

             -------------------------------------------------------------
                  2      394      252      343        0            0x7cfc80
              -------------------------------------------------------------
                      96.70%   97.62%    0.00%    0.00%                0x30            0x48f23d       296       149       160      814        40  [.] 0x000000000008f23d  memcached         mcbp_executors.cc:479   0{40  58.7%    n/a}  1{40  41.3%    n/a}
                       3.30%    2.38%  100.00%    0.00%                0x38            0x48d7a0       437       337       237     1579        40  [.] mcbp_add_header     memcached         mcbp.cc:82              0{40  63.2%  63.3%}  1{40  36.8%  36.7%}
            

            Also interested in this cache line. Not entirely sure what it's hitting. 

              -------------------------------------------------------------
                  4      105      123     3132       41      0x7fde45841fc0
              -------------------------------------------------------------
                      16.19%   29.27%   45.88%   78.05%                 0x0      0x7fde4f2d15a7       937       633       441     2432        40  [.] pthread_rwlock_unlock  libpthread-2.17.so  .:0               0{40  75.5%  64.5%}  1{40  24.5%  35.5%}
                      11.43%   12.20%   10.38%    0.00%                 0x0      0x7fde4f2d10b1       688       399       258     1294        40  [.] pthread_rwlock_rdlock  libpthread-2.17.so  .:0               0{40  51.9%  67.4%}  1{40  48.1%  32.6%}
                      10.48%    8.13%   10.60%    0.00%                 0x0      0x7fde4f2d1600       660       403       377     1137        40  [.] pthread_rwlock_unlock  libpthread-2.17.so  .:0               0{40  57.1%  61.1%}  1{40  42.9%  38.9%}
                       5.71%    2.44%   28.10%    2.44%                 0x0      0x7fde4f2d105b       897       469       240     1917        40  [.] pthread_rwlock_rdlock  libpthread-2.17.so  .:0               0{40  55.6%  51.0%}  1{40  44.4%  49.0%}
                       3.81%    3.25%    2.30%    0.00%                 0x0      0x7fde4f2d442f       403       608       244      265        40  [.] __lll_lock_wait        libpthread-2.17.so  .:0               0{40  62.5%  59.7%}  1{40  37.5%  40.3%}
                       2.86%    0.81%    1.21%   14.63%                 0x4      0x7fde4f2d15b3       388       119       181      275        40  [.] pthread_rwlock_unlock  libpthread-2.17.so  .:0               0{40  50.0%  68.2%}  1{40  50.0%  31.8%}
                       0.95%    1.63%    0.32%    4.88%                 0x4      0x7fde4f2d10ac       254       556       188      946        40  [.] pthread_rwlock_rdlock  libpthread-2.17.so  .:0               0{40  66.7%  58.3%}  1{40  33.3%  41.7%}
                       0.00%    2.44%    0.00%    0.00%                0x14      0x7fde4f2d1068         0       177       196      920        40  [.] pthread_rwlock_rdlock  libpthread-2.17.so  .:0               0{40  66.7%    n/a}  1{40  33.3%    n/a}
                      19.05%   10.57%    0.00%    0.00%                0x18      0x7fde4f2d15ad       512       127       176      226        39  [.] pthread_rwlock_unlock  libpthread-2.17.so  .:0               0{39  57.6%    n/a}  1{39  42.4%    n/a}
                      11.43%    6.50%    0.00%    0.00%                0x18      0x7fde4f2d1061       308       126       191      954        40  [.] pthread_rwlock_rdlock  libpthread-2.17.so  .:0               0{40  55.0%    n/a}  1{40  45.0%    n/a}
                       0.00%    0.00%    0.29%    0.00%                0x18      0x7fde4f2d15b8         0         0         0        9         7  [.] pthread_rwlock_unlock  libpthread-2.17.so  .:0               0{ 7    n/a  33.3%}  1{ 7    n/a  66.7%}
                       6.67%    8.13%    0.00%    0.00%                0x1c      0x7fde4f2d10ba       396       216       332       21        17  [.] pthread_rwlock_rdlock  libpthread-2.17.so  .:0               0{17  47.1%    n/a}  1{17  52.9%    n/a}
                       3.81%    8.13%    0.00%    0.00%                0x1c      0x7fde4f2d1608       551       234       312       16        13  [.] pthread_rwlock_unlock  libpthread-2.17.so  .:0               0{13  78.6%    n/a}  1{13  21.4%    n/a}
                       4.76%    2.44%    0.00%    0.00%                0x1c      0x7fde4f2d10d5       622       168       154       59        32  [.] pthread_rwlock_rdlock  libpthread-2.17.so  .:0               0{32  37.5%    n/a}  1{32  62.5%    n/a}
                       1.90%    3.25%    0.00%    0.00%                0x1c      0x7fde4f2d1612       541       139       164       32        25  [.] pthread_rwlock_unlock  libpthread-2.17.so  .:0               0{25  50.0%    n/a}  1{25  50.0%    n/a}
                       0.95%    0.81%    0.93%    0.00%                0x38      0x7fde49bd90fd       298       140       226       68        29  [.] DcpProducer::step      ep.so               producer.cc:668   0{29   0.0%  75.9%}  1{29 100.0%  24.1%}
            

            Going to keep looking into producer/active stream changes around this time.

            ben.huddleston Ben Huddleston added a comment - Bit of a brain dump. Would have expected the build 2202 to improve perf, however after running it repeatedly it does not. The fix actually seems to be making things worse. Unfortunately the previous ~20 builds were broken so this is becoming a bit of a pain to get perf numbers. 2202 2202 + a revert of the regression fix   1840475   1830449 1820631 1803297 1811079 1833983 1832017 1837549 1822782 1839338 1825298 1834332 1807678 1814464 1842862 1851247 1814444 1843619 1829581 1839287 Averages ======== 1822930.222 1833458.182 Profiled the build with 2202 + the revert using perf c2c again. This cache line is now pretty hot. It's another producer shared_ptr acquisition. ------------------------------------------------------------- 0 392 408 1173 0 0x7fde45841d00 ------------------------------------------------------------- 94.39% 88.24% 0.00% 0.00% 0x8 0x7fde49b93e22 264 115 151 801 40 [.] ActiveStream::isCompressionEnabled ep.so 1.02% 6.62% 45.52% 0.00% 0x8 0x7fde49b93e30 971 395 271 1434 40 [.] ActiveStream::isCompressionEnabled ep.so 2.55% 2.21% 0.00% 0.00% 0x8 0x7fde49b93981 274 128 180 26 20 [.] fmt::v5::internal::format_decimal<unsigned long, char, fmt::v5::internal::no_thousands_sep> ep.so 2.04% 2.21% 51.49% 0.00% 0x8 0x4299fe 733 486 326 1576 40 [.] 0x00000000000299fe memcached 0.00% 0.49% 0.00% 0.00% 0x8 0x7fde49b93e36 0 442 281 929 40 [.] ActiveStream::isCompressionEnabled ep.so 0.00% 0.25% 2.98% 0.00% 0x8 0x7fde49b9398b 0 282 159 64 29 [.] fmt::v5::internal::format_decimal<unsigned long, char, fmt::v5::internal::no_thousands_sep> ep.so I don't think I've seen this cache line in a profile before so may be of note. ------------------------------------------------------------- 2 394 252 343 0 0x7cfc80 ------------------------------------------------------------- 96.70% 97.62% 0.00% 0.00% 0x30 0x48f23d 296 149 160 814 40 [.] 0x000000000008f23d memcached mcbp_executors.cc:479 0{40 58.7% n/a} 1{40 41.3% n/a} 3.30% 2.38% 100.00% 0.00% 0x38 0x48d7a0 437 337 237 1579 40 [.] mcbp_add_header memcached mcbp.cc:82 0{40 63.2% 63.3%} 1{40 36.8% 36.7%} Also interested in this cache line. Not entirely sure what it's hitting.  ------------------------------------------------------------- 4 105 123 3132 41 0x7fde45841fc0 ------------------------------------------------------------- 16.19% 29.27% 45.88% 78.05% 0x0 0x7fde4f2d15a7 937 633 441 2432 40 [.] pthread_rwlock_unlock libpthread-2.17.so .:0 0{40 75.5% 64.5%} 1{40 24.5% 35.5%} 11.43% 12.20% 10.38% 0.00% 0x0 0x7fde4f2d10b1 688 399 258 1294 40 [.] pthread_rwlock_rdlock libpthread-2.17.so .:0 0{40 51.9% 67.4%} 1{40 48.1% 32.6%} 10.48% 8.13% 10.60% 0.00% 0x0 0x7fde4f2d1600 660 403 377 1137 40 [.] pthread_rwlock_unlock libpthread-2.17.so .:0 0{40 57.1% 61.1%} 1{40 42.9% 38.9%} 5.71% 2.44% 28.10% 2.44% 0x0 0x7fde4f2d105b 897 469 240 1917 40 [.] pthread_rwlock_rdlock libpthread-2.17.so .:0 0{40 55.6% 51.0%} 1{40 44.4% 49.0%} 3.81% 3.25% 2.30% 0.00% 0x0 0x7fde4f2d442f 403 608 244 265 40 [.] __lll_lock_wait libpthread-2.17.so .:0 0{40 62.5% 59.7%} 1{40 37.5% 40.3%} 2.86% 0.81% 1.21% 14.63% 0x4 0x7fde4f2d15b3 388 119 181 275 40 [.] pthread_rwlock_unlock libpthread-2.17.so .:0 0{40 50.0% 68.2%} 1{40 50.0% 31.8%} 0.95% 1.63% 0.32% 4.88% 0x4 0x7fde4f2d10ac 254 556 188 946 40 [.] pthread_rwlock_rdlock libpthread-2.17.so .:0 0{40 66.7% 58.3%} 1{40 33.3% 41.7%} 0.00% 2.44% 0.00% 0.00% 0x14 0x7fde4f2d1068 0 177 196 920 40 [.] 
pthread_rwlock_rdlock libpthread-2.17.so .:0 0{40 66.7% n/a} 1{40 33.3% n/a} 19.05% 10.57% 0.00% 0.00% 0x18 0x7fde4f2d15ad 512 127 176 226 39 [.] pthread_rwlock_unlock libpthread-2.17.so .:0 0{39 57.6% n/a} 1{39 42.4% n/a} 11.43% 6.50% 0.00% 0.00% 0x18 0x7fde4f2d1061 308 126 191 954 40 [.] pthread_rwlock_rdlock libpthread-2.17.so .:0 0{40 55.0% n/a} 1{40 45.0% n/a} 0.00% 0.00% 0.29% 0.00% 0x18 0x7fde4f2d15b8 0 0 0 9 7 [.] pthread_rwlock_unlock libpthread-2.17.so .:0 0{ 7 n/a 33.3%} 1{ 7 n/a 66.7%} 6.67% 8.13% 0.00% 0.00% 0x1c 0x7fde4f2d10ba 396 216 332 21 17 [.] pthread_rwlock_rdlock libpthread-2.17.so .:0 0{17 47.1% n/a} 1{17 52.9% n/a} 3.81% 8.13% 0.00% 0.00% 0x1c 0x7fde4f2d1608 551 234 312 16 13 [.] pthread_rwlock_unlock libpthread-2.17.so .:0 0{13 78.6% n/a} 1{13 21.4% n/a} 4.76% 2.44% 0.00% 0.00% 0x1c 0x7fde4f2d10d5 622 168 154 59 32 [.] pthread_rwlock_rdlock libpthread-2.17.so .:0 0{32 37.5% n/a} 1{32 62.5% n/a} 1.90% 3.25% 0.00% 0.00% 0x1c 0x7fde4f2d1612 541 139 164 32 25 [.] pthread_rwlock_unlock libpthread-2.17.so .:0 0{25 50.0% n/a} 1{25 50.0% n/a} 0.95% 0.81% 0.93% 0.00% 0x38 0x7fde49bd90fd 298 140 226 68 29 [.] DcpProducer::step ep.so producer.cc:668 0{29 0.0% 75.9%} 1{29 100.0% 24.1%} Going to keep looking into producer/active stream changes around this time.
            drigby Dave Rigby added a comment -

            re: ActiveStream::isCompressionEnabled:

            bool ActiveStream::isCompressionEnabled() {
                auto producer = producerPtr.lock();
                if (producer) {
                    return producer->isCompressionEnabled();
                }
                /* If the 'producer' is deleted, what we return doesn't matter */
                return false;
            }
            

            Given that the producer can only change its compression via a dcp_control message, and for other similar control params we simply take the "current" value of that param to use when a stream is created (e.g. IncludeDeleteTime), one optimisation would be to handle compression in the same way: give ActiveStream its own member variable, set on construction, which avoids having to lock the producer.
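
            As a minimal sketch of that approach (simplified stand-in classes and assumed member names, not the real kv_engine code): the stream snapshots the producer's compression setting at construction, so the hot path never needs to lock the weak_ptr.

            #include <memory>

            // Hypothetical, simplified stand-ins for illustration only.
            struct DcpProducer {
                bool isCompressionEnabled() const { return compressionEnabled; }
                bool compressionEnabled = true;
            };

            class ActiveStream {
            public:
                // Snapshot the setting at construction, as is already done for other
                // control params such as IncludeDeleteTime.
                explicit ActiveStream(const std::shared_ptr<DcpProducer>& producer)
                    : producerPtr(producer),
                      compressionEnabled(producer->isCompressionEnabled()) {
                }

                // Hot path: no weak_ptr::lock(), so no contention on the shared
                // control-block cache line.
                bool isCompressionEnabled() const {
                    return compressionEnabled;
                }

            private:
                std::weak_ptr<DcpProducer> producerPtr;
                const bool compressionEnabled;  // fixed for the lifetime of the stream
            };

            int main() {
                auto producer = std::make_shared<DcpProducer>();
                ActiveStream stream(producer);
                return stream.isCompressionEnabled() ? 0 : 1;
            }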

            Technically this now means you cannot change compression dynamically on an existing stream, but I don't think we ever documented that working before, and it's an internal-only setting anyway (used by Backup & XDCR).

            FYI John Liang / Patrick Varley - would making force_value_compression only applicable to streams created after it (and no longer being dynamic) be a problem?

            drigby Dave Rigby added a comment - re: ActiveStream::isCompressionEnabled : bool ActiveStream::isCompressionEnabled() { auto producer = producerPtr.lock(); if (producer) { return producer->isCompressionEnabled(); } /* If the 'producer' is deleted, what we return doesn't matter */ return false; } Given that the producer can only change it's compression via a dcp_control message, and for other similar control params we simply take the "current" value of that param to use when a stream is created (e.g. IncludeDeleteTime ), one optimisation would be to handle compression in the same way - give ActiveStream it's own member variable set on construction which avoids having to lock the producer. Technically this now means you cannot change compression enabled dynamiclly on an existing stream, but I don't think we ever documented that working before, and it's an internal-only setting anyways (used by Backup & XDCR). FYI John Liang / Patrick Varley - would making force_value_compression only applicable to streams created after it (and no longer being dynamic) be a problem?

            Backup does not use "force_value_compression" mode, as you advised against it a while back (I would like to try it at some point); it uses the normal DCP compression mode. Backup also does not dynamically change DCP stream settings.

            pvarley Patrick Varley added a comment - Backup does not use "force_value_compression" mode, as you advised against it a while back (I would like to try it at some point). It uses the normal DCP compression mode. backup also does not dynamic change DCP stream settings.
            drigby Dave Rigby added a comment -

            @PV thanks for confirming.

            drigby Dave Rigby added a comment - @PV thanks for confirming.

            Other changes that may impact performance by changing the DcpProducer layout include:

            1. Build 2183 - Add consumer name to producer - http://review.couchbase.org/#/c/104133/
            2. Build 1977 - Change backfill manager from shared_ptr to a custom atomic shared pointer - http://review.couchbase.org/#/c/103209/
            3. Build 1818 - Change StreamsMap for multiple streams per vb - http://review.couchbase.org/#/c/101748/

            Current perf runs from around build 1977 don't suggest any issue; in fact, performance seems to increase with the next build tested (1987). Will rerun some tests around here and investigate number 2 if there is some regression.

            I think we can rule out number 3 as far as layout is concerned: the map itself that sits in the producer should not have changed, just the values inside it, which live at other memory locations. Will run pahole/ptype to confirm. It may be worth considering cache-line padding the mutexes in StreamContainer though, as they probably share a line with the forward list.

            Of course, the issues could be non-layout-related.

            ben.huddleston Ben Huddleston added a comment - Other changes that may impact performance due to the dcp producer layout changing include: Build 2183 - Add consumer name to producer - http://review.couchbase.org/#/c/104133/ Build 1977 - Change backfill manager from shared_ptr to a custom atomic shared pointer - http://review.couchbase.org/#/c/103209/ Build 1818 - Change StreamsMap for multiple streams per vb - http://review.couchbase.org/#/c/101748/ Current perf runs from around build 1977 don't suggest any issue, in fact performance seems to increase with the next build tested (1987). Will rerun some tests around here and investigate number 2 if there is some regression. I think we can rule out number 3 as far as layout is concerned, the actual map that should sit in the producer should not have changed, just the values inside it which will be off at some other memory locations. Will run pahole/ptype to confirm. May be worth considering cacheline padding the mutexes in StreamContainer though as they probably share with the forward list. Of course, the issues could be non layout related.
            ben.huddleston Ben Huddleston added a comment - - edited

            Spotted that the mutex in the AtomicSharedPtr to the BackfillManager is straddling two cache lines post build 2183.

            ========================
            /* 192   32 */    std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > consumerName
            /* 200    4 */    Couchbase::RelaxedAtomic<unsigned int> lastSendTime
            /* XXX 32 bit hole, try to pack */
            /* 208   88 */    DcpProducer::BufferLog log
            ========================
            /* 256   88 */    DcpProducer::BufferLog log
            /* 296   56 */    cb::AtomicSharedPtr<BackfillManager> backfillMgr
            ========================
            /* 320   56 */    cb::AtomicSharedPtr<BackfillManager> backfillMgr
            /* 352  176 */    DcpReadyQueue ready
            ========================
            

            (gdb) pahole -c cb::AtomicSharedPtr<BackfillManager>
            Cache lines are printed every 64 bytes and are a guidance only that are subject to natural alignment of the object.
            /*   56     */ struct cb::AtomicSharedPtr<BackfillManager> {
            /*   0   40 */    std::mutex mutex
            /*  40   16 */    std::shared_ptr<BackfillManager> ptr
                          }
            

            What's it sharing with?

            On the buffer log side it's now sharing with part of the RWLock, the producer reference, maxBytes, bytesOutstanding, and ackedBytes. Quite possibly an issue, but depends on the usage. Notably, this RWLock is now also straddling two cache lines.

            (gdb) pahole DcpProducer::BufferLog
            /*   88     */ struct DcpProducer::BufferLog {
            /*   0   56 */    cb::RWLock logLock
            /*  56    8 */    DcpProducer & producer
            /*  64    8 */    unsigned long maxBytes
            /*  72    8 */    cb::NonNegativeCounter<unsigned long, cb::ThrowExceptionUnderflowPolicy> bytesOutstanding
            /*  80    8 */    Monotonic<unsigned long, ThrowExceptionPolicy, std::greater> ackedBytes
                          }
            

            On the ready queue side, the mutex and the shared_ptr of the AtomicSharedPtr are sharing with part of the mutex in the DcpReadyQueue. Notably, this mutex is now also straddling a cache line.

            (gdb) pahole DcpReadyQueue
            /*  176     */ struct DcpReadyQueue {
            /*   0   40 */    std::mutex lock
            /*  40   80 */    std::queue<Vbid, std::deque<Vbid, std::allocator<Vbid> > > readyQueue
            /* 120   56 */    std::unordered_set<Vbid, std::hash<Vbid>, std::equal_to<Vbid>, std::allocator<Vbid> > queuedValues
                          }
            

            Will try to correlate this with the perf c2c data for build 2202.
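
            To make the straddle concrete, here is a throwaway compile-time check on a toy layout (offsets loosely mirror the pahole output above, but this is not the real DcpProducer):

            #include <cstddef>

            // Toy stand-in purely for illustration.
            struct ToyProducer {
                char header[296];           // everything up to the backfill manager member
                char backfillMgrMutex[40];  // stand-in for the AtomicSharedPtr's std::mutex
                char rest[64];
            };

            // True when [offset, offset + size) lies entirely within one 64-byte line.
            constexpr bool fitsInOneCacheLine(std::size_t offset, std::size_t size) {
                return (offset / 64) == ((offset + size - 1) / 64);
            }

            // The mutex starts at offset 296 and ends at byte 335, crossing the line
            // boundary at 320, which is the situation pahole is reporting above.
            static_assert(!fitsInOneCacheLine(offsetof(ToyProducer, backfillMgrMutex),
                                              sizeof(ToyProducer::backfillMgrMutex)),
                          "expected the toy layout to reproduce the straddle");

            int main() { return 0; }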

            ben.huddleston Ben Huddleston added a comment - - edited Spotted that the mutex in the AtomicSharedPtr to the BackfillManager is straddling two cache lines post build 2183. ======================== /* 192 32 */ std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > consumerName /* 200 4 */ Couchbase::RelaxedAtomic<unsigned int> lastSendTime /* XXX 32 bit hole, try to pack */ /* 208 88 */ DcpProducer::BufferLog log ======================== /* 256 88 */ DcpProducer::BufferLog log /* 296 56 */ cb::AtomicSharedPtr<BackfillManager> backfillMgr ======================== /* 320 56 */ cb::AtomicSharedPtr<BackfillManager> backfillMgr /* 352 176 */ DcpReadyQueue ready ======================== (gdb) pahole -c cb::AtomicSharedPtr<BackfillManager> Cache lines are printed every 64 bytes and are a guidance only that are subject to natural alignment of the object. /* 56 */ struct cb::AtomicSharedPtr<BackfillManager> { /* 0 40 */ std::mutex mutex /* 40 16 */ std::shared_ptr<BackfillManager> ptr } What's it sharing with? On the buffer log side it's now sharing with part of the RWLock, the producer reference, maxBytes, bytesOutstanding, and ackedBytes. Quite possibly an issue, but depends on the usage. Notably, this RWLock is now also straddling two cache lines. (gdb) pahole DcpProducer::BufferLog /* 88 */ struct DcpProducer::BufferLog { /* 0 56 */ cb::RWLock logLock /* 56 8 */ DcpProducer & producer /* 64 8 */ unsigned long maxBytes /* 72 8 */ cb::NonNegativeCounter<unsigned long, cb::ThrowExceptionUnderflowPolicy> bytesOutstanding /* 80 8 */ Monotonic<unsigned long, ThrowExceptionPolicy, std::greater> ackedBytes } On the ready queue side, the mutex and the shared_ptr of the AtomicSharedPtr are sharing with part of the mutex in the DcpReadyQueue. Notably, this mutex is now also straddling a cache line. (gdb) pahole DcpReadyQueue /* 176 */ struct DcpReadyQueue { /* 0 40 */ std::mutex lock /* 40 80 */ std::queue<Vbid, std::deque<Vbid, std::allocator<Vbid> > > readyQueue /* 120 56 */ std::unordered_set<Vbid, std::hash<Vbid>, std::equal_to<Vbid>, std::allocator<Vbid> > queuedValues } Will try to correlate this with the perf c2c data for build 2202.

            Also confirmed there are no unexpected changes to layout of StreamsMap with the StreamContainer changes.

            ben.huddleston Ben Huddleston added a comment - Also confirmed there are no unexpected changes to layout of StreamsMap with the StreamContainer changes.

            Ran a load of perf tests with slightly different fixes for the issues in DcpProducer.

            2202 | 2202 + move consumer name to end | 2202 + cache line pad mutex in atomic shared ptr | 2202 + move string and pad atomic shared ptr | 2202 + cache line pad rw lock in buffer log | 2202 + pad atomic shared ptr and buffer log lock
            1820631     1831132    
            1811079   1844199 1821569 1838826 1864285
            1832017 1829126 1856378 1846906 1852154 1817876
            1822782 1839222 1837934 1848847 1807318 1840398
            1825298 1825093 1841431 1842268 1802737 1820714
            1807678 1835437 1841422 1822030 1814812 1827543
            1842862 1852079 1812351 1847608 1868165 1797235
            1814444 1855286 1825289 1842034 1837529 1845806
            1829581 1833542 1850124 1844421 1842462 1830152
            Averages ======== ======== ======== ======== ========
            1822930.222 1838540.714 1838641 1838535 1833000.375 1830501.125

            I think the producer might be starting to suffer from poor locality, as performance seems to drop off as the class grows (though that could just be too little data). I would suggest cache-line padding the mutex in AtomicSharedPtr, as it's more robust than moving the consumer name to the end of the class. Still not getting back the ~50k ops/s or so that we lost.
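
            A minimal sketch of what cache-line padding that mutex could look like (this AtomicSharedPtr is a simplified stand-in for illustration, not the real cb::AtomicSharedPtr):

            #include <memory>
            #include <mutex>

            // Simplified stand-in; the real cb::AtomicSharedPtr differs.
            template <typename T>
            class AtomicSharedPtr {
            public:
                std::shared_ptr<T> load() {
                    std::lock_guard<std::mutex> guard(mutex);
                    return ptr;
                }

                void store(std::shared_ptr<T> newPtr) {
                    std::lock_guard<std::mutex> guard(mutex);
                    ptr = std::move(newPtr);
                }

            private:
                // alignas(64) makes the mutex start on a cache-line boundary and pads
                // AtomicSharedPtr out to a multiple of 64 bytes, so neighbouring members
                // of the enclosing class (e.g. BufferLog, DcpReadyQueue) can no longer
                // share a line with it. The cost is a little extra padding per instance.
                alignas(64) std::mutex mutex;
                std::shared_ptr<T> ptr;
            };

            struct BackfillManager {};

            int main() {
                AtomicSharedPtr<BackfillManager> backfillMgr;
                backfillMgr.store(std::make_shared<BackfillManager>());
                return backfillMgr.load() ? 0 : 1;
            }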

            ben.huddleston Ben Huddleston added a comment - Ran a load of perf tests with slightly different fixes for the issues in DcpProducer. 2202 2202 + move consumer name to end 2202 + cache line pad mutex in atomic shared ptr 2202 + move string and pad atomic shared ptr 2202 + cache line pad rw lock in buffer log 2202 + pad atomic shared ptr and buffer log lock 1820631     1831132     1811079   1844199 1821569 1838826 1864285 1832017 1829126 1856378 1846906 1852154 1817876 1822782 1839222 1837934 1848847 1807318 1840398 1825298 1825093 1841431 1842268 1802737 1820714 1807678 1835437 1841422 1822030 1814812 1827543 1842862 1852079 1812351 1847608 1868165 1797235 1814444 1855286 1825289 1842034 1837529 1845806 1829581 1833542 1850124 1844421 1842462 1830152 Averages ======== ======== ======== ======== ======== 1822930.222 1838540.714 1838641 1838535 1833000.375 1830501.125 I think the Producer might be starting to suffer from poor locality as performance seems to be dropping off with size now (could just be not enough data though). Would suggest cache line padding the mutex in atomic shared ptr as it's more robust than moving the consumer name to the end of the class. Still not getting back the ~50k op/s or so that we lost.
            ben.huddleston Ben Huddleston added a comment - - edited

            Got some info on the contended RWLock cache line using the call-graph option of perf c2c.

              -------------------------------------------------------------
                  3       22       33      724        8      0x7f9b4e523d80
              -------------------------------------------------------------
                      27.27%   18.18%   47.38%   62.50%                 0x0      0x7f9c23c675a7      1233       767       431      562        40  [.] pthread_rwlock_unlock  libpthread-2.17.so  .:0           0{40  58.3%  64.7%}  1{40  41.7%  35.3%}
                        |
                        ---__clone
                           start_thread
                           platform_thread_wrap
                           worker_libevent
                           event_base_loop
                           event_process_active_single_queue.isra.26
                           event_handler
                           run_event_loop
                           Connection::runEventLoop
                           Connection::runStateMachinery
                           |
                           |--45.62%--StateMachine::conn_execute
                           |          Cookie::execute
                           |          execute_client_request_packet
                           |          set_executor
                           |          SteppableCommandContext::drive
                           |          MutationCommandContext::step
                           |          MutationCommandContext::storeItem
                           |          bucket_store_if
                           |          EventuallyPersistentEngine::store_if
                           |          EventuallyPersistentEngine::storeIfInner
                           |          KVBucket::set
                           |          VBucket::set
                           |          DcpConnMap::notifyVBConnections
                           |          DcpProducer::notifySeqnoAvailable
                           |          cb::RWLock::readerUnlock
                           |          cb_rw_reader_exit
                           |          pthread_rwlock_unlock
                           |
                            --1.85%--StateMachine::conn_ship_log
                                      non-virtual thunk to EventuallyPersistentEngine::step(gsl::not_null<void const*>, gsl::not_null<dcp_message_producers*>)
                                      DcpProducer::step
                                      DcpProducer::getNextItem
                                      cb::RWLock::readerUnlock
                                      cb_rw_reader_exit
                                      pthread_rwlock_unlock
             
                      22.73%    0.00%   10.64%    0.00%                 0x0      0x7f9c23c67600       599         0       383      248        40  [.] pthread_rwlock_unlock  libpthread-2.17.so  .:0           0{40  40.0%  62.3%}  1{40  60.0%  37.7%}
                        |
                        ---__clone
                           start_thread
                           platform_thread_wrap
                           worker_libevent
                           event_base_loop
                           event_process_active_single_queue.isra.26
                           event_handler
                           run_event_loop
                           Connection::runEventLoop
                           Connection::runStateMachinery
                           |
                           |--10.20%--StateMachine::conn_execute
                           |          Cookie::execute
                           |          execute_client_request_packet
                           |          set_executor
                           |          SteppableCommandContext::drive
                           |          MutationCommandContext::step
                           |          MutationCommandContext::storeItem
                           |          bucket_store_if
                           |          EventuallyPersistentEngine::store_if
                           |          EventuallyPersistentEngine::storeIfInner
                           |          KVBucket::set
                           |          VBucket::set
                           |          DcpConnMap::notifyVBConnections
                           |          DcpProducer::notifySeqnoAvailable
                           |          cb::RWLock::readerUnlock
                           |          cb_rw_reader_exit
                           |          pthread_rwlock_unlock
                           |
                            --0.31%--StateMachine::conn_ship_log
                                      non-virtual thunk to EventuallyPersistentEngine::step(gsl::not_null<void const*>, gsl::not_null<dcp_message_producers*>)
                                      DcpProducer::step
                                      DcpProducer::getNextItem
                                      cb::RWLock::readerUnlock
                                      cb_rw_reader_exit
                                      pthread_rwlock_unlock
             
                       4.55%    9.09%    2.62%    0.00%                 0x0      0x7f9c23c6a42f       453       476       256       75        34  [.] __lll_lock_wait        libpthread-2.17.so  .:0           0{34  75.0%  52.6%}  1{34  25.0%  47.4%}
                        |
                        ---__clone
                           start_thread
                           platform_thread_wrap
                           worker_libevent
                           event_base_loop
                           event_process_active_single_queue.isra.26
                           event_handler
                           run_event_loop
                           Connection::runEventLoop
                           Connection::runStateMachinery
                           |
                           |--2.54%--StateMachine::conn_execute
                           |          Cookie::execute
                           |          execute_client_request_packet
                           |          set_executor
                           |          SteppableCommandContext::drive
                           |          MutationCommandContext::step
                           |          MutationCommandContext::storeItem
                           |          bucket_store_if
                           |          EventuallyPersistentEngine::store_if
                           |          EventuallyPersistentEngine::storeIfInner
                           |          KVBucket::set
                           |          VBucket::set
                           |          DcpConnMap::notifyVBConnections
                           |          DcpProducer::notifySeqnoAvailable
                           |          |
                           |          |--1.89%--cb::RWLock::readerLock
                           |          |          cb_rw_reader_enter
                           |          |          pthread_rwlock_rdlock
                           |          |          __lll_lock_wait
                           |          |
                           |           --0.65%--cb::RWLock::readerUnlock
                           |                     cb_rw_reader_exit
                           |                     pthread_rwlock_unlock
                           |                     __lll_lock_wait
                           |
                            --0.31%--StateMachine::conn_ship_log
                                      non-virtual thunk to EventuallyPersistentEngine::step(gsl::not_null<void const*>, gsl::not_null<dcp_message_producers*>)
                                      DcpProducer::step
                                      DcpProducer::getNextItem
                                      |
                                      |--0.31%--cb::RWLock::readerLock
                                      |          cb_rw_reader_enter
                                      |          pthread_rwlock_rdlock
                                      |          __lll_lock_wait
                                      |
                                       --0.00%--cb::RWLock::readerUnlock
                                                 cb_rw_reader_exit
                                                 pthread_rwlock_unlock
                                                 __lll_lock_wait
             
                       4.55%    6.06%    9.25%    0.00%                 0x0      0x7f9c23c670b1       789      1004       261      304        40  [.] pthread_rwlock_rdlock  libpthread-2.17.so  .:0           0{40 100.0%  67.2%}  1{40   0.0%  32.8%}
                        |
                        ---__clone
                           start_thread
                           platform_thread_wrap
                           worker_libevent
                           event_base_loop
                           event_process_active_single_queue.isra.26
                           event_handler
                           run_event_loop
                           Connection::runEventLoop
                           Connection::runStateMachinery
                           |
                           |--8.33%--StateMachine::conn_execute
                           |          Cookie::execute
                           |          execute_client_request_packet
                           |          set_executor
                           |          SteppableCommandContext::drive
                           |          MutationCommandContext::step
                           |          MutationCommandContext::storeItem
                           |          bucket_store_if
                           |          EventuallyPersistentEngine::store_if
                           |          EventuallyPersistentEngine::storeIfInner
                           |          KVBucket::set
                           |          VBucket::set
                           |          DcpConnMap::notifyVBConnections
                           |          DcpProducer::notifySeqnoAvailable
                           |          cb::RWLock::readerLock
                           |          cb_rw_reader_enter
                           |          pthread_rwlock_rdlock
                           |
                            --0.64%--StateMachine::conn_ship_log
                                      non-virtual thunk to EventuallyPersistentEngine::step(gsl::not_null<void const*>, gsl::not_null<dcp_message_producers*>)
                                      DcpProducer::step
                                      DcpProducer::getNextItem
                                      cb::RWLock::readerLock
                                      cb_rw_reader_enter
                                      pthread_rwlock_rdlock
             
                       0.00%    6.06%   28.59%   12.50%                 0x0      0x7f9c23c6705b         0       199       246      451        40  [.] pthread_rwlock_rdlock  libpthread-2.17.so  .:0           0{40 100.0%  61.1%}  1{40   0.0%  38.9%}
                        |
                        ---__clone
                           start_thread
                           platform_thread_wrap
                           worker_libevent
                           event_base_loop
                           event_process_active_single_queue.isra.26
                           event_handler
                           run_event_loop
                           Connection::runEventLoop
                           Connection::runStateMachinery
                           |
                           |--26.42%--StateMachine::conn_execute
                           |          Cookie::execute
                           |          execute_client_request_packet
                           |          set_executor
                           |          SteppableCommandContext::drive
                           |          MutationCommandContext::step
                           |          MutationCommandContext::storeItem
                           |          bucket_store_if
                           |          EventuallyPersistentEngine::store_if
                           |          EventuallyPersistentEngine::storeIfInner
                           |          KVBucket::set
                           |          VBucket::set
                           |          DcpConnMap::notifyVBConnections
                           |          DcpProducer::notifySeqnoAvailable
                           |          cb::RWLock::readerLock
                           |          cb_rw_reader_enter
                           |          pthread_rwlock_rdlock
                           |
                            --1.83%--StateMachine::conn_ship_log
                                      non-virtual thunk to EventuallyPersistentEngine::step(gsl::not_null<void const*>, gsl::not_null<dcp_message_producers*>)
                                      DcpProducer::step
                                      DcpProducer::getNextItem
                                      cb::RWLock::readerLock
                                      cb_rw_reader_enter
                                      pthread_rwlock_rdlock
             
                       0.00%    3.03%    0.14%    0.00%                 0x4      0x7f9c23c670ac         0       135       184      200        40  [.] pthread_rwlock_rdlock  libpthread-2.17.so  .:0           0{40 100.0% 100.0%}  1{40   0.0%   0.0%}
                        |
                        ---__clone
                           start_thread
                           platform_thread_wrap
                           worker_libevent
                           event_base_loop
                           event_process_active_single_queue.isra.26
                           event_handler
                           run_event_loop
                           Connection::runEventLoop
                           Connection::runStateMachinery
                           |
                           |--0.20%--StateMachine::conn_execute
                           |          Cookie::execute
                           |          execute_client_request_packet
                           |          set_executor
                           |          SteppableCommandContext::drive
                           |          MutationCommandContext::step
                           |          MutationCommandContext::storeItem
                           |          bucket_store_if
                           |          EventuallyPersistentEngine::store_if
                           |          EventuallyPersistentEngine::storeIfInner
                           |          KVBucket::set
                           |          VBucket::set
                           |          DcpConnMap::notifyVBConnections
                           |          DcpProducer::notifySeqnoAvailable
                           |          cb::RWLock::readerLock
                           |          cb_rw_reader_enter
                           |          pthread_rwlock_rdlock
                           |
                            --0.00%--StateMachine::conn_ship_log
                                      non-virtual thunk to EventuallyPersistentEngine::step(gsl::not_null<void const*>, gsl::not_null<dcp_message_producers*>)
                                      DcpProducer::step
                                      DcpProducer::getNextItem
                                      cb::RWLock::readerLock
                                      cb_rw_reader_enter
                                      pthread_rwlock_rdlock
             
                       4.55%    0.00%    1.24%   25.00%                 0x4      0x7f9c23c675b3       591         0       184       59        32  [.] pthread_rwlock_unlock  libpthread-2.17.so  .:0           0{32   0.0%  63.6%}  1{32 100.0%  36.4%}
                        |
                        ---__clone
                           start_thread
                           platform_thread_wrap
                           worker_libevent
                           event_base_loop
                           event_process_active_single_queue.isra.26
                           event_handler
                           run_event_loop
                           Connection::runEventLoop
                           Connection::runStateMachinery
                           |
                           |--1.30%--StateMachine::conn_execute
                           |          Cookie::execute
                           |          execute_client_request_packet
                           |          set_executor
                           |          SteppableCommandContext::drive
                           |          MutationCommandContext::step
                           |          MutationCommandContext::storeItem
                           |          bucket_store_if
                           |          EventuallyPersistentEngine::store_if
                           |          EventuallyPersistentEngine::storeIfInner
                           |          KVBucket::set
                           |          VBucket::set
                           |          DcpConnMap::notifyVBConnections
                           |          DcpProducer::notifySeqnoAvailable
                           |          cb::RWLock::readerUnlock
                           |          cb_rw_reader_exit
                           |          pthread_rwlock_unlock
                           |
                            --0.14%--StateMachine::conn_ship_log
                                      non-virtual thunk to EventuallyPersistentEngine::step(gsl::not_null<void const*>, gsl::not_null<dcp_message_producers*>)
                                      DcpProducer::step
                                      DcpProducer::getNextItem
                                      cb::RWLock::readerUnlock
                                      cb_rw_reader_exit
                                      pthread_rwlock_unlock
             
                       4.55%    3.03%    0.00%    0.00%                0x14      0x7f9c23c67068       320       128       195      227        40  [.] pthread_rwlock_rdlock  libpthread-2.17.so  .:0           0{40  50.0%    n/a}  1{40  50.0%    n/a}
                        |
                        ---__clone
                           start_thread
                           platform_thread_wrap
                           worker_libevent
                           event_base_loop
                           event_process_active_single_queue.isra.26
                           event_handler
                           run_event_loop
                           Connection::runEventLoop
                           Connection::runStateMachinery
                           |
                           |--0.06%--StateMachine::conn_execute
                           |          Cookie::execute
                           |          execute_client_request_packet
                           |          set_executor
                           |          SteppableCommandContext::drive
                           |          MutationCommandContext::step
                           |          MutationCommandContext::storeItem
                           |          bucket_store_if
                           |          EventuallyPersistentEngine::store_if
                           |          EventuallyPersistentEngine::storeIfInner
                           |          KVBucket::set
                           |          VBucket::set
                           |          DcpConnMap::notifyVBConnections
                           |          DcpProducer::notifySeqnoAvailable
                           |          cb::RWLock::readerLock
                           |          cb_rw_reader_enter
                           |          pthread_rwlock_rdlock
                           |
                            --0.00%--StateMachine::conn_ship_log
                                      non-virtual thunk to EventuallyPersistentEngine::step(gsl::not_null<void const*>, gsl::not_null<dcp_message_producers*>)
                                      DcpProducer::step
                                      DcpProducer::getNextItem
                                      cb::RWLock::readerLock
                                      cb_rw_reader_enter
                                      pthread_rwlock_rdlock
             
                      18.18%   12.12%    0.00%    0.00%                0x18      0x7f9c23c67061       379       108       197      215        40  [.] pthread_rwlock_rdlock  libpthread-2.17.so  .:0           0{40  75.0%    n/a}  1{40  25.0%    n/a}
                        |
                        ---__clone
                           start_thread
                           platform_thread_wrap
                           worker_libevent
                           event_base_loop
                           event_process_active_single_queue.isra.26
                           event_handler
                           run_event_loop
                           Connection::runEventLoop
                           Connection::runStateMachinery
                           |
                           |--0.05%--StateMachine::conn_execute
                           |          Cookie::execute
                           |          execute_client_request_packet
                           |          set_executor
                           |          SteppableCommandContext::drive
                           |          MutationCommandContext::step
                           |          MutationCommandContext::storeItem
                           |          bucket_store_if
                           |          EventuallyPersistentEngine::store_if
                           |          EventuallyPersistentEngine::storeIfInner
                           |          KVBucket::set
                           |          VBucket::set
                           |          DcpConnMap::notifyVBConnections
                           |          DcpProducer::notifySeqnoAvailable
                           |          cb::RWLock::readerLock
                           |          cb_rw_reader_enter
                           |          pthread_rwlock_rdlock
                           |
                            --0.00%--StateMachine::conn_ship_log
                                      non-virtual thunk to EventuallyPersistentEngine::step(gsl::not_null<void const*>, gsl::not_null<dcp_message_producers*>)
                                      DcpProducer::step
                                      DcpProducer::getNextItem
                                      cb::RWLock::readerLock
                                      cb_rw_reader_enter
                                      pthread_rwlock_rdlock
             
                       4.55%   12.12%    0.00%    0.00%                0x18      0x7f9c23c675ad       521       227       192       57        33  [.] pthread_rwlock_unlock  libpthread-2.17.so  .:0           0{33  80.0%    n/a}  1{33  20.0%    n/a}
                        |
                        ---__clone
                           start_thread
                           platform_thread_wrap
                           worker_libevent
                           event_base_loop
                           event_process_active_single_queue.isra.26
                           event_handler
                           run_event_loop
                           Connection::runEventLoop
                           Connection::runStateMachinery
                           |
                           |--0.01%--StateMachine::conn_execute
                           |          Cookie::execute
                           |          execute_client_request_packet
                           |          set_executor
                           |          SteppableCommandContext::drive
                           |          MutationCommandContext::step
                           |          MutationCommandContext::storeItem
                           |          bucket_store_if
                           |          EventuallyPersistentEngine::store_if
                           |          EventuallyPersistentEngine::storeIfInner
                           |          KVBucket::set
                           |          VBucket::set
                           |          DcpConnMap::notifyVBConnections
                           |          DcpProducer::notifySeqnoAvailable
                           |          cb::RWLock::readerUnlock
                           |          cb_rw_reader_exit
                           |          pthread_rwlock_unlock
                           |
                            --0.00%--StateMachine::conn_ship_log
                                      non-virtual thunk to EventuallyPersistentEngine::step(gsl::not_null<void const*>, gsl::not_null<dcp_message_producers*>)
                                      DcpProducer::step
                                      DcpProducer::getNextItem
                                      cb::RWLock::readerUnlock
                                      cb_rw_reader_exit
                                      pthread_rwlock_unlock
             
                       0.00%    0.00%    0.14%    0.00%                0x18      0x7f9c23c675b8         0         0         0        1         1  [.] pthread_rwlock_unlock  libpthread-2.17.so  .:0                                1{ 1    n/a 100.0%}
                        |
                        ---__clone
                           start_thread
                           platform_thread_wrap
                           worker_libevent
                           event_base_loop
                           event_process_active_single_queue.isra.26
                           event_handler
                           run_event_loop
                           Connection::runEventLoop
                           Connection::runStateMachinery
                           StateMachine::conn_execute
                           Cookie::execute
                           execute_client_request_packet
                           set_executor
                           SteppableCommandContext::drive
                           MutationCommandContext::step
                           MutationCommandContext::storeItem
                           bucket_store_if
                           EventuallyPersistentEngine::store_if
                           EventuallyPersistentEngine::storeIfInner
                           KVBucket::set
                           VBucket::set
                           DcpConnMap::notifyVBConnections
                           DcpProducer::notifySeqnoAvailable
                           cb::RWLock::readerUnlock
                           cb_rw_reader_exit
                           pthread_rwlock_unlock
             
                       0.00%   18.18%    0.00%    0.00%                0x1c      0x7f9c23c67608         0       155       174        9         8  [.] pthread_rwlock_unlock  libpthread-2.17.so  .:0           0{ 8  66.7%    n/a}  1{ 8  33.3%    n/a}
                        |
                        ---__clone
                           start_thread
                           platform_thread_wrap
                           worker_libevent
                           event_base_loop
                           event_process_active_single_queue.isra.26
                           event_handler
                           run_event_loop
                           Connection::runEventLoop
                           Connection::runStateMachinery
                           |
                           |--0.00%--StateMachine::conn_execute
                           |          Cookie::execute
                           |          execute_client_request_packet
                           |          set_executor
                           |          SteppableCommandContext::drive
                           |          MutationCommandContext::step
                           |          MutationCommandContext::storeItem
                           |          bucket_store_if
                           |          EventuallyPersistentEngine::store_if
                           |          EventuallyPersistentEngine::storeIfInner
                           |          KVBucket::set
                           |          VBucket::set
                           |          DcpConnMap::notifyVBConnections
                           |          DcpProducer::notifySeqnoAvailable
                           |          cb::RWLock::readerUnlock
                           |          cb_rw_reader_exit
                           |          pthread_rwlock_unlock
                           |
                            --0.00%--StateMachine::conn_ship_log
                                      non-virtual thunk to EventuallyPersistentEngine::step(gsl::not_null<void const*>, gsl::not_null<dcp_message_producers*>)
                                      DcpProducer::step
                                      DcpProducer::getNextItem
                                      cb::RWLock::readerUnlock
                                      cb_rw_reader_exit
                                      pthread_rwlock_unlock
             
                       0.00%   12.12%    0.00%    0.00%                0x1c      0x7f9c23c670ba         0       249         0        4         4  [.] pthread_rwlock_rdlock  libpthread-2.17.so  .:0                                1{ 4 100.0%    n/a}
                        |
                        ---__clone
                           start_thread
                           platform_thread_wrap
                           worker_libevent
                           event_base_loop
                           event_process_active_single_queue.isra.26
                           event_handler
                           run_event_loop
                           Connection::runEventLoop
                           Connection::runStateMachinery
                           StateMachine::conn_execute
                           Cookie::execute
                           execute_client_request_packet
                           set_executor
                           SteppableCommandContext::drive
                           MutationCommandContext::step
                           MutationCommandContext::storeItem
                           bucket_store_if
                           EventuallyPersistentEngine::store_if
                           EventuallyPersistentEngine::storeIfInner
                           KVBucket::set
                           VBucket::set
                           DcpConnMap::notifyVBConnections
                           DcpProducer::notifySeqnoAvailable
                           cb::RWLock::readerLock
                           cb_rw_reader_enter
                           pthread_rwlock_rdlock
             
                       9.09%    0.00%    0.00%    0.00%                0x1c      0x7f9c23c670d5       382         0       173       11        12  [.] pthread_rwlock_rdlock  libpthread-2.17.so  .:0           0{12  50.0%    n/a}  1{12  50.0%    n/a}
                        |
                        ---__clone
                           start_thread
                           platform_thread_wrap
                           worker_libevent
                           event_base_loop
                           event_process_active_single_queue.isra.26
                           event_handler
                           run_event_loop
                           Connection::runEventLoop
                           Connection::runStateMachinery
                           StateMachine::conn_execute
                           Cookie::execute
                           execute_client_request_packet
                           set_executor
                           SteppableCommandContext::drive
                           MutationCommandContext::step
                           MutationCommandContext::storeItem
                           bucket_store_if
                           EventuallyPersistentEngine::store_if
                           EventuallyPersistentEngine::storeIfInner
                           KVBucket::set
                           VBucket::set
                           DcpConnMap::notifyVBConnections
                           DcpProducer::notifySeqnoAvailable
                           cb::RWLock::readerLock
                           cb_rw_reader_enter
                           pthread_rwlock_rdlock
            

            ben.huddleston Ben Huddleston added a comment - - edited Got some info on the RWLock contentious cache line using the callgraph option of perf c2c.

             ben.huddleston Ben Huddleston added a comment -

             Tested using build 2384. This cache line is contentious.

              -------------------------------------------------------------
                  1     1209      839     1863        0            0x7cec80
              -------------------------------------------------------------
                      96.77%   97.74%    0.00%    0.00%                0x30            0x48fcbd       296       145       161     2625        40  [.] update_topkeys   memcached  unique_ptr.h:147    0{40  56.3%    n/a}  1{40  43.7%    n/a}
                       3.23%    2.26%  100.00%    0.00%                0x38            0x48e1c0       402       182       225     5269        40  [.] mcbp_add_header  memcached  atomic_base.h:514   0{40  63.8%  57.8%}  1{40  36.2%  42.2%}
            

            The issue is in Bucket.

            (gdb) pahole Bucket
            /* 179352     */ struct Bucket {
            /*   0   40 */    std::mutex mutex
            /*  40   48 */    std::condition_variable cond
            /*  88    4 */    unsigned int clients
            /*  92    1 */    std::atomic<Bucket::State> state
            /*  93    1 */    Bucket::Type type
            /*  94  101 */    char [101] name
            /* XXX 40 bit hole, try to pack */
            /* 200   96 */    std::array<std::vector<engine_event_handler, std::allocator<engine_event_handler> >, 4> engine_event_handlers
            /* 296   24 */    std::vector<thread_stats, std::allocator<thread_stats> > stats
            /* 320 176504 */    Timings timings
            /* 176824  672 */    TimingHistogram subjson_operation_times
            /* 177496    8 */    std::unique_ptr<TopKeys, std::default_delete<TopKeys> > topkeys
            /* 177504 1704 */    std::array<Couchbase::RelaxedAtomic<unsigned long>, 213> responseCounters
            /* 179208   64 */    ClusterConfiguration clusterConfiguration
            /* 179272    8 */    unsigned long max_document_size
            /* 179280   56 */    std::unordered_set<cb::engine::Feature, std::hash<cb::engine::Feature>, std::equal_to<cb::engine::Feature>, std::allocator<cb::engine::Feature> > supportedFeatures
            /* 179336    8 */    EngineIface * engine
            /* 179344    8 */    DcpIface * bucketDcp
                          }
            

             Not printing cache lines because Timings is ludicrously big, but the topkeys unique_ptr shares a cache line with the responseCounters array. responseCounters is an array of atomics used to count the number of responses of each status type; the first (and almost certainly the hottest) is SUCCESS, and it is the one sharing the line with topkeys. As topkeys is only read in the sample, this is certainly false sharing. Will move topkeys to just after name, where it will instead share a cache line with the end of name and the start of engine_event_handlers (new layout below, followed by a standalone sketch of the pattern).

            (gdb) pahole Bucket
            /* 179352     */ struct Bucket {
            /*   0   40 */    std::mutex mutex
            /*  40   48 */    std::condition_variable cond
            /*  88    4 */    unsigned int clients
            /*  92    1 */    std::atomic<Bucket::State> state
            /*  93    1 */    Bucket::Type type
            /*  94  101 */    char [101] name
            /* XXX 40 bit hole, try to pack */
            /* 200    8 */    std::unique_ptr<TopKeys, std::default_delete<TopKeys> > topkeys
            /* 208   96 */    std::array<std::vector<engine_event_handler, std::allocator<engine_event_handler> >, 4> engine_event_handlers
            /* 304   24 */    std::vector<thread_stats, std::allocator<thread_stats> > stats
            /* 328 176504 */    Timings timings
            /* 176832  672 */    TimingHistogram subjson_operation_times
            /* 177504 1704 */    std::array<Couchbase::RelaxedAtomic<unsigned long>, 213> responseCounters
            /* 179208   64 */    ClusterConfiguration clusterConfiguration
            /* 179272    8 */    unsigned long max_document_size
            /* 179280   56 */    std::unordered_set<cb::engine::Feature, std::hash<cb::engine::Feature>, std::equal_to<cb::engine::Feature>, std::allocator<cb::engine::Feature> > supportedFeatures
            /* 179336    8 */    EngineIface * engine
            /* 179344    8 */    DcpIface * bucketDcp
                          }
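
             To make the layout change concrete, here is a minimal, self-contained sketch of the pattern. This is not the real Bucket definition: the member names, the 256-byte cold stand-in and the 64-byte cache-line size are assumptions for illustration only.

             #include <atomic>
             #include <cstddef>

             // Illustrative sketch only -- not the real Bucket. `topkeys` stands in for the
             // read-mostly unique_ptr, `responseCounters` for the hot array of relaxed
             // atomics, and `otherColdMembers` for engine_event_handlers/stats/timings.
             namespace before {
             struct Bucket {
                 char name[101];
                 char otherColdMembers[256];
                 const void* topkeys;                              // read on every request (update_topkeys)
                 std::atomic<unsigned long> responseCounters[213]; // [0] (SUCCESS) written on every response
             };
             // With this order the pointer and the hottest counter land on the same
             // 64-byte line, so every SUCCESS increment invalidates the line that the
             // topkeys readers are holding -- false sharing.
             static_assert(offsetof(Bucket, topkeys) / 64 ==
                                   offsetof(Bucket, responseCounters) / 64,
                           "demonstrates the shared cache line");
             } // namespace before

             namespace after {
             struct Bucket {
                 char name[101];
                 const void* topkeys;      // moved up next to the other cold, read-mostly data
                 char otherColdMembers[256];
                 std::atomic<unsigned long> responseCounters[213];
             };
             // The hot counters no longer share a line with the read-mostly pointer.
             static_assert(offsetof(Bucket, topkeys) / 64 !=
                                   offsetof(Bucket, responseCounters) / 64,
                           "hot counters separated from the read-mostly pointer");
             } // namespace after

             The static_asserts just document the intent of the reordering; in the real Bucket the move alone is enough, since topkeys (offset 200) and responseCounters (offset 177504) end up separated by the whole Timings member.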
            


             ben.huddleston Ben Huddleston added a comment -

             I think the rwlock cache line is just normal contention on the rwlock guarding the streams map in the DcpProducer. The only other candidate would be contention on one of the locks inside a StreamContainer; however, in that case we should see a contentious cache line for each such lock (one per StreamContainer, i.e. per active vb). As a possible solution, we could "fix" this by removing the lock guarding the streams map. Each StreamContainer has its own rwlock that we could rely on instead, splitting the contention over a number of cache lines equal to the number of active vbs. This would require a bit of rework to allow a StreamContainer to exist empty, and we would have to pre-populate the streams map with every vb and an empty StreamContainer (see the sketch below).
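
             A minimal sketch of that approach, with hypothetical names (Producer, StreamContainer and Stream below are illustrative stand-ins, and std::shared_mutex stands in for cb::RWLock): the producer-wide streams map becomes a fixed, pre-populated per-vBucket vector, so only the per-vBucket rwlocks are ever taken on the notify path.

             #include <cstddef>
             #include <cstdint>
             #include <memory>
             #include <mutex>
             #include <shared_mutex>
             #include <vector>

             // Hypothetical sketch of the proposal above -- not the real DcpProducer code.
             struct Stream {
                 void notifySeqnoAvailable(uint64_t /*seqno*/) { /* wake the stream, etc. */ }
             };

             class StreamContainer {
             public:
                 // May be empty; one container is created up-front for every vBucket.
                 void notifySeqnoAvailable(uint64_t seqno) {
                     std::shared_lock<std::shared_mutex> rlock(rwlock);
                     for (auto& s : streams) {
                         s->notifySeqnoAvailable(seqno);
                     }
                 }

                 void addStream(std::shared_ptr<Stream> s) {
                     std::unique_lock<std::shared_mutex> wlock(rwlock);
                     streams.push_back(std::move(s));
                 }

             private:
                 mutable std::shared_mutex rwlock; // one lock per vBucket
                 std::vector<std::shared_ptr<Stream>> streams;
             };

             class Producer {
             public:
                 explicit Producer(size_t numVBuckets) : streamsByVb(numVBuckets) {
                     // Pre-populate with empty containers so the outer structure never
                     // changes after construction and needs no lock of its own.
                     for (auto& entry : streamsByVb) {
                         entry = std::make_unique<StreamContainer>();
                     }
                 }

                 // Front-end path: no producer-wide rwlock is taken any more; the
                 // contention is spread across the per-vBucket locks.
                 void notifySeqnoAvailable(uint16_t vbid, uint64_t seqno) {
                     streamsByVb[vbid]->notifySeqnoAvailable(seqno);
                 }

             private:
                 std::vector<std::unique_ptr<StreamContainer>> streamsByVb;
             };

             Because each StreamContainer lives in its own heap allocation, its rwlock naturally sits on a different cache line from the others, which is what spreads the contention across active vbs.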

            drigby Dave Rigby added a comment -

             Ben Huddleston This MB is getting pretty long (and has diverged from the original problem). Suggest you resolve this one (assuming the DcpProducer layout change did fix the initial regression and it's just that we now have new issues) and open new MB(s) for each specific new issue.


             ben.huddleston Ben Huddleston added a comment -

             Created a few separate issues to track performance improvements I've identified while looking into the perf c2c data and the DcpProducer in general (MB-33151, MB-33153, MB-33157).

             The fix for this regression did bring performance back to the old value; however, by the time it was merged the "long pole" in the DcpProducer had shifted, so the fix no longer improves performance. Will fix the other issues mentioned and see whether performance returns to previous values. MB-32389 is a bucket ticket for a general perf regression in kv_engine which may be of interest. Will resolve this issue as the regression has been fixed (the contentious cache line which was introduced is no longer present as of build 2202).


             ben.huddleston Ben Huddleston added a comment -

             Technically this regression is fixed: the symptom found in the perf c2c profile is no longer present. However, the long pole for perf in the write-heavy test has shifted since this regression was fixed, so the perf numbers have not improved.


             People

               ben.huddleston Ben Huddleston
               drigby Dave Rigby
               Votes: 0
               Watchers: 7