Details
- Bug
- Status: Closed
- Critical
- Resolution: Done
- master
- Untriaged
- Yes
Description
As reported by http://showfast.sc.couchbase.com/daily/#/history/KV%7CPillowfight,%2020/80%20R/W,%20256B%20binary%20items%7CMax%20Throughput%20(ops/sec) :
Appears to have occurred between builds 6.5.0-1545 and 6.5.0-1561
Build | Throughput |
---|---|
6.5.0-1561 | 1,916,444 |
6.5.0-1545 | 2,009,289 |
Attachments
Issue Links
- relates to: MB-32389 10% drop in KV throughput over time (Closed)
Activity
Ben Huddleston Could you take a look? Note you can trigger intermediate builds (assuming they exist) by re-running the "Logs" Jenkins link and modifying the build number (e.g. http://perf.jenkins.couchbase.com/job/triton/20192/)
I ran the test against most of the intermediate builds; a few of the builds failed the test, but we have more information. Unfortunately the stats are a bit noisy, with possible variations of at least 40k ops/s.
From the latest set of runs:
Build | ops/s |
---|---|
6.5.0-1561 | 1,954,656 |
6.5.0-1560 | 1,946,356 |
6.5.0-1559 | 1,951,419 |
6.5.0-1558 | 1,938,901 |
6.5.0-1553 | 1,995,609 |
6.5.0-1552 | 1,997,253 |
6.5.0-1551 | 1,979,985 |
6.5.0-1549 | 2,012,679 |
6.5.0-1545 | 2,009,289 |
This would indicate the regression was likely introduced between builds 6.5.0-1553 and 6.5.0-1558. Unfortunately the builds in between fail the test. Changelog between the two builds: http://172.23.123.43:8000/getchangelog?product=couchbase-server&fromb=6.5.0-1553&tob=6.5.0-1558.
Nothing stands out as an obvious performance regression here; many of the changes between these builds are for infrequently run commands that would not affect this test. Ran through the list with Dave Rigby and http://review.couchbase.org/#/c/101191/ looks like a potential candidate for such a regression, the theory being that this could be a cache alignment issue due to the change in size of the ConnHandler class. The ConnHandler class is abstract, so we should also look at the DcpConsumer and DcpProducer classes which inherit from it. See the attached files for the full class layouts: connhandlernew.txt connhandlerold.txt
dcpconsumernew.txt
dcpconsumerold.txt
dcpproducernew.txt
dcpproducerold.txt
These layouts were dumped using the following command, after inserting a global variable that takes the size of the desired class. Not all of the includes may be necessary. They were dumped on macOS, not Linux.
clang++ -Xclang -fdump-record-layouts -I ../../../../cmake-build-debug/kv_engine/engines/ep/src/ -I ../src -I ../../../utilities/ -I ../../../../cmake-build-debug/tlm/deps/boost.exploded/include/ -I ../../../../cmake-build-debug/tlm/deps/json.exploded/include/ -I ../../../../third_party/HdrHistogram_c/src/ -I ../../../include/ -I ../../../../cmake-build-debug/platform/include/ -I ../../../../third_party/gsl-lite/include/ -I ../../../../platform/include/ -I ../../../../cmake-build-debug/kv_engine/include/ -I ../../../../third_party/spdlognew/include/ -I ../../../ -I ../../../../phosphor/include/ -std=c++14 dcp/producer.h
|
const static size_t size = sizeof(DcpProducer);
|
Class | Old Size | New Size |
---|---|---|
ConnHandler | 176 | 88 |
DcpConsumer | 904 | 816 |
DcpProducer | 1280 | 1152 |
Interestingly, I found that DcpProducer decreased in size by a different amount than ConnHandler and DcpConsumer (128 bytes vs 88 bytes). After some digging, I found that the cb::CachelinePadded<struct DcpProducer::CheckpointCreator> was aligned differently, which led me to MB-30488. Will try to investigate further using the tools used in MB-30488.
Spent a chunk of time looking at this without updating here. Let's start with some numbers.
 1553 ops/s | 1558 ops/s |
---|---|
1962776 | 1965001 |
1956093 | 1885716 |
1952747 | 1938901 |
1979286 | 1935830 |
2017269 | 1932842 |
1975075 | 1896938 |
1995609 | 1930222 |
1952298 | 1906424 |
Averages | ======== |
1,973,894.125 | 1,923,984.25 |
So we have a drop of 50k ops/s between 1553 and 1558 over the 8 runs performed. A refactor to ConnHandler was the suspect commit (http://review.couchbase.org/#/c/101191/). Created a toy build to test this out.
 1553 ops/s | 1558 ops/s | Suspect Commit ops/s |
---|---|---|
1962776 | 1965001 | 1931517 |
1956093 | 1885716 | 1931540 |
1952747 | 1938901 | 1933941 |
1979286 | 1935830 | 1959403 |
2017269 | 1932842 | 1934353 |
1975075 | 1896938 | |
1995609 | 1930222 | |
1952298 | 1906424 | |
Averages | ======== | ======== |
1,973,894.125 | 1,923,984.25 | 1,938,150.8 |
Only did 5 runs of this, but it looks like the regression is here. I found a fix for this after playing about with the layout of ConnHandler. Spotted that previously in the DcpProducer the shared_from_this pointer was sharing a cache line with the paused-reason string but not its mutex. This was rolled into an atomic that now sits in a cache line with the start of the DcpProducer. Decided to start by separating these. There are a couple of ways of doing this: you can dump the layout of a class using clang or gcc if you modify the class a little, but you then have to work out how that falls within cache lines. Looked into pahole, which doesn't work easily on mancouch with our binaries. pahole-gdb works out of the box and is easy to add to gdb. It's also just a Python script, so I forked it and modified it to print the cache lines for me. You can find it at https://github.com/BenHuddleston/pahole-gdb/blob/master/pahole.py.
ConnHandler before
(gdb) pahole ConnHandler -c
|
Cache lines are printed every 64 bytes and are a guidance only that are subject to natural alignment of the object.
|
/* 96 */ struct ConnHandler {
|
/* 0 8 */ int (**)(void) _vptr.ConnHandler
|
/* 8 8 */ EventuallyPersistentEngine & engine_
|
/* 16 8 */ EPStats & stats
|
/* 24 16 */ std::shared_ptr<BucketLogger> logger
|
/* 40 32 */ std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > name
|
========================
|
/* 64 32 */ std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > name
|
/* 72 8 */ std::atomic<void*> cookie
|
/* 80 1 */ std::atomic<bool> reserved
|
/* XXX 24 bit hole, try to pack */
|
/* 84 4 */ std::atomic<unsigned int> created
|
/* 88 1 */ std::atomic<bool> disconnect
|
/* 89 1 */ std::atomic<bool> supportAck
|
/* 90 1 */ std::atomic<bool> paused
|
/* XXX 8 bit hole, try to pack */
|
/* 92 4 */ std::atomic<ConnHandler::PausedReason> reason
|
}
|
DcpProducer before
(gdb) pahole DcpProducer -c
|
Cache lines are printed every 64 bytes and are a guidance only that are subject to natural alignment of the object.
|
/* 896 */ struct DcpProducer {
|
/* 0 96 */ ConnHandler ConnHandler
|
========================
|
/* 64 96 */ ConnHandler ConnHandler
|
/* 96 16 */ std::enable_shared_from_this<DcpProducer> std::enable_shared_from_this<DcpProducer>
|
/* 112 24 */ struct {...} noopCtx
|
========================
|
/* 128 24 */ struct {...} noopCtx
|
/* 136 4 */ Couchbase::RelaxedAtomic<unsigned int> lastReceiveTime
|
/* XXX 32 bit hole, try to pack */
|
/* 144 8 */ std::unique_ptr<DcpResponse, std::default_delete<DcpResponse> > rejectResp
|
/* 152 1 */ bool notifyOnly
|
/* 153 1 */ Couchbase::RelaxedAtomic<bool> enableExtMetaData
|
/* 154 1 */ Couchbase::RelaxedAtomic<bool> forceValueCompression
|
/* 155 1 */ Couchbase::RelaxedAtomic<bool> supportsCursorDropping
|
/* 156 1 */ Couchbase::RelaxedAtomic<bool> sendStreamEndOnClientStreamClose
|
/* 157 1 */ Couchbase::RelaxedAtomic<bool> supportsHifiMFU
|
/* 158 1 */ Couchbase::RelaxedAtomic<bool> enableExpiryOpcode
|
/* XXX 8 bit hole, try to pack */
|
/* 160 4 */ Couchbase::RelaxedAtomic<unsigned int> lastSendTime
|
/* XXX 32 bit hole, try to pack */
|
/* 168 88 */ DcpProducer::BufferLog log
|
========================
|
/* 192 88 */ DcpProducer::BufferLog log
|
========================
|
/* 256 16 */ std::shared_ptr<BackfillManager> backfillMgr
|
/* 272 176 */ DcpReadyQueue ready
|
========================
|
/* 320 176 */ DcpReadyQueue ready
|
========================
|
/* 384 176 */ DcpReadyQueue ready
|
========================
|
/* 448 112 */ AtomicUnorderedMap<Vbid, std::shared_ptr<Stream>, std::hash<Vbid>, std::equal_to<Vbid>, std::allocator<std::pair<Vbid const, std::shared_ptr<Stream> > > > streams
|
========================
|
/* 512 112 */ AtomicUnorderedMap<Vbid, std::shared_ptr<Stream>, std::hash<Vbid>, std::equal_to<Vbid>, std::allocator<std::pair<Vbid const, std::shared_ptr<Stream> > > > streams
|
/* 560 8 */ std::atomic<unsigned long> itemsSent
|
/* 568 8 */ std::atomic<unsigned long> totalBytesSent
|
========================
|
/* 576 8 */ std::atomic<unsigned long> totalUncompressedDataSize
|
/* XXX 448 bit hole, try to pack */
|
/* 640 128 */ cb::CachelinePadded<DcpProducer::CheckpointCreator> checkpointCreator
|
========================
|
/* 704 128 */ cb::CachelinePadded<DcpProducer::CheckpointCreator> checkpointCreator
|
========================
|
/* 768 1 */ IncludeValue includeValue
|
/* 769 1 */ IncludeXattrs includeXattrs
|
/* 770 1 */ IncludeDeleteTime includeDeleteTime
|
/* 771 1 */ const bool createChkPtProcessorTsk
|
}
|
ConnHandler is a different size here than it was on macOS. Going to pad ConnHandler out to the end of the cache line it sits in. Note the hole in ConnHandler; this can be filled by moving the reserved member.
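For illustration, a minimal, self-contained sketch of the reorder-and-pad idea (simplified stand-in types, not the actual kv_engine class; member names follow the pahole dumps above). The padding size assumes the x86-64 Linux/libstdc++ layout shown above and would need adjusting on other platforms; the program prints the resulting size so this can be checked.
#include <atomic>
#include <cstdint>
#include <cstdio>
#include <memory>
#include <string>

struct ConnHandlerSketch {
    void* vptrStandIn;                 // stands in for the vtable pointer
    void* engine;                      // EventuallyPersistentEngine&
    void* stats;                       // EPStats&
    std::shared_ptr<int> logger;       // std::shared_ptr<BucketLogger>
    std::string name;
    std::atomic<void*> cookie{nullptr};
    std::atomic<uint32_t> created{0};  // moved before the bools: fills the old 24-bit hole
    std::atomic<bool> reserved{false};
    std::atomic<bool> disconnect{false};
    std::atomic<bool> supportAck{false};
    std::atomic<bool> paused{false};
    std::atomic<int> reason{0};        // ConnHandler::PausedReason
    uint8_t padding[36];               // rounds the Linux layout up to 128 bytes (2 cache lines)
};

int main() {
    // Expect 128 on x86-64 Linux/libstdc++; adjust 'padding' if the layout differs.
    std::printf("sizeof(ConnHandlerSketch) = %zu\n", sizeof(ConnHandlerSketch));
}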
ConnHandler after
(gdb) pahole ConnHandler -c
|
Cache lines are printed every 64 bytes and are a guidance only that are subject to natural alignment of the object.
|
/* 128 */ struct ConnHandler {
|
/* 0 8 */ int (**)(void) _vptr.ConnHandler
|
/* 8 8 */ EventuallyPersistentEngine & engine_
|
/* 16 8 */ EPStats & stats
|
/* 24 16 */ std::shared_ptr<BucketLogger> logger
|
/* 40 32 */ std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > name
|
========================
|
/* 64 32 */ std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > name
|
/* 72 8 */ std::atomic<void*> cookie
|
/* 80 4 */ std::atomic<unsigned int> created
|
/* 84 1 */ std::atomic<bool> reserved
|
/* 85 1 */ std::atomic<bool> disconnect
|
/* 86 1 */ std::atomic<bool> supportAck
|
/* 87 1 */ std::atomic<bool> paused
|
/* 88 4 */ std::atomic<ConnHandler::PausedReason> reason
|
/* 92 36 */ uint8_t [36] padding
|
========================
|
}
|
DcpProducer after
(gdb) pahole DcpProducer -c
|
Cache lines are printed every 64 bytes and are a guidance only that are subject to natural alignment of the object.
|
/* 896 */ struct DcpProducer {
|
/* 0 128 */ ConnHandler ConnHandler
|
========================
|
/* 64 128 */ ConnHandler ConnHandler
|
========================
|
/* 128 16 */ std::enable_shared_from_this<DcpProducer> std::enable_shared_from_this<DcpProducer>
|
/* 144 24 */ struct {...} noopCtx
|
/* 168 4 */ Couchbase::RelaxedAtomic<unsigned int> lastReceiveTime
|
/* XXX 32 bit hole, try to pack */
|
/* 176 8 */ std::unique_ptr<DcpResponse, std::default_delete<DcpResponse> > rejectResp
|
/* 184 1 */ bool notifyOnly
|
/* 185 1 */ Couchbase::RelaxedAtomic<bool> enableExtMetaData
|
/* 186 1 */ Couchbase::RelaxedAtomic<bool> forceValueCompression
|
/* 187 1 */ Couchbase::RelaxedAtomic<bool> supportsCursorDropping
|
/* 188 1 */ Couchbase::RelaxedAtomic<bool> sendStreamEndOnClientStreamClose
|
/* 189 1 */ Couchbase::RelaxedAtomic<bool> supportsHifiMFU
|
/* 190 1 */ Couchbase::RelaxedAtomic<bool> enableExpiryOpcode
|
/* XXX 8 bit hole, try to pack */
|
/* 192 4 */ Couchbase::RelaxedAtomic<unsigned int> lastSendTime
|
/* XXX 32 bit hole, try to pack */
|
/* 200 88 */ DcpProducer::BufferLog log
|
========================
|
/* 256 88 */ DcpProducer::BufferLog log
|
/* 288 16 */ std::shared_ptr<BackfillManager> backfillMgr
|
/* 304 176 */ DcpReadyQueue ready
|
========================
|
/* 320 176 */ DcpReadyQueue ready
|
========================
|
/* 384 176 */ DcpReadyQueue ready
|
========================
|
/* 448 176 */ DcpReadyQueue ready
|
/* 480 112 */ AtomicUnorderedMap<Vbid, std::shared_ptr<Stream>, std::hash<Vbid>, std::equal_to<Vbid>, std::allocator<std::pair<Vbid const, std::shared_ptr<Stream> > > > streams
|
========================
|
/* 512 112 */ AtomicUnorderedMap<Vbid, std::shared_ptr<Stream>, std::hash<Vbid>, std::equal_to<Vbid>, std::allocator<std::pair<Vbid const, std::shared_ptr<Stream> > > > streams
|
========================
|
/* 576 112 */ AtomicUnorderedMap<Vbid, std::shared_ptr<Stream>, std::hash<Vbid>, std::equal_to<Vbid>, std::allocator<std::pair<Vbid const, std::shared_ptr<Stream> > > > streams
|
/* 592 8 */ std::atomic<unsigned long> itemsSent
|
/* 600 8 */ std::atomic<unsigned long> totalBytesSent
|
/* 608 8 */ std::atomic<unsigned long> totalUncompressedDataSize
|
/* XXX 192 bit hole, try to pack */
|
/* 640 128 */ cb::CachelinePadded<DcpProducer::CheckpointCreator> checkpointCreator
|
========================
|
/* 704 128 */ cb::CachelinePadded<DcpProducer::CheckpointCreator> checkpointCreator
|
========================
|
/* 768 1 */ IncludeValue includeValue
|
/* 769 1 */ IncludeXattrs includeXattrs
|
/* 770 1 */ IncludeDeleteTime includeDeleteTime
|
/* 771 1 */ const bool createChkPtProcessorTsk
|
}
|
Results from testing this (http://review.couchbase.org/#/c/103899/)
 1553 ops/s | 1558 ops/s | Suspect Commit ops/s | Padded ops/s |
---|---|---|---|
1962776 | 1965001 | 1931517 | 1967325 |
1956093 | 1885716 | 1931540 | 1961733 |
1952747 | 1938901 | 1933941 | 1982247 |
1979286 | 1935830 | 1959403 | 1996264 |
2017269 | 1932842 | 1934353 | 1991822 |
1975075 | 1896938 | 1960995 | |
1995609 | 1930222 | 2005140 | |
1952298 | 1906424 | 1998894 | |
Averages | ======== | ======== | ======== |
1,973,894.125 | 1,923,984.25 | 1,938,150.8 | 1,983,052.5 |
I also tried padding holes in the DcpProducer and separating the DcpReadyQueue and StreamsMap (http://review.couchbase.org/#/c/103971/3/engines/ep/src/dcp/producer.h), but this regressed performance.
Ben Huddleston Nice!
I was looking for the hot-path that could cause the regression, and this could be the one:
void DcpProducer::scheduleNotify() {
|
engine_.getDcpConnMap().addConnectionToPending(shared_from_this());
|
}
|
At regression, when we write into 'std::atomic<ConnHandler::PausedReason> reason' the cache-line is invalidated. Then the call to shared_from_this() goes to memory.
I believe that here the invalidation happens even when we just read 'reason', given that it is std::atomic. So it happens quite often.
As you said in your comment above:
Spotted that previously in the DcpProducer the shared_from_this pointer was sharing a cache line with the paused-reason string but not its mutex
Now I see that with your change you have even improved compared to pre-regression. That's probably because now there's no sharing at all.
At regression, when we write into 'std::atomic<ConnHandler::PausedReason> reason' the cache-line is invalidated. Then the call to shared_from_this() goes to memory.
I believe that here the invalidation happens even when we just read 'reason', given that it is std::atomic. So it happens quite often.
That's not the case - a cacheline is only marked dirty when it is modified - even for atomic ops. Note it's also perfectly possible for the same cacheline to reside in the clean state in different CPU caches.
Looks like the cause of this one is the moving of the StreamsMap in the DcpProducer.
From build 1553 + the suspect commit we have the following perf c2c data, re-ordered so that it follows the in-memory layout (these two cache lines are contiguous).
#
|
# ----- HITM ----- -- Store Refs -- Data address ---------- cycles ---------- Total cpu Shared
|
# Num Rmt Lcl L1 Hit L1 Miss Offset Code address rmt hitm lcl hitm load records cnt Symbol Object Source:Line Node{cpus %hitms %stores}
|
# ..... ....... ....... ....... ....... .................. .................. ........ ........ ........ ....... ........ ............................................................................ ......... ..................... ....
|
#
|
-------------------------------------------------------------
|
4 186 208 4098 54 0x7f3f9b378a40
|
-------------------------------------------------------------
|
4.30% 8.65% 0.00% 0.00% 0x0 0x7f3f702095c5 291 174 165 621 40 [.] DcpProducer::findStream ep.so hashtable.h:1541 0{40 53.8% n/a} 1{40 46.2% n/a}
|
37.10% 43.27% 0.00% 0.00% 0x8 0x7f3f702095b1 295 187 172 928 40 [.] DcpProducer::findStream ep.so rwlock.h:66 0{40 66.7% n/a} 1{40 33.3% n/a}
|
13.98% 12.02% 41.95% 77.78% 0x38 0x7f3f9d5fb5a7 975 672 461 2860 40 [.] pthread_rwlock_unlock libpthread-2.17.so .:0 0{40 58.8% 62.6%} 1{40 41.2% 37.4%}
|
12.90% 8.65% 10.32% 0.00% 0x38 0x7f3f9d5fb600 953 423 371 1343 40 [.] pthread_rwlock_unlock libpthread-2.17.so .:0 0{40 47.6% 62.6%} 1{40 52.4% 37.4%}
|
9.14% 4.81% 9.44% 0.00% 0x38 0x7f3f9d5fb0b1 781 443 260 1540 40 [.] pthread_rwlock_rdlock libpthread-2.17.so .:0 0{40 77.8% 66.4%} 1{40 22.2% 33.6%}
|
3.76% 4.81% 33.19% 1.85% 0x38 0x7f3f9d5fb05b 632 389 243 2597 40 [.] pthread_rwlock_rdlock libpthread-2.17.so .:0 0{40 52.9% 60.8%} 1{40 47.1% 39.2%}
|
1.61% 2.40% 3.07% 0.00% 0x38 0x7f3f9d5fe42f 906 526 233 401 40 [.] __lll_lock_wait libpthread-2.17.so .:0 0{40 37.5% 57.1%} 1{40 62.5% 42.9%}
|
0.00% 0.00% 0.02% 0.00% 0x38 0x7f3f9d5fe4b2 0 0 0 1 1 [.] __lll_unlock_wake libpthread-2.17.so .:0 1{ 1 n/a 100.0%}
|
9.14% 8.17% 1.10% 16.67% 0x3c 0x7f3f9d5fb5b3 436 174 187 273 40 [.] pthread_rwlock_unlock libpthread-2.17.so .:0 0{40 61.8% 63.0%} 1{40 38.2% 37.0%}
|
8.06% 7.21% 0.90% 3.70% 0x3c 0x7f3f9d5fb0ac 377 137 194 1077 40 [.] pthread_rwlock_rdlock libpthread-2.17.so .:0 0{40 73.3% 61.5%} 1{40 26.7% 38.5%}
|
-------------------------------------------------------------
|
2 396 259 71 14 0x7f3f9b378a80
|
-------------------------------------------------------------
|
0.25% 0.00% 0.00% 0.00% 0x8 0x7f3f9d5fb5ce 238 0 185 32 22 [.] pthread_rwlock_unlock libpthread-2.17.so .:0 0{22 100.0% n/a} 1{22 0.0% n/a}
|
6.31% 3.47% 0.00% 0.00% 0xc 0x7f3f9d5fb068 304 128 179 968 40 [.] pthread_rwlock_rdlock libpthread-2.17.so .:0 0{40 61.8% n/a} 1{40 38.2% n/a}
|
88.13% 93.82% 0.00% 0.00% 0x10 0x7f3f9d5fb061 284 121 119 884 40 [.] pthread_rwlock_rdlock libpthread-2.17.so .:0 0{40 55.2% n/a} 1{40 44.8% n/a}
|
3.03% 1.54% 0.00% 0.00% 0x10 0x7f3f9d5fb5ad 277 144 128 117 39 [.] pthread_rwlock_unlock libpthread-2.17.so .:0 0{39 43.8% n/a} 1{39 56.2% n/a}
|
0.00% 0.00% 2.82% 100.00% 0x10 0x7f3f9d5fb5b8 0 0 0 16 14 [.] pthread_rwlock_unlock libpthread-2.17.so .:0 0{14 n/a 56.2%} 1{14 n/a 43.8%}
|
1.26% 1.16% 0.00% 0.00% 0x14 0x7f3f9d5fb0ba 416 370 179 29 18 [.] pthread_rwlock_rdlock libpthread-2.17.so .:0 0{18 50.0% n/a} 1{18 50.0% n/a}
|
0.76% 0.00% 0.00% 0.00% 0x14 0x7f3f9d5fb612 240 0 246 20 16 [.] pthread_rwlock_unlock libpthread-2.17.so .:0 0{16 66.7% n/a} 1{16 33.3% n/a}
|
0.00% 0.00% 43.66% 0.00% 0x30 0x7f3f7020d4ae 0 0 337 73 27 [.] DcpProducer::step ep.so atomic_base.h:514 0{27 n/a 45.2%} 1{27 n/a 54.8%}
|
0.25% 0.00% 53.52% 0.00% 0x38 0x7f3f7020d4c3 1003 0 299 86 27 [.] DcpProducer::step ep.so atomic_base.h:514 0{27 100.0% 78.9%} 1{27 0.0% 21.1%}
|
From the first cache line we can see that the hash table in the unordered map of the StreamsMap is sharing with the start of the mutex that locks access to it. All the reads from the hash table (first two columns) are contending with reads and writes to the mutex. This is false sharing. In the second cache line, the mutex members (mainly read) contend with writes to the atomic variables that follow (itemsSent and totalBytesSent), again, false sharing. To fix this, we can wrap the mutex in the streams map with some padding. This gave an average ops/s over 9 runs of 1,972,524. This is within 2k of build 1553 (last good build) which is back to pre-regression values.
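For illustration, a minimal sketch of the "wrap the mutex with some padding" idea, using a simplified map shape and a std::mutex as a stand-in for the rwlock in the real streams map (an assumed shape, not the actual AtomicUnorderedMap change): alignas(64) gives the lock its own cache line, so reads of the hash table no longer false-share with updates to the lock word.
#include <mutex>
#include <unordered_map>

template <class Key, class Value>
class PaddedMutexMapSketch {
public:
    void insert(const Key& k, const Value& v) {
        std::lock_guard<std::mutex> guard(lock.m);
        map[k] = v;
    }
    bool find(const Key& k, Value& out) const {
        std::lock_guard<std::mutex> guard(lock.m);
        auto it = map.find(k);
        if (it == map.end()) {
            return false;
        }
        out = it->second;
        return true;
    }

private:
    // Padded wrapper: the lock starts on its own cache line and the 64-byte
    // alignment keeps the hash table that follows off that line as well.
    struct alignas(64) PaddedMutex {
        mutable std::mutex m;
    };
    PaddedMutex lock;
    std::unordered_map<Key, Value> map;
};

int main() {
    PaddedMutexMapSketch<int, int> streams;
    streams.insert(0, 42);
    int value = 0;
    return streams.find(0, value) ? 0 : 1;
}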
In looking at this regression, the following cache line was of interest.
#
|
# ----- HITM ----- -- Store Refs -- Data address ---------- cycles ---------- Total cpu Shared
|
# Num Rmt Lcl L1 Hit L1 Miss Offset Code address rmt hitm lcl hitm load records cnt Symbol Object Source:Line Node{cpus %hitms %stores}
|
# ..... ....... ....... ....... ....... .................. .................. ........ ........ ........ ....... ........ ............................................................................ ......... ..................... ....
|
#
|
-------------------------------------------------------------
|
0 467 473 1377 0 0x7f3f9b378800
|
-------------------------------------------------------------
|
95.07% 91.54% 0.00% 0.00% 0x8 0x7f3f701d9a02 274 127 170 934 40 [.] ActiveStream::notifyStreamReady ep.so shared_ptr_base.h:202 0{40 57.2% n/a} 1{40 42.8% n/a}
|
1.50% 2.96% 47.57% 0.00% 0x8 0x7f3f701d9a10 707 362 277 1672 40 [.] ActiveStream::notifyStreamReady ep.so shared_ptr_base.h:294 0{40 61.9% 60.8%} 1{40 38.1% 39.2%}
|
1.71% 1.90% 49.60% 0.00% 0x8 0x4282be 972 570 286 1689 40 [.] std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release memcached shared_ptr_base.h:151 0{40 70.6% 62.1%} 1{40 29.4% 37.9%}
|
1.28% 1.48% 0.00% 0.00% 0x8 0x7f3f701d95a1 266 111 206 29 19 [.] std::__shared_ptr<DcpProducer, (__gnu_cxx::_Lock_policy)2>::__shared_ptr ep.so shared_ptr_base.h:202 0{19 69.2% n/a} 1{19 30.8% n/a}
|
0.43% 1.48% 0.00% 0.00% 0x8 0x7f3f701d9a16 382 205 285 1092 40 [.] ActiveStream::notifyStreamReady ep.so shared_ptr_base.h:202 0{40 66.7% n/a} 1{40 33.3% n/a}
|
0.00% 0.42% 0.00% 0.00% 0x8 0x7f3f701d95ba 0 487 217 44 24 [.] std::__shared_ptr<DcpProducer, (__gnu_cxx::_Lock_policy)2>::__shared_ptr ep.so shared_ptr_base.h:202 0{24 100.0% n/a} 1{24 0.0% n/a}
|
0.00% 0.21% 2.83% 0.00% 0x8 0x7f3f701d95ab 0 180 200 77 29 [.] std::__shared_ptr<DcpProducer, (__gnu_cxx::_Lock_policy)2>::__shared_ptr ep.so shared_ptr_base.h:294 0{29 100.0% 79.5%} 1{29 0.0% 20.5%}
|
This cache line features heavily in every perf c2c profile (even in the good builds). The contention on this cache line is coming from the shared pointer control block of a DcpProducer. Every ActiveStream will attempt to call notifyStreamReady when it has more items to send, which will attempt to lock its weak pointer to the DcpProducer, which increments the shared ref count in the shared pointer control block. The weak pointer in ActiveStream was converted from a shared pointer to prevent a memory leak if the ActiveStream was not released by a DcpProducer (they own references to each other). Will be looking into reverting this change.
Jim Walker spotted that we make a check in ActiveStream::notifyStreamReady after grabbing the shared pointer to the DcpProducer. We check if there are any items ready to send and, if not, do nothing, so grabbing a shared pointer to the DcpProducer is pointless in this case. Moving the locking of the ActiveStream's weak pointer into the condition (http://review.couchbase.org/#/c/104255/1/engines/ep/src/dcp/active_stream.cc), I ran another 8 perf tests. Results are as follows (a rough sketch of the reordering is included after them).
1553 | 1553 + Regression + Regression Fix | 1553 + Regression + Regression Fix + Swapping condition order |
---|---|---|
1962776 | 1978871 | 1991917 |
1956093 | 1977653 | 1978745 |
1952747 | 1945110 | 2016141 |
1979286 | 1997571 | 1965412 |
2017269 | 1951145 | 1967976 |
1975075 | 1964446 | 2010578 |
1995609 | 1988631 | 1985555 |
1952298 | 1969411 | 1956033 |
1979882 | ||
Averages | ======== | ======= |
1973894.125 | 1972524.444 | 1984044.625 |
The number of runs is relatively low so I wouldn't trust these averages with 100% certainty, but this change does appear to increase performance very slightly. I took some more perf c2c samples; this cache line previously featured in the top 3 most contended in every sample, and it now does not feature at all, so there is no longer any contention on this cache line.
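For reference, a rough, self-contained sketch of the reordering described above, with assumed names (the real change is in the review linked above): test the cheap, stream-local flag before weak_ptr::lock(), so the shared-count bump on the producer's control block only happens when there is actually something to send.
#include <atomic>
#include <cstdio>
#include <memory>

struct ProducerSketch {
    void notifyStreamReady() { std::puts("producer notified"); }
};

class ActiveStreamSketch {
public:
    explicit ActiveStreamSketch(std::weak_ptr<ProducerSketch> p) : producerPtr(std::move(p)) {}

    void notifyStreamReady() {
        // Old order: lock the weak_ptr first, then test itemsReady.
        // New order: bail out before touching the producer's shared_ptr at all.
        bool expected = false;
        if (!itemsReady.compare_exchange_strong(expected, true)) {
            return;
        }
        if (auto producer = producerPtr.lock()) {
            producer->notifyStreamReady();
        }
    }

private:
    std::weak_ptr<ProducerSketch> producerPtr;
    std::atomic<bool> itemsReady{false};
};

int main() {
    auto producer = std::make_shared<ProducerSketch>();
    ActiveStreamSketch stream{producer};
    stream.notifyStreamReady();  // first call notifies the producer
    stream.notifyStreamReady();  // flag already set: no shared_ptr acquisition
}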
Some other potential inefficiencies were spotted on the hot path in ActiveStream, so I'll investigate a few more of those while this is fresh in my mind.
Build couchbase-server-6.5.0-2202 contains kv_engine commit 30b4908 with commit message:
MB-32107: Reduce false sharing of cache lines in DcpProducer
Build couchbase-server-6.5.0-2207 contains kv_engine commit 154d35e with commit message:
MB-32107: Reduce cache line contention by moving DcpProducer acquisition
One other possible simplification spotted was to remove the recursive call to ActiveStream::processItems via ActiveStream::nextCheckpointItemTask (http://review.couchbase.org/#/c/104256/). This is only made if it can't find any items on the current iteration of the task. It seems wasteful to make multiple function calls just to check whether there are any more items to process, instead of waiting for the next run of the task. A rough sketch of the idea is included after the results below.
1553 | 1553 + Regression + Regression Fix | 1553 + Regression + Regression Fix + Removing recursive call |
---|---|---|
1962776 | 1978871 | 1970412 |
1956093 | 1977653 | 1996668 |
1952747 | 1945110 | 1994747 |
1979286 | 1997571 | 1956121 |
2017269 | 1951145 | 1983625 |
1975075 | 1964446 | 1966830 |
1995609 | 1988631 | 1994766 |
1952298 | 1969411 | 1969566 |
 | 1979882 |  |
Averages | ======== | ======== |
1973894.125 | 1972524.444 | 1979091.88 |
Given this is just an average over a relatively low number of runs, I wouldn't expect this to actually give any noticeable increase in ops/s; however it doesn't look to have regressed performance and it makes the code marginally simpler.
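For illustration, a very rough sketch of the recursive-call removal described above, with assumed names and structure rather than the actual ActiveStream code: if nothing is found on this run of the task, simply return and wait for the next wakeup rather than immediately re-checking.
#include <cstdio>
#include <vector>

class StreamTaskSketch {
public:
    void nextCheckpointItemTask() {
        auto items = getOutstandingItems();
        if (items.empty()) {
            // Previously the code would call back into the item-processing path
            // to re-check immediately; now it just waits for the next run.
            return;
        }
        processItems(items);
    }

private:
    std::vector<int> getOutstandingItems() { return {}; }
    void processItems(const std::vector<int>& items) {
        std::printf("processing %zu items\n", items.size());
    }
};

int main() {
    StreamTaskSketch task;
    task.nextCheckpointItemTask();  // finds nothing and returns; no recursive re-check
}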
Build couchbase-server-6.5.0-2229 contains kv_engine commit 13716fa with commit message:
MB-32107: Remove recursive nextCheckpointItemTask call in ActiveStream
Bit of a brain dump.
Would have expected build 2202 to improve perf; however, after running it repeatedly it does not. The fix actually seems to be making things worse. Unfortunately the previous ~20 builds were broken, so getting perf numbers is becoming a bit of a pain.
2202 | 2202 + a revert of the regression fix |
---|---|
 | 1840475 |
 | 1830449 |
1820631 | 1803297 |
1811079 | 1833983 |
1832017 | 1837549 |
1822782 | 1839338 |
1825298 | 1834332 |
1807678 | 1814464 |
1842862 | 1851247 |
1814444 | 1843619 |
1829581 | 1839287 |
Averages | ======== |
1822930.222 | 1833458.182 |
Profiled the build with 2202 + the revert using perf c2c again. This cache line is now pretty hot. It's another producer shared_ptr acquisition.
-------------------------------------------------------------
|
0 392 408 1173 0 0x7fde45841d00
|
-------------------------------------------------------------
|
94.39% 88.24% 0.00% 0.00% 0x8 0x7fde49b93e22 264 115 151 801 40 [.] ActiveStream::isCompressionEnabled ep.so
|
1.02% 6.62% 45.52% 0.00% 0x8 0x7fde49b93e30 971 395 271 1434 40 [.] ActiveStream::isCompressionEnabled ep.so
|
2.55% 2.21% 0.00% 0.00% 0x8 0x7fde49b93981 274 128 180 26 20 [.] fmt::v5::internal::format_decimal<unsigned long, char, fmt::v5::internal::no_thousands_sep> ep.so
|
2.04% 2.21% 51.49% 0.00% 0x8 0x4299fe 733 486 326 1576 40 [.] 0x00000000000299fe memcached
|
0.00% 0.49% 0.00% 0.00% 0x8 0x7fde49b93e36 0 442 281 929 40 [.] ActiveStream::isCompressionEnabled ep.so
|
0.00% 0.25% 2.98% 0.00% 0x8 0x7fde49b9398b 0 282 159 64 29 [.] fmt::v5::internal::format_decimal<unsigned long, char, fmt::v5::internal::no_thousands_sep> ep.so
|
I don't think I've seen this cache line in a profile before, so it may be of note.
-------------------------------------------------------------
|
2 394 252 343 0 0x7cfc80
|
-------------------------------------------------------------
|
96.70% 97.62% 0.00% 0.00% 0x30 0x48f23d 296 149 160 814 40 [.] 0x000000000008f23d memcached mcbp_executors.cc:479 0{40 58.7% n/a} 1{40 41.3% n/a}
|
3.30% 2.38% 100.00% 0.00% 0x38 0x48d7a0 437 337 237 1579 40 [.] mcbp_add_header memcached mcbp.cc:82 0{40 63.2% 63.3%} 1{40 36.8% 36.7%}
|
Also interested in this cache line. Not entirely sure what it's hitting.
-------------------------------------------------------------
|
4 105 123 3132 41 0x7fde45841fc0
|
-------------------------------------------------------------
|
16.19% 29.27% 45.88% 78.05% 0x0 0x7fde4f2d15a7 937 633 441 2432 40 [.] pthread_rwlock_unlock libpthread-2.17.so .:0 0{40 75.5% 64.5%} 1{40 24.5% 35.5%}
|
11.43% 12.20% 10.38% 0.00% 0x0 0x7fde4f2d10b1 688 399 258 1294 40 [.] pthread_rwlock_rdlock libpthread-2.17.so .:0 0{40 51.9% 67.4%} 1{40 48.1% 32.6%}
|
10.48% 8.13% 10.60% 0.00% 0x0 0x7fde4f2d1600 660 403 377 1137 40 [.] pthread_rwlock_unlock libpthread-2.17.so .:0 0{40 57.1% 61.1%} 1{40 42.9% 38.9%}
|
5.71% 2.44% 28.10% 2.44% 0x0 0x7fde4f2d105b 897 469 240 1917 40 [.] pthread_rwlock_rdlock libpthread-2.17.so .:0 0{40 55.6% 51.0%} 1{40 44.4% 49.0%}
|
3.81% 3.25% 2.30% 0.00% 0x0 0x7fde4f2d442f 403 608 244 265 40 [.] __lll_lock_wait libpthread-2.17.so .:0 0{40 62.5% 59.7%} 1{40 37.5% 40.3%}
|
2.86% 0.81% 1.21% 14.63% 0x4 0x7fde4f2d15b3 388 119 181 275 40 [.] pthread_rwlock_unlock libpthread-2.17.so .:0 0{40 50.0% 68.2%} 1{40 50.0% 31.8%}
|
0.95% 1.63% 0.32% 4.88% 0x4 0x7fde4f2d10ac 254 556 188 946 40 [.] pthread_rwlock_rdlock libpthread-2.17.so .:0 0{40 66.7% 58.3%} 1{40 33.3% 41.7%}
|
0.00% 2.44% 0.00% 0.00% 0x14 0x7fde4f2d1068 0 177 196 920 40 [.] pthread_rwlock_rdlock libpthread-2.17.so .:0 0{40 66.7% n/a} 1{40 33.3% n/a}
|
19.05% 10.57% 0.00% 0.00% 0x18 0x7fde4f2d15ad 512 127 176 226 39 [.] pthread_rwlock_unlock libpthread-2.17.so .:0 0{39 57.6% n/a} 1{39 42.4% n/a}
|
11.43% 6.50% 0.00% 0.00% 0x18 0x7fde4f2d1061 308 126 191 954 40 [.] pthread_rwlock_rdlock libpthread-2.17.so .:0 0{40 55.0% n/a} 1{40 45.0% n/a}
|
0.00% 0.00% 0.29% 0.00% 0x18 0x7fde4f2d15b8 0 0 0 9 7 [.] pthread_rwlock_unlock libpthread-2.17.so .:0 0{ 7 n/a 33.3%} 1{ 7 n/a 66.7%}
|
6.67% 8.13% 0.00% 0.00% 0x1c 0x7fde4f2d10ba 396 216 332 21 17 [.] pthread_rwlock_rdlock libpthread-2.17.so .:0 0{17 47.1% n/a} 1{17 52.9% n/a}
|
3.81% 8.13% 0.00% 0.00% 0x1c 0x7fde4f2d1608 551 234 312 16 13 [.] pthread_rwlock_unlock libpthread-2.17.so .:0 0{13 78.6% n/a} 1{13 21.4% n/a}
|
4.76% 2.44% 0.00% 0.00% 0x1c 0x7fde4f2d10d5 622 168 154 59 32 [.] pthread_rwlock_rdlock libpthread-2.17.so .:0 0{32 37.5% n/a} 1{32 62.5% n/a}
|
1.90% 3.25% 0.00% 0.00% 0x1c 0x7fde4f2d1612 541 139 164 32 25 [.] pthread_rwlock_unlock libpthread-2.17.so .:0 0{25 50.0% n/a} 1{25 50.0% n/a}
|
0.95% 0.81% 0.93% 0.00% 0x38 0x7fde49bd90fd 298 140 226 68 29 [.] DcpProducer::step ep.so producer.cc:668 0{29 0.0% 75.9%} 1{29 100.0% 24.1%}
|
Going to keep looking into producer/active stream changes around this time.
re: ActiveStream::isCompressionEnabled:
bool ActiveStream::isCompressionEnabled() {
|
auto producer = producerPtr.lock();
|
if (producer) {
|
return producer->isCompressionEnabled();
|
}
|
/* If the 'producer' is deleted, what we return doesn't matter */
|
return false;
|
}
|
Given that the producer can only change its compression via a dcp_control message, and for other similar control params we simply take the "current" value of that param to use when a stream is created (e.g. IncludeDeleteTime), one optimisation would be to handle compression in the same way: give ActiveStream its own member variable set on construction, which avoids having to lock the producer.
Technically this now means you cannot change compression enabled dynamically on an existing stream, but I don't think we ever documented that working before, and it's an internal-only setting anyway (used by Backup & XDCR).
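A minimal sketch of that suggestion, with assumed names rather than the actual kv_engine classes: snapshot the producer's compression setting when the stream is constructed, so the hot path reads a plain member instead of locking the weak pointer to the producer.
#include <memory>

struct ProducerSketch {
    bool isCompressionEnabled() const { return compressionEnabled; }
    bool compressionEnabled = true;
};

class ActiveStreamSketch {
public:
    explicit ActiveStreamSketch(const std::shared_ptr<ProducerSketch>& producer)
        : producerPtr(producer),
          // Snapshot taken once at stream creation; later changes on the producer
          // no longer affect this (existing) stream.
          compressionEnabled(producer->isCompressionEnabled()) {
    }

    bool isCompressionEnabled() const {
        // No producerPtr.lock() on the hot path any more.
        return compressionEnabled;
    }

private:
    std::weak_ptr<ProducerSketch> producerPtr;
    const bool compressionEnabled;
};

int main() {
    auto producer = std::make_shared<ProducerSketch>();
    ActiveStreamSketch stream(producer);
    return stream.isCompressionEnabled() ? 0 : 1;
}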
FYI John Liang / Patrick Varley - would making force_value_compression only applicable to streams created after it (and no longer being dynamic) be a problem?
Backup does not use "force_value_compression" mode, as you advised against it a while back (I would like to try it at some point); it uses the normal DCP compression mode. Backup also does not dynamically change DCP stream settings.
Other changes that may impact performance due to the dcp producer layout changing include:
- Build 2183 - Add consumer name to producer - http://review.couchbase.org/#/c/104133/
- Build 1977 - Change backfill manager from shared_ptr to a custom atomic shared pointer - http://review.couchbase.org/#/c/103209/
- Build 1818 - Change StreamsMap for multiple streams per vb - http://review.couchbase.org/#/c/101748/
Current perf runs from around build 1977 don't suggest any issue, in fact performance seems to increase with the next build tested (1987). Will rerun some tests around here and investigate number 2 if there is some regression.
I think we can rule out number 3 as far as layout is concerned: the actual map that sits in the producer should not have changed, just the values inside it, which will be off at some other memory location. Will run pahole/ptype to confirm. It may be worth considering cacheline padding the mutexes in StreamContainer though, as they probably share with the forward list.
Of course, the issues could be non layout related.
Spotted that the mutex in the AtomicSharedPtr to the BackfillManager is straddling two cache lines post build 2183.
========================
|
/* 192 32 */ std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > consumerName
|
/* 200 4 */ Couchbase::RelaxedAtomic<unsigned int> lastSendTime
|
/* XXX 32 bit hole, try to pack */
|
/* 208 88 */ DcpProducer::BufferLog log
|
========================
|
/* 256 88 */ DcpProducer::BufferLog log
|
/* 296 56 */ cb::AtomicSharedPtr<BackfillManager> backfillMgr
|
========================
|
/* 320 56 */ cb::AtomicSharedPtr<BackfillManager> backfillMgr
|
/* 352 176 */ DcpReadyQueue ready
|
========================
|
(gdb) pahole -c cb::AtomicSharedPtr<BackfillManager>
|
Cache lines are printed every 64 bytes and are a guidance only that are subject to natural alignment of the object.
|
/* 56 */ struct cb::AtomicSharedPtr<BackfillManager> {
|
/* 0 40 */ std::mutex mutex
|
/* 40 16 */ std::shared_ptr<BackfillManager> ptr
|
}
|
What's it sharing with?
On the buffer log side it's now sharing with part of the RWLock, the producer reference, maxBytes, bytesOutstanding, and ackedBytes. Quite possibly an issue, but depends on the usage. Notably, this RWLock is now also straddling two cache lines.
(gdb) pahole DcpProducer::BufferLog
|
/* 88 */ struct DcpProducer::BufferLog {
|
/* 0 56 */ cb::RWLock logLock
|
/* 56 8 */ DcpProducer & producer
|
/* 64 8 */ unsigned long maxBytes
|
/* 72 8 */ cb::NonNegativeCounter<unsigned long, cb::ThrowExceptionUnderflowPolicy> bytesOutstanding
|
/* 80 8 */ Monotonic<unsigned long, ThrowExceptionPolicy, std::greater> ackedBytes
|
}
|
On the ready queue side, the mutex and the shared_ptr of the AtomicSharedPtr are sharing with part of the mutex in the DcpReadyQueue. Notably, this mutex is now also straddling a cache line.
(gdb) pahole DcpReadyQueue
|
/* 176 */ struct DcpReadyQueue {
|
/* 0 40 */ std::mutex lock
|
/* 40 80 */ std::queue<Vbid, std::deque<Vbid, std::allocator<Vbid> > > readyQueue
|
/* 120 56 */ std::unordered_set<Vbid, std::hash<Vbid>, std::equal_to<Vbid>, std::allocator<Vbid> > queuedValues
|
}
|
Will try to correlate this with the perf c2c data for build 2202.
Also confirmed there are no unexpected changes to layout of StreamsMap with the StreamContainer changes.
Ran a load of perf tests with slightly different fixes for the issues in DcpProducer.
2202 | 2202 + move consumer name to end | 2202 + cache line pad mutex in atomic shared ptr | 2202 + move string and pad atomic shared ptr | 2202 + cache line pad rw lock in buffer log | 2202 + pad atomic shared ptr and buffer log lock |
---|---|---|---|---|---|
1820631 | | | 1831132 | | |
1811079 | Â | 1844199 | 1821569 | 1838826 | 1864285 |
1832017 | 1829126 | 1856378 | 1846906 | 1852154 | 1817876 |
1822782 | 1839222 | 1837934 | 1848847 | 1807318 | 1840398 |
1825298 | 1825093 | 1841431 | 1842268 | 1802737 | 1820714 |
1807678 | 1835437 | 1841422 | 1822030 | 1814812 | 1827543 |
1842862 | 1852079 | 1812351 | 1847608 | 1868165 | 1797235 |
1814444 | 1855286 | 1825289 | 1842034 | 1837529 | 1845806 |
1829581 | 1833542 | 1850124 | 1844421 | 1842462 | 1830152 |
Averages | ======== | ======== | ======== | ======== | ======== |
1822930.222 | 1838540.714 | 1838641 | 1838535 | 1833000.375 | 1830501.125 |
I think the Producer might be starting to suffer from poor locality, as performance seems to be dropping off with size now (could just be not enough data though). Would suggest cache line padding the mutex in the AtomicSharedPtr, as it's more robust than moving the consumer name to the end of the class. Still not getting back the ~50k ops/s or so that we lost.
Got some info on the RWLock contentious cache line using the callgraph option of perf c2c.
-------------------------------------------------------------
|
3 22 33 724 8 0x7f9b4e523d80
|
-------------------------------------------------------------
|
27.27% 18.18% 47.38% 62.50% 0x0 0x7f9c23c675a7 1233 767 431 562 40 [.] pthread_rwlock_unlock libpthread-2.17.so .:0 0{40 58.3% 64.7%} 1{40 41.7% 35.3%}
|
|
|
---__clone
|
start_thread
|
platform_thread_wrap
|
worker_libevent
|
event_base_loop
|
event_process_active_single_queue.isra.26
|
event_handler
|
run_event_loop
|
Connection::runEventLoop
|
Connection::runStateMachinery
|
|
|
|--45.62%--StateMachine::conn_execute
|
| Cookie::execute
|
| execute_client_request_packet
|
| set_executor
|
| SteppableCommandContext::drive
|
| MutationCommandContext::step
|
| MutationCommandContext::storeItem
|
| bucket_store_if
|
| EventuallyPersistentEngine::store_if
|
| EventuallyPersistentEngine::storeIfInner
|
| KVBucket::set
|
| VBucket::set
|
| DcpConnMap::notifyVBConnections
|
| DcpProducer::notifySeqnoAvailable
|
| cb::RWLock::readerUnlock
|
| cb_rw_reader_exit
|
| pthread_rwlock_unlock
|
|
|
--1.85%--StateMachine::conn_ship_log
|
non-virtual thunk to EventuallyPersistentEngine::step(gsl::not_null<void const*>, gsl::not_null<dcp_message_producers*>)
|
DcpProducer::step
|
DcpProducer::getNextItem
|
cb::RWLock::readerUnlock
|
cb_rw_reader_exit
|
pthread_rwlock_unlock
|
|
22.73% 0.00% 10.64% 0.00% 0x0 0x7f9c23c67600 599 0 383 248 40 [.] pthread_rwlock_unlock libpthread-2.17.so .:0 0{40 40.0% 62.3%} 1{40 60.0% 37.7%}
|
|
|
---__clone
|
start_thread
|
platform_thread_wrap
|
worker_libevent
|
event_base_loop
|
event_process_active_single_queue.isra.26
|
event_handler
|
run_event_loop
|
Connection::runEventLoop
|
Connection::runStateMachinery
|
|
|
|--10.20%--StateMachine::conn_execute
|
| Cookie::execute
|
| execute_client_request_packet
|
| set_executor
|
| SteppableCommandContext::drive
|
| MutationCommandContext::step
|
| MutationCommandContext::storeItem
|
| bucket_store_if
|
| EventuallyPersistentEngine::store_if
|
| EventuallyPersistentEngine::storeIfInner
|
| KVBucket::set
|
| VBucket::set
|
| DcpConnMap::notifyVBConnections
|
| DcpProducer::notifySeqnoAvailable
|
| cb::RWLock::readerUnlock
|
| cb_rw_reader_exit
|
| pthread_rwlock_unlock
|
|
|
--0.31%--StateMachine::conn_ship_log
|
non-virtual thunk to EventuallyPersistentEngine::step(gsl::not_null<void const*>, gsl::not_null<dcp_message_producers*>)
|
DcpProducer::step
|
DcpProducer::getNextItem
|
cb::RWLock::readerUnlock
|
cb_rw_reader_exit
|
pthread_rwlock_unlock
|
|
4.55% 9.09% 2.62% 0.00% 0x0 0x7f9c23c6a42f 453 476 256 75 34 [.] __lll_lock_wait libpthread-2.17.so .:0 0{34 75.0% 52.6%} 1{34 25.0% 47.4%}
|
|
|
---__clone
|
start_thread
|
platform_thread_wrap
|
worker_libevent
|
event_base_loop
|
event_process_active_single_queue.isra.26
|
event_handler
|
run_event_loop
|
Connection::runEventLoop
|
Connection::runStateMachinery
|
|
|
|--2.54%--StateMachine::conn_execute
|
| Cookie::execute
|
| execute_client_request_packet
|
| set_executor
|
| SteppableCommandContext::drive
|
| MutationCommandContext::step
|
| MutationCommandContext::storeItem
|
| bucket_store_if
|
| EventuallyPersistentEngine::store_if
|
| EventuallyPersistentEngine::storeIfInner
|
| KVBucket::set
|
| VBucket::set
|
| DcpConnMap::notifyVBConnections
|
| DcpProducer::notifySeqnoAvailable
|
| |
|
| |--1.89%--cb::RWLock::readerLock
|
| | cb_rw_reader_enter
|
| | pthread_rwlock_rdlock
|
| | __lll_lock_wait
|
| |
|
| --0.65%--cb::RWLock::readerUnlock
|
| cb_rw_reader_exit
|
| pthread_rwlock_unlock
|
| __lll_lock_wait
|
|
|
--0.31%--StateMachine::conn_ship_log
|
non-virtual thunk to EventuallyPersistentEngine::step(gsl::not_null<void const*>, gsl::not_null<dcp_message_producers*>)
|
DcpProducer::step
|
DcpProducer::getNextItem
|
|
|
|--0.31%--cb::RWLock::readerLock
|
| cb_rw_reader_enter
|
| pthread_rwlock_rdlock
|
| __lll_lock_wait
|
|
|
--0.00%--cb::RWLock::readerUnlock
|
cb_rw_reader_exit
|
pthread_rwlock_unlock
|
__lll_lock_wait
|
|
4.55% 6.06% 9.25% 0.00% 0x0 0x7f9c23c670b1 789 1004 261 304 40 [.] pthread_rwlock_rdlock libpthread-2.17.so .:0 0{40 100.0% 67.2%} 1{40 0.0% 32.8%}
|
|
|
---__clone
|
start_thread
|
platform_thread_wrap
|
worker_libevent
|
event_base_loop
|
event_process_active_single_queue.isra.26
|
event_handler
|
run_event_loop
|
Connection::runEventLoop
|
Connection::runStateMachinery
|
|
|
|--8.33%--StateMachine::conn_execute
|
| Cookie::execute
|
| execute_client_request_packet
|
| set_executor
|
| SteppableCommandContext::drive
|
| MutationCommandContext::step
|
| MutationCommandContext::storeItem
|
| bucket_store_if
|
| EventuallyPersistentEngine::store_if
|
| EventuallyPersistentEngine::storeIfInner
|
| KVBucket::set
|
| VBucket::set
|
| DcpConnMap::notifyVBConnections
|
| DcpProducer::notifySeqnoAvailable
|
| cb::RWLock::readerLock
|
| cb_rw_reader_enter
|
| pthread_rwlock_rdlock
|
|
|
--0.64%--StateMachine::conn_ship_log
|
non-virtual thunk to EventuallyPersistentEngine::step(gsl::not_null<void const*>, gsl::not_null<dcp_message_producers*>)
|
DcpProducer::step
|
DcpProducer::getNextItem
|
cb::RWLock::readerLock
|
cb_rw_reader_enter
|
pthread_rwlock_rdlock
|
|
0.00% 6.06% 28.59% 12.50% 0x0 0x7f9c23c6705b 0 199 246 451 40 [.] pthread_rwlock_rdlock libpthread-2.17.so .:0 0{40 100.0% 61.1%} 1{40 0.0% 38.9%}
|
|
|
---__clone
|
start_thread
|
platform_thread_wrap
|
worker_libevent
|
event_base_loop
|
event_process_active_single_queue.isra.26
|
event_handler
|
run_event_loop
|
Connection::runEventLoop
|
Connection::runStateMachinery
|
|
|
|--26.42%--StateMachine::conn_execute
|
| Cookie::execute
|
| execute_client_request_packet
|
| set_executor
|
| SteppableCommandContext::drive
|
| MutationCommandContext::step
|
| MutationCommandContext::storeItem
|
| bucket_store_if
|
| EventuallyPersistentEngine::store_if
|
| EventuallyPersistentEngine::storeIfInner
|
| KVBucket::set
|
| VBucket::set
|
| DcpConnMap::notifyVBConnections
|
| DcpProducer::notifySeqnoAvailable
|
| cb::RWLock::readerLock
|
| cb_rw_reader_enter
|
| pthread_rwlock_rdlock
|
|
|
--1.83%--StateMachine::conn_ship_log
|
non-virtual thunk to EventuallyPersistentEngine::step(gsl::not_null<void const*>, gsl::not_null<dcp_message_producers*>)
|
DcpProducer::step
|
DcpProducer::getNextItem
|
cb::RWLock::readerLock
|
cb_rw_reader_enter
|
pthread_rwlock_rdlock
|
|
0.00% 3.03% 0.14% 0.00% 0x4 0x7f9c23c670ac 0 135 184 200 40 [.] pthread_rwlock_rdlock libpthread-2.17.so .:0 0{40 100.0% 100.0%} 1{40 0.0% 0.0%}
|
|
|
---__clone
|
start_thread
|
platform_thread_wrap
|
worker_libevent
|
event_base_loop
|
event_process_active_single_queue.isra.26
|
event_handler
|
run_event_loop
|
Connection::runEventLoop
|
Connection::runStateMachinery
|
|
|
|--0.20%--StateMachine::conn_execute
|
| Cookie::execute
|
| execute_client_request_packet
|
| set_executor
|
| SteppableCommandContext::drive
|
| MutationCommandContext::step
|
| MutationCommandContext::storeItem
|
| bucket_store_if
|
| EventuallyPersistentEngine::store_if
|
| EventuallyPersistentEngine::storeIfInner
|
| KVBucket::set
|
| VBucket::set
|
| DcpConnMap::notifyVBConnections
|
| DcpProducer::notifySeqnoAvailable
|
| cb::RWLock::readerLock
|
| cb_rw_reader_enter
|
| pthread_rwlock_rdlock
|
|
|
--0.00%--StateMachine::conn_ship_log
|
non-virtual thunk to EventuallyPersistentEngine::step(gsl::not_null<void const*>, gsl::not_null<dcp_message_producers*>)
|
DcpProducer::step
|
DcpProducer::getNextItem
|
cb::RWLock::readerLock
|
cb_rw_reader_enter
|
pthread_rwlock_rdlock
|
|
4.55% 0.00% 1.24% 25.00% 0x4 0x7f9c23c675b3 591 0 184 59 32 [.] pthread_rwlock_unlock libpthread-2.17.so .:0 0{32 0.0% 63.6%} 1{32 100.0% 36.4%}
|
|
|
---__clone
|
start_thread
|
platform_thread_wrap
|
worker_libevent
|
event_base_loop
|
event_process_active_single_queue.isra.26
|
event_handler
|
run_event_loop
|
Connection::runEventLoop
|
Connection::runStateMachinery
|
|
|
|--1.30%--StateMachine::conn_execute
|
| Cookie::execute
|
| execute_client_request_packet
|
| set_executor
|
| SteppableCommandContext::drive
|
| MutationCommandContext::step
|
| MutationCommandContext::storeItem
|
| bucket_store_if
|
| EventuallyPersistentEngine::store_if
|
| EventuallyPersistentEngine::storeIfInner
|
| KVBucket::set
|
| VBucket::set
|
| DcpConnMap::notifyVBConnections
|
| DcpProducer::notifySeqnoAvailable
|
| cb::RWLock::readerUnlock
|
| cb_rw_reader_exit
|
| pthread_rwlock_unlock
|
|
|
--0.14%--StateMachine::conn_ship_log
|
non-virtual thunk to EventuallyPersistentEngine::step(gsl::not_null<void const*>, gsl::not_null<dcp_message_producers*>)
|
DcpProducer::step
|
DcpProducer::getNextItem
|
cb::RWLock::readerUnlock
|
cb_rw_reader_exit
|
pthread_rwlock_unlock
|
|
4.55% 3.03% 0.00% 0.00% 0x14 0x7f9c23c67068 320 128 195 227 40 [.] pthread_rwlock_rdlock libpthread-2.17.so .:0 0{40 50.0% n/a} 1{40 50.0% n/a}
|
|
|
---__clone
|
start_thread
|
platform_thread_wrap
|
worker_libevent
|
event_base_loop
|
event_process_active_single_queue.isra.26
|
event_handler
|
run_event_loop
|
Connection::runEventLoop
|
Connection::runStateMachinery
|
|
|
|--0.06%--StateMachine::conn_execute
|
| Cookie::execute
|
| execute_client_request_packet
|
| set_executor
|
| SteppableCommandContext::drive
|
| MutationCommandContext::step
|
| MutationCommandContext::storeItem
|
| bucket_store_if
|
| EventuallyPersistentEngine::store_if
|
| EventuallyPersistentEngine::storeIfInner
|
| KVBucket::set
|
| VBucket::set
|
| DcpConnMap::notifyVBConnections
|
| DcpProducer::notifySeqnoAvailable
|
| cb::RWLock::readerLock
|
| cb_rw_reader_enter
|
| pthread_rwlock_rdlock
|
|
|
--0.00%--StateMachine::conn_ship_log
|
non-virtual thunk to EventuallyPersistentEngine::step(gsl::not_null<void const*>, gsl::not_null<dcp_message_producers*>)
|
DcpProducer::step
|
DcpProducer::getNextItem
|
cb::RWLock::readerLock
|
cb_rw_reader_enter
|
pthread_rwlock_rdlock
|
|
18.18% 12.12% 0.00% 0.00% 0x18 0x7f9c23c67061 379 108 197 215 40 [.] pthread_rwlock_rdlock libpthread-2.17.so .:0 0{40 75.0% n/a} 1{40 25.0% n/a}
|
|
|
---__clone
|
start_thread
|
platform_thread_wrap
|
worker_libevent
|
event_base_loop
|
event_process_active_single_queue.isra.26
|
event_handler
|
run_event_loop
|
Connection::runEventLoop
|
Connection::runStateMachinery
|
|
|
|--0.05%--StateMachine::conn_execute
|
| Cookie::execute
|
| execute_client_request_packet
|
| set_executor
|
| SteppableCommandContext::drive
|
| MutationCommandContext::step
|
| MutationCommandContext::storeItem
|
| bucket_store_if
|
| EventuallyPersistentEngine::store_if
|
| EventuallyPersistentEngine::storeIfInner
|
| KVBucket::set
|
| VBucket::set
|
| DcpConnMap::notifyVBConnections
|
| DcpProducer::notifySeqnoAvailable
|
| cb::RWLock::readerLock
|
| cb_rw_reader_enter
|
| pthread_rwlock_rdlock
|
|
|
--0.00%--StateMachine::conn_ship_log
|
non-virtual thunk to EventuallyPersistentEngine::step(gsl::not_null<void const*>, gsl::not_null<dcp_message_producers*>)
|
DcpProducer::step
|
DcpProducer::getNextItem
|
cb::RWLock::readerLock
|
cb_rw_reader_enter
|
pthread_rwlock_rdlock
|
|
4.55% 12.12% 0.00% 0.00% 0x18 0x7f9c23c675ad 521 227 192 57 33 [.] pthread_rwlock_unlock libpthread-2.17.so .:0 0{33 80.0% n/a} 1{33 20.0% n/a}
|
|
|
---__clone
|
start_thread
|
platform_thread_wrap
|
worker_libevent
|
event_base_loop
|
event_process_active_single_queue.isra.26
|
event_handler
|
run_event_loop
|
Connection::runEventLoop
|
Connection::runStateMachinery
|
|
|
|--0.01%--StateMachine::conn_execute
|
| Cookie::execute
|
| execute_client_request_packet
|
| set_executor
|
| SteppableCommandContext::drive
|
| MutationCommandContext::step
|
| MutationCommandContext::storeItem
|
| bucket_store_if
|
| EventuallyPersistentEngine::store_if
|
| EventuallyPersistentEngine::storeIfInner
|
| KVBucket::set
|
| VBucket::set
|
| DcpConnMap::notifyVBConnections
|
| DcpProducer::notifySeqnoAvailable
|
| cb::RWLock::readerUnlock
|
| cb_rw_reader_exit
|
| pthread_rwlock_unlock
|
|
|
--0.00%--StateMachine::conn_ship_log
|
non-virtual thunk to EventuallyPersistentEngine::step(gsl::not_null<void const*>, gsl::not_null<dcp_message_producers*>)
|
DcpProducer::step
|
DcpProducer::getNextItem
|
cb::RWLock::readerUnlock
|
cb_rw_reader_exit
|
pthread_rwlock_unlock
|
|
0.00% 0.00% 0.14% 0.00% 0x18 0x7f9c23c675b8 0 0 0 1 1 [.] pthread_rwlock_unlock libpthread-2.17.so .:0 1{ 1 n/a 100.0%}
|
|
|
---__clone
|
start_thread
|
platform_thread_wrap
|
worker_libevent
|
event_base_loop
|
event_process_active_single_queue.isra.26
|
event_handler
|
run_event_loop
|
Connection::runEventLoop
|
Connection::runStateMachinery
|
StateMachine::conn_execute
|
Cookie::execute
|
execute_client_request_packet
|
set_executor
|
SteppableCommandContext::drive
|
MutationCommandContext::step
|
MutationCommandContext::storeItem
|
bucket_store_if
|
EventuallyPersistentEngine::store_if
|
EventuallyPersistentEngine::storeIfInner
|
KVBucket::set
|
VBucket::set
|
DcpConnMap::notifyVBConnections
|
DcpProducer::notifySeqnoAvailable
|
cb::RWLock::readerUnlock
|
cb_rw_reader_exit
|
pthread_rwlock_unlock
|
|
0.00% 18.18% 0.00% 0.00% 0x1c 0x7f9c23c67608 0 155 174 9 8 [.] pthread_rwlock_unlock libpthread-2.17.so .:0 0{ 8 66.7% n/a} 1{ 8 33.3% n/a}
|
|
|
---__clone
|
start_thread
|
platform_thread_wrap
|
worker_libevent
|
event_base_loop
|
event_process_active_single_queue.isra.26
|
event_handler
|
run_event_loop
|
Connection::runEventLoop
|
Connection::runStateMachinery
|
|
|
|--0.00%--StateMachine::conn_execute
|
| Cookie::execute
|
| execute_client_request_packet
|
| set_executor
|
| SteppableCommandContext::drive
|
| MutationCommandContext::step
|
| MutationCommandContext::storeItem
|
| bucket_store_if
|
| EventuallyPersistentEngine::store_if
|
| EventuallyPersistentEngine::storeIfInner
|
| KVBucket::set
|
| VBucket::set
|
| DcpConnMap::notifyVBConnections
|
| DcpProducer::notifySeqnoAvailable
|
| cb::RWLock::readerUnlock
|
| cb_rw_reader_exit
|
| pthread_rwlock_unlock
|
|
|
--0.00%--StateMachine::conn_ship_log
|
non-virtual thunk to EventuallyPersistentEngine::step(gsl::not_null<void const*>, gsl::not_null<dcp_message_producers*>)
|
DcpProducer::step
|
DcpProducer::getNextItem
|
cb::RWLock::readerUnlock
|
cb_rw_reader_exit
|
pthread_rwlock_unlock
|
|
0.00% 12.12% 0.00% 0.00% 0x1c 0x7f9c23c670ba 0 249 0 4 4 [.] pthread_rwlock_rdlock libpthread-2.17.so .:0 1{ 4 100.0% n/a}
|
|
|
---__clone
|
start_thread
|
platform_thread_wrap
|
worker_libevent
|
event_base_loop
|
event_process_active_single_queue.isra.26
|
event_handler
|
run_event_loop
|
Connection::runEventLoop
|
Connection::runStateMachinery
|
StateMachine::conn_execute
|
Cookie::execute
|
execute_client_request_packet
|
set_executor
|
SteppableCommandContext::drive
|
MutationCommandContext::step
|
MutationCommandContext::storeItem
|
bucket_store_if
|
EventuallyPersistentEngine::store_if
|
EventuallyPersistentEngine::storeIfInner
|
KVBucket::set
|
VBucket::set
|
DcpConnMap::notifyVBConnections
|
DcpProducer::notifySeqnoAvailable
|
cb::RWLock::readerLock
|
cb_rw_reader_enter
|
pthread_rwlock_rdlock
|
|
9.09% 0.00% 0.00% 0.00% 0x1c 0x7f9c23c670d5 382 0 173 11 12 [.] pthread_rwlock_rdlock libpthread-2.17.so .:0 0{12 50.0% n/a} 1{12 50.0% n/a}
|
|
|
---__clone
|
start_thread
|
platform_thread_wrap
|
worker_libevent
|
event_base_loop
|
event_process_active_single_queue.isra.26
|
event_handler
|
run_event_loop
|
Connection::runEventLoop
|
Connection::runStateMachinery
|
StateMachine::conn_execute
|
Cookie::execute
|
execute_client_request_packet
|
set_executor
|
SteppableCommandContext::drive
|
MutationCommandContext::step
|
MutationCommandContext::storeItem
|
bucket_store_if
|
EventuallyPersistentEngine::store_if
|
EventuallyPersistentEngine::storeIfInner
|
KVBucket::set
|
VBucket::set
|
DcpConnMap::notifyVBConnections
|
DcpProducer::notifySeqnoAvailable
|
cb::RWLock::readerLock
|
cb_rw_reader_enter
|
pthread_rwlock_rdlock
|
Tested using build 2384. This cache line is contentious.
-------------------------------------------------------------
|
1 1209 839 1863 0 0x7cec80
|
-------------------------------------------------------------
|
96.77% 97.74% 0.00% 0.00% 0x30 0x48fcbd 296 145 161 2625 40 [.] update_topkeys memcached unique_ptr.h:147 0{40 56.3% n/a} 1{40 43.7% n/a}
|
3.23% 2.26% 100.00% 0.00% 0x38 0x48e1c0 402 182 225 5269 40 [.] mcbp_add_header memcached atomic_base.h:514 0{40 63.8% 57.8%} 1{40 36.2% 42.2%}
|
The issue is in Bucket.
(gdb) pahole Bucket
|
/* 179352 */ struct Bucket {
|
/* 0 40 */ std::mutex mutex
|
/* 40 48 */ std::condition_variable cond
|
/* 88 4 */ unsigned int clients
|
/* 92 1 */ std::atomic<Bucket::State> state
|
/* 93 1 */ Bucket::Type type
|
/* 94 101 */ char [101] name
|
/* XXX 40 bit hole, try to pack */
|
/* 200 96 */ std::array<std::vector<engine_event_handler, std::allocator<engine_event_handler> >, 4> engine_event_handlers
|
/* 296 24 */ std::vector<thread_stats, std::allocator<thread_stats> > stats
|
/* 320 176504 */ Timings timings
|
/* 176824 672 */ TimingHistogram subjson_operation_times
|
/* 177496 8 */ std::unique_ptr<TopKeys, std::default_delete<TopKeys> > topkeys
|
/* 177504 1704 */ std::array<Couchbase::RelaxedAtomic<unsigned long>, 213> responseCounters
|
/* 179208 64 */ ClusterConfiguration clusterConfiguration
|
/* 179272 8 */ unsigned long max_document_size
|
/* 179280 56 */ std::unordered_set<cb::engine::Feature, std::hash<cb::engine::Feature>, std::equal_to<cb::engine::Feature>, std::allocator<cb::engine::Feature> > supportedFeatures
|
/* 179336 8 */ EngineIface * engine
|
/* 179344 8 */ DcpIface * bucketDcp
|
}
|
Not printing cache lines because timings is ludicrously big, but the topkeys unique_ptr shares a cache line with the responseCounters array. The responseCounters array is full of atomics used to track the number of responses of each type. The first (and almost certainly the hottest) counter is SUCCESS. It is sharing with topkeys, and as topkeys is only read in the sample, this is certainly false sharing. Will move topkeys to just after name, where it will share a cache line with the end of name and the start of engine_event_handlers.
(gdb) pahole Bucket
|
/* 179352 */ struct Bucket {
|
/* 0 40 */ std::mutex mutex
|
/* 40 48 */ std::condition_variable cond
|
/* 88 4 */ unsigned int clients
|
/* 92 1 */ std::atomic<Bucket::State> state
|
/* 93 1 */ Bucket::Type type
|
/* 94 101 */ char [101] name
|
/* XXX 40 bit hole, try to pack */
|
/* 200 8 */ std::unique_ptr<TopKeys, std::default_delete<TopKeys> > topkeys
|
/* 208 96 */ std::array<std::vector<engine_event_handler, std::allocator<engine_event_handler> >, 4> engine_event_handlers
|
/* 304 24 */ std::vector<thread_stats, std::allocator<thread_stats> > stats
|
/* 328 176504 */ Timings timings
|
/* 176832 672 */ TimingHistogram subjson_operation_times
|
/* 177504 1704 */ std::array<Couchbase::RelaxedAtomic<unsigned long>, 213> responseCounters
|
/* 179208 64 */ ClusterConfiguration clusterConfiguration
|
/* 179272 8 */ unsigned long max_document_size
|
/* 179280 56 */ std::unordered_set<cb::engine::Feature, std::hash<cb::engine::Feature>, std::equal_to<cb::engine::Feature>, std::allocator<cb::engine::Feature> > supportedFeatures
|
/* 179336 8 */ EngineIface * engine
|
/* 179344 8 */ DcpIface * bucketDcp
|
}
|
I think the rwlock cache line is normal contention when trying to acquire the rwlock guarding the streams map in the DcpProducer. The only other option would be contention on one of the locks in a StreamContainer; however, we should see a contentious cache line for each lock (one per StreamContainer, i.e. per active vb) if that were the case. As a possible solution, we could "fix" this by removing the lock guarding the streams map. Each StreamContainer has its own rwlock that we could rely on instead, splitting the contention over a number of cache lines equal to the number of active vbs. This would require a bit of rework to allow a StreamContainer to exist empty, and we would have to pre-populate the streams map with every vb and an empty StreamContainer (a rough sketch of the idea follows).
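A very rough, self-contained sketch of that direction, with assumed names and shapes rather than the actual StreamContainer or streams map: pre-populate one possibly-empty container per vBucket, each with its own lock, so a stream lookup never takes a map-wide lock and contention is spread across vBuckets. std::mutex stands in for the per-container rwlock.
#include <memory>
#include <mutex>
#include <vector>

struct StreamSketch {};

class PerVbStreamContainersSketch {
public:
    explicit PerVbStreamContainersSketch(size_t numVbs) : containers(numVbs) {}

    void addStream(size_t vbid, std::shared_ptr<StreamSketch> stream) {
        auto& c = containers[vbid];
        std::lock_guard<std::mutex> guard(c.lock);
        c.streams.push_back(std::move(stream));
    }

    std::vector<std::shared_ptr<StreamSketch>> getStreams(size_t vbid) const {
        const auto& c = containers[vbid];
        std::lock_guard<std::mutex> guard(c.lock);
        return c.streams;
    }

private:
    // Each container may be empty; there is no map-wide lock because the set of
    // containers is sized once up front and never resized.
    struct Container {
        mutable std::mutex lock;  // stand-in for the per-container rwlock
        std::vector<std::shared_ptr<StreamSketch>> streams;
    };
    std::vector<Container> containers;
};

int main() {
    PerVbStreamContainersSketch streams(1024);  // one (possibly empty) container per vb
    streams.addStream(0, std::make_shared<StreamSketch>());
    return streams.getStreams(0).size() == 1 ? 0 : 1;
}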
Ben Huddleston This MB is getting pretty long (and has diverged from the original problem). Suggest you resolve this one (assuming the DcpProducer layout change did fix the initial regression and it's just that we have new issues now) and open new MB(s) for each specific new issue.
Created a few separate issues to track performance improvements I've identified as part of looking into perf c2c data and the DcpProducer in general. (MB-33151, MB-33153, MB-33157).
The fix for this regression did bring performance back to the old value; however, by the time it was merged the "long pole" in the DcpProducer had shifted, so the fix no longer improves performance. Will fix the other issues mentioned and see if performance improves to previous values. MB-32389 is a bucket ticket for a general perf regression in kv_engine which may be of interest. Will resolve this issue as the regression has been fixed (the contentious cache line which was introduced is no longer present as of build 2202).
Technically this regression is fixed: the symptom found in the perf c2c profile is no longer present. However, the long pole for perf in the write-heavy test has shifted since this regression was fixed, so the perf numbers have not improved.
Changelog between 1545 and 1561 is pretty long - will need binary chopping to isolate: http://172.23.123.43:8000/getchangelog?product=couchbase-server&fromb=6.5.0-1545&tob=6.5.0-1561