I re-ran the two tests 8 times; the supposed regression amounts to ~3k ops/s, which could just be run-to-run variation.
| 1690 ops/s | 1691 ops/s |
| ---------- | ---------- |
|    2558629 |    2504171 |
|    2555644 |    2543146 |
|    2582902 |    2538969 |
|    2475998 |    2548453 |
|    2513313 |    2517831 |
|    2555190 |    2556105 |
|    2544343 |    2550765 |
|    2537839 |    2541184 |

Averages
========
| 2,540,482.25 | 2,537,578 |
Investigation
I pulled the log files for the test runs given in the description to do some investigating; I wanted to know how much we were logging.
From build 1690.
EMEAUKMAC01:1690 benhuddleston$ find . -name 'memcached*' | xargs wc -l
    193 ./cbcollect_info_ns_1@172.23.133.11_20181217-195405/memcached.log
    197 ./cbcollect_info_ns_1@172.23.133.12_20181217-195405/memcached.log
  39836 ./cbcollect_info_ns_1@172.23.133.14_20181217-195404/memcached.log
  39516 ./cbcollect_info_ns_1@172.23.133.13_20181217-195404/memcached.log
  79742 total
From build 1691.
EMEAUKMAC01:1691 benhuddleston$ find . -name 'memcached*' | xargs wc -l
  37881 ./cbcollect_info_ns_1@172.23.133.14_20181217-192549/memcached.log
  37830 ./cbcollect_info_ns_1@172.23.133.13_20181217-192549/memcached.log
    209 ./cbcollect_info_ns_1@172.23.133.12_20181217-192549/memcached.log
    203 ./cbcollect_info_ns_1@172.23.133.11_20181217-192549/memcached.log
  76123 total
So we actually logged less in build 1691. That seems a bit odd given that the test took longer, but we log a lot during compaction, and from the report graphs it looks like we started an extra compaction in run 1690. Given that the changes made between builds have nothing to do with compaction, I won't worry about that.

From looking at the changes above, the only way I can see performance being slower is if the buffer fills up at some point and we block. Given the volume of messages on nodes 3 and 4, I took a look at the distribution of log messages over the course of the test. Using the command
awk '{print substr($0,1,19)}' memcached.log | uniq -c | sort -k 2
I can see how many messages we log every second. The timestamp is written into the log message when we sink it, i.e. when the async logging thread writes it to one of its outputs. The number of messages sunk per second spiked early, when we set up DCP connections before starting the test (~4000 in one second), then dropped to an average of ~20-30 per second on both nodes.
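As a variant, the same per-second bucketing can be done in a single awk pass, sorted by count so the spikes stand out (a sketch, run against the same memcached.log):

# Bucket lines by their one-second timestamp prefix, then list the busiest seconds
awk '{ c[substr($0, 1, 19)]++ } END { for (t in c) print c[t], t }' memcached.log \
    | sort -rn | head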
The regression still didn't make much sense to me, so I thought about why the buffer might fill up now when it didn't previously (prior to the upgrade from spdlog 0.14.0 to 1.1.0). The async logger in spdlog 0.14.0 uses a fixed-length 2MB buffer; async log messages containing a std::string end up being written into a queue backed by that 2MB buffer. The async logger in spdlog 1.1.0 instead creates a buffer of 176 chars for every message (176 bytes plus a little extra for management); anything longer gets allocated elsewhere by fmtlib, but that's not a concern here. These messages are then written into a queue that is essentially a vector of (as of build 1691) 8192 async log messages.

Okay, so if our messages are smaller than 176 bytes then it looks like we'll be wasting a bit of space. But how long are our messages? We can print a rolling average of the length of each line in the log using this command
awk '{total += length; num++; print total/num}' memcached.log
The average for nodes 3 and 4 in both builds was 161-162 characters, whilst the average for nodes 1 and 2 was 91-93. When I first checked these averages I only looked at nodes 1 and 2, so with an average of about 90 characters per log message we'd effectively be halving the capacity of our buffer (a quick back-of-envelope check is sketched below). I created a build that doubles the size of the buffer and tested it; results follow.
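To make the halving claim concrete, here is some rough arithmetic using the figures above (2MB old buffer, 176-byte slots, 8192 slots, ~90-character average message; per-message overhead is ignored, so treat these as approximations):

# spdlog 0.14.0: messages packed into one shared 2MB buffer
echo $((2 * 1024 * 1024 / 90))   # => ~23301 queued messages at ~90 chars each

# spdlog 1.1.0: fixed 8192 slots, each with its own 176-byte buffer
echo $((8192 * 176))             # => 1441792 bytes (~1.4MB) of slot space
echo $((8192 * 90))              # => 737280 bytes actually used at ~90 chars,
                                 #    i.e. roughly half of each slot is wasted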
| 1690 ops/s | 1691 ops/s | Doubled Buffer ops/s |
| ---------- | ---------- | -------------------- |
|    2558629 |    2504171 |              2522314 |
|    2555644 |    2543146 |              2553333 |
|    2582902 |    2538969 |              2554278 |
|    2475998 |    2548453 |              2524549 |
|    2513313 |    2517831 |              2658118 |
|    2555190 |    2556105 |              2542584 |
|    2544343 |    2550765 |              2561333 |
|    2537839 |    2541184 |              2520652 |

Averages
========
| 2,540,482.25 | 2,537,578 | 2,554,645.13 |
One of the results, 2,658,118, looks like a potential outlier; when I remove it I get an average of 2,539,863.29. So the change is probably not worth making, although I could do some more runs to verify.
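For reference, that recalculation can be reproduced with a one-liner (values copied from the Doubled Buffer column above):

# Drop the largest of the eight runs and average the remaining seven
printf '%s\n' 2522314 2553333 2554278 2524549 2658118 2542584 2561333 2520652 \
    | sort -n | head -7 | awk '{ s += $1 } END { printf "%.2f\n", s / NR }'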
One thing to note is that these builds include the "CouchKVStore::readCollectionsManifest: doc not found" message, which makes up roughly 20-25% of the logs on nodes 3 and 4. If we remove it, the average length of our log messages from these nodes is 177-179 characters. This may be a concern, as we do extra allocations for log messages over 176 characters in length. Out of interest, I grabbed some details about the lengths of log messages using the command
awk '{print length}' memcached.log | sort | uniq -c | sort -k 2 -n >> logLengthDistribution.txt
Column 1 is the number of occurrences, column 2 is the length of the line: logLengthDistribution.txt
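Related to the over-176 concern above, two quick follow-up checks can be run against the same logs (a sketch; the grep pattern is assumed to match the full message text):

# Fraction of log lines longer than the 176-char per-message buffer
awk '{ n++; if (length > 176) over++ } END { print over, "of", n, "lines =", over / n }' memcached.log

# Average line length with the readCollectionsManifest messages filtered out
grep -v 'CouchKVStore::readCollectionsManifest' memcached.log \
    | awk '{ total += length; num++ } END { print total / num }'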
Assigning to Ben as he made that change. Can you take an initial look this morning, then we can discuss it this afternoon?