In my latest tests I have been repeating the Rebalance and the DataCopy runs (mentioned at points (1) and (2) in my previous comment) against a "fast" DCP Consumer.
Patch http://review.couchbase.org/c/kv_engine/+/134989 implements the fast Consumer. It does two things:
- An incoming DCP Mutation is not actually processed. memcached just increments the Item Count and the High Persisted Seqno for the owning VBucket.
- memcached handles Seqno Persistence requests by sending back a response based on the "fake" High Persisted Seqno.
Essentially, (1) "implements" a fast DCP Consumer by removing most of the code that is usually executed at the Consumer for a DCP Mutation, while (2) is necessary for keeping ns_server happy at Rebalance.
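For illustration only, a minimal sketch of the idea; the types and function names below are made up for this comment, they are not the actual kv_engine code:

    #include <cstdint>

    // Per-VBucket "fake" state kept by the fast Consumer.
    struct FakeVBucketCounters {
        uint64_t itemCount = 0;
        uint64_t highPersistedSeqno = 0;
    };

    // (1) An incoming DCP Mutation is acknowledged but not processed: we only
    // bump the Item Count and pretend the seqno is already persisted.
    void onDcpMutation(FakeVBucketCounters& vb, uint64_t bySeqno) {
        ++vb.itemCount;
        vb.highPersistedSeqno = bySeqno;
    }

    // (2) A Seqno Persistence request is answered from the fake High Persisted
    // Seqno, which is what keeps ns_server happy at Rebalance.
    bool isSeqnoPersisted(const FakeVBucketCounters& vb, uint64_t seqno) {
        return vb.highPersistedSeqno >= seqno;
    }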
NOTE: What is described above clearly disables persistence at the destination, which may be seen as altering the test result. It actually does not: as detailed in the "DCP - MEM" doc, tests have been run with larger item sizes that push the throughput (MB/s) to much higher values, and persistence keeps up perfectly in those tests too. Also, persistence at the destination is disabled for both Rebalance and DataCopy, so we have a fair comparison here.
---------------------------
UPDATE 03/09/2020
I have been investigating an unexpectedly high amount of data streamed during the Rebalance test against our modified "fast consumer". I found that disabling persistence at the destination causes rollback/re-stream of 2 vbuckets out of 4, which pushes the total data streamed to ~3GB (rather than the 2GB of the mainstream Rebalance). As the actual profiling shows, the real throughput at Rebalance is ~80 MB/s (rather than 55 MB/s). I have updated the value in the table below.
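As a rough sanity check (assuming the wall-clock time of the Rebalance is roughly unchanged): streaming ~3GB instead of ~2GB in the same time scales the effective throughput by 3/2, i.e. 55 MB/s * 3/2 ≈ 82 MB/s, which is consistent with the ~80 MB/s measured in the profiling.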
Note that the general outcome doesn't change. The DCP Consumer in memcached appears to be the first bottleneck that we hit. As soon as we improve that, we see a speedup at both Rebalance and DataCopy. DataCopy scales much better (3x) than Rebalance (<2x), which is an indication that at some point we would hit the ns_server proxy bottleneck. That point has now shifted from 55 MB/s to 80 MB/s.
I am adding this update and making only minor changes to the original message below, to emphasize that the general outcome is still valid.
---------------------------
Results - Comparison between vbucket-copy via ns_server (Rebalance) vs cluster_test (DataCopy)
Baseline - Cheshire Cat

| Test      | Throughput | DCP Consumer thread CPU Util |
| Rebalance | 45 MB/s    | 85%                          |
| DataCopy  | 60 MB/s    | 100%                         |
Fast DCP Consumer (Cheshire Cat + http://review.couchbase.org/c/kv_engine/+/134989)

| Test      | Throughput                                     | DCP Consumer thread CPU Util |
| Rebalance | 80 MB/s (updated from 55 MB/s, see update above) | 55%                        |
| DataCopy  | 150 MB/s                                       | 90%                          |
Comments:
- The performance at Rebalance improves only marginally. I see CPU underutilization in memcached at the destination, which suggests that some component further back in the stack is slowing us down. memcached at the source has been shown capable of backfilling/sending at ~175 MB/s on the same test/env, so the finger points at the ns_server proxy.
- The performance at DataCopy improves considerably. As already mentioned, the only difference here from Rebalance is that replication goes over our ClusterTest proxy rather than the ns_server proxy. We don't saturate CPU utilization at the destination, but we achieve a much higher value than what is seen at Rebalance.
From what we see here, while the DCP Consumer is the first bottleneck that we hit, even with small improvements to it we would quickly hit the ns_server proxy limit. As such, we would also need to address the ns_server bottleneck to see any significant improvement for the end user.
Note that the linux-perf profiling of the DCP Consumer doesn't spot any evident suboptimal code-path, so for now only minor improvements seem possible in memcached. Linux perf data attached (fill.perf.script, ready for visualization on Speedscope).
Dave Finlay, it would be interesting to hear ns_server's opinion/validation on the results described here.
Steps for reproducing the Rebalance test:
- checkout couchbase/master + cherry-pick http://review.couchbase.org/c/kv_engine/+/134989
- export COUCHBASE_NUM_VBUCKETS=4 && ./cluster_run -n 2 --start-index=10
- ./couchbase-cli cluster-init --cluster=localhost:9010 --cluster-username=admin --cluster-password=admin1 --services=data --cluster-ramsize=20480
- ./couchbase-cli bucket-create -c localhost:9010 -u admin -p admin1 --bucket=example --bucket-type=couchbase --bucket-ramsize=20480 --bucket-replica=1 --bucket-eviction-policy=fullEviction --enable-flush=1 --wait
- cbc-pillowfight --spec="couchbase://127.0.0.1:12020/example" --username=admin --password=admin1 --batch-size=1000 --num-threads=4 --set-pct=100 --min-size=1024 --max-size=1024 --random-body --populate-only --num-items=2000000
- ./couchbase-cli server-add --cluster=http://127.0.0.1:9010 --username=admin --password=admin1 --server-add=127.0.0.1:9011 --server-add-username=admin --server-add-password=admin1
- time ./couchbase-cli rebalance -c localhost:9010 -u admin -p admin1
The Rebalance above simply results in 4 vbucket-copies from n_0 to n_1, which is exactly what we reproduce in our DataCopy test with the same default cluster/bucket configuration.
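Note on data size: the cbc-pillowfight step above loads 2,000,000 items of 1 KiB each, i.e. roughly 2GB of values, which matches the ~2GB expected to be streamed at Rebalance (keys and DCP protocol overhead excluded).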
Thank you,
Paolo
Hi Shivani Gupta,
The Consumer stops processing messages because the mem-usage reaches the Replication Threshold (99% of the bucket quota by default). That is part of memcached's resource-utilization control.
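As a minimal sketch of that control (illustrative names only, this is not the actual kv_engine implementation):

    #include <cstddef>

    // Replication traffic is paused once bucket memory usage crosses the
    // Replication Threshold (99% of the bucket quota by default) and resumed
    // once it drops back below it.
    bool consumerShouldPause(std::size_t memUsed,
                             std::size_t bucketQuota,
                             double replicationThreshold = 0.99) {
        return memUsed >=
               static_cast<std::size_t>(bucketQuota * replicationThreshold);
    }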
In a scenario like the one described above (where we have already ejected everything from the HashTable), one possibility is to release Checkpoint memory more aggressively, in particular for Replica vbuckets. That is why I referred to Item Expel (from Checkpoints) in my previous message.
The idea is that it may help with recovering from the high mem-usage more quickly, so that the Consumer drops below the Replication Threshold and resumes ingesting messages more promptly. I've experimented with a similar approach recently in MB-38981 and that gave interesting results, so I will surely experiment with it here too.
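To make the idea concrete, here is a minimal sketch of expelling already-persisted items from the front of a Checkpoint; the types and names are illustrative, and it ignores cursors and all the real bookkeeping:

    #include <cstddef>
    #include <cstdint>
    #include <deque>

    struct QueuedItem {
        uint64_t bySeqno;
        std::size_t memFootprint; // approximate memory held by the queued item
    };

    // Release from the front of the Checkpoint every item that has already
    // been persisted; the freed memory helps the Consumer drop back below the
    // Replication Threshold (and resume ingesting) sooner.
    std::size_t expelPersistedItems(std::deque<QueuedItem>& checkpoint,
                                    uint64_t highPersistedSeqno) {
        std::size_t freed = 0;
        while (!checkpoint.empty() &&
               checkpoint.front().bySeqno <= highPersistedSeqno) {
            freed += checkpoint.front().memFootprint;
            checkpoint.pop_front();
        }
        return freed;
    }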
As you know, DCP may have multiple bottlenecks. In scenarios where you backfill massively, the bottleneck is usually the backfill throughput (i.e., disk reads) at the Producer. Here, instead, the Producer streams fast and the bottleneck is the high mem-usage at the Consumer.
I've started back from simpler tests to check how DCP performs when there is not too much memory pressure. That relates to MB-29325 too.