Couchbase Server / MB-36370

HiDD: Scalable DCP and disk backfill performance


Details

    • KV: DCP Scalability
    • To Do

    Description

      We have observed that DCP throughput per server in a two-node cluster maxes out at 70K/s, and at 10K/s on disk backfill. This primarily becomes a bottleneck when the storage engine is faster than DCP. Since replicas depend on DCP for replication, the overall write throughput of the system is capped by the DCP replication throughput.
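      As a hypothetical illustration of that cap: even if the storage engine could absorb, say, 200K writes/s (a made-up figure), a DCP replication ceiling of 70K/s limits the sustainable, fully replicated write rate to roughly min(200K, 70K) = 70K writes/s per server.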

      Attachments

        1. 25 - dcp_replica.png (246 kB)
        2. 26 - disk-queue fill-drain.png (334 kB)
        3. 26 - mem_usage.png (401 kB)
        4. fill.perf.script (2.31 MB)


          Activity

            paolo.cocchi Paolo Cocchi added a comment - - edited

            Hi Shivani Gupta,

            "What is the fix for the above? Why does the DCP consumer stop processing messages?"

            The Consumer stops processing messages because the mem-usage reaches the Replication Threshold (99% of the bucket quota by default). That is part of memcached's resource-utilization control.
            In a scenario like the one described above (where we have already ejected everything from the HashTable), one possibility is to release Checkpoint memory more aggressively, in particular for Replica vbuckets. That is why I referred to Item Expel (from Checkpoints) in my previous message.
            The idea is that it may help the Consumer recover from high mem-usage more quickly, so that it drops below the Replication Threshold and resumes ingesting messages more promptly. I experimented with a similar approach recently at MB-38981 and it gave interesting results, so I will certainly experiment with it here too.
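
            To make the mechanism concrete, here is a minimal C++ sketch of the kind of check involved. It is illustrative only and does not reflect the actual kv_engine code; the names (ReplicationThrottleSketch, shouldProcess) and the exact accounting are assumptions:

            // Illustrative sketch only, not the actual kv_engine implementation.
            #include <cstddef>

            class ReplicationThrottleSketch {
            public:
                ReplicationThrottleSketch(size_t bucketQuota, double ratio = 0.99)
                    : bucketQuota(bucketQuota), ratio(ratio) {}

                // The DCP Consumer would consult a check like this before processing
                // the next buffered message; at or above the threshold it pauses.
                bool shouldProcess(size_t memUsed) const {
                    return memUsed < static_cast<size_t>(bucketQuota * ratio);
                }

            private:
                const size_t bucketQuota; // bucket memory quota, in bytes
                const double ratio;       // Replication Threshold, 99% by default
            };

            // Expelling already-persisted items from Checkpoints (Item Expel) lowers
            // memUsed, so shouldProcess() returns true again sooner and the Consumer
            // resumes ingesting messages more promptly.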

            As you know, DCP may have multiple bottlenecks. In scenarios where you backfill massively, the bottleneck is usually the backfill throughput (ie, disk read) at the Producer.
            Here, instead, the Producer streams fast and the bottleneck is the high mem-usage at the Consumer.

            I've gone back to simpler tests to check how DCP performs when there is not too much memory pressure. That relates to MB-29325 too.


            build-team Couchbase Build Team added a comment -

            Build couchbase-server-7.0.0-2570 contains kv_engine commit 58e323a with commit message:
            MB-36370: Remove BackfillManager::bytesForceRead

            paolo.cocchi Paolo Cocchi added a comment - - edited

            In my latest tests I have been repeating the Rebalance and the DataCopy runs (mentioned at points (1) and (2) in my previous comment) against a "fast" DCP Consumer.

            Patch http://review.couchbase.org/c/kv_engine/+/134989 implements the fast Consumer. It does two things:

            1. An incoming DCP Mutation is not actually processed. memcached just increments the Item Count and the High Persisted Seqno for the owning VBucket.
            2. memcached handles Seqno Persistence requests by sending back a response based on the "fake" High Persisted Seqno.

            Essentially, (1) "implements" a fast DCP Consumer by removing most of the code that is usually executed at the Consumer for a DCP Mutation, while (2) is necessary for making ns_server happy at Rebalance (see the sketch below).
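
            As a rough illustration of points (1) and (2), the C++ sketch below shows the shape of such a fast Consumer. It is not the actual change at http://review.couchbase.org/c/kv_engine/+/134989; all names and types here are assumptions:

            // Illustrative sketch of the "fast Consumer" idea; names are assumptions.
            #include <atomic>
            #include <cstdint>

            struct FakeVBucketState {
                std::atomic<uint64_t> itemCount{0};
                std::atomic<uint64_t> highPersistedSeqno{0};
            };

            // (1) A DCP Mutation is acknowledged but never stored or persisted; only
            //     the per-vbucket counters are bumped. Assumes seqnos arrive in
            //     increasing order on the stream.
            void onDcpMutation(FakeVBucketState& vb, uint64_t seqno) {
                vb.itemCount.fetch_add(1, std::memory_order_relaxed);
                vb.highPersistedSeqno.store(seqno, std::memory_order_relaxed);
            }

            // (2) A Seqno Persistence request is answered from the "fake" High
            //     Persisted Seqno, which is what keeps ns_server happy at Rebalance.
            bool seqnoPersistenceSatisfied(const FakeVBucketState& vb, uint64_t seqno) {
                return vb.highPersistedSeqno.load(std::memory_order_relaxed) >= seqno;
            }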

            NOTE: What is described above clearly disables persistence at the destination, which might be thought to alter the test result. It actually does not: as detailed in the "DCP - MEM" doc, tests have been run with larger item sizes that push the throughput (MB/s) to much higher values, and persistence keeps up perfectly on those tests too. Also, persistence at the destination is disabled for both Rebalance and DataCopy, so we have a fair comparison here.

             

            ---------------------------
            UPDATE 03/09/2020
            I have been investigating an unexpectedly high amount of data streamed during the Rebalance test against our modified "fast consumer". I found that disabling persistence at the destination causes rollback/re-stream of 2 vbuckets out of 4, which pushes the total data streamed to ~3GB (rather than 2GB as in the mainstream Rebalance). As the actual profiling shows, the real throughput at Rebalance is ~80 MB/s (rather than 55 MB/s). I have updated the value in the table below.
            Note that the general outcome doesn't change. The DCP Consumer in memcached appears to be the first bottleneck that we hit. As soon as we improve that, we see a speedup at both Rebalance and DataCopy. DataCopy scales much better (3x) than Rebalance (<2x), which is an indication that at some point we would hit the ns_server proxy bottleneck. That point is now shifted from 55 MB/s to 80 MB/s.
            I am adding this update and making only minor changes to the original message below, to emphasize that the general outcome is still valid.
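            (For reference, and assuming the Rebalance wall-clock time is roughly unchanged, the corrected figure is consistent with simply scaling the old rate by the extra data streamed: 55 MB/s x (3 GB / 2 GB) ≈ 82 MB/s, i.e. roughly the ~80 MB/s measured by profiling.)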

            ---------------------------

             

            Results - Comparison between vbucket-copy via ns_server (Rebalance) vs cluster_test (DataCopy)

            Baseline - Cheshire Cat

            Test      | Throughput            | DCP Consumer thread CPU Util
            Rebalance | 45 MB/s               | 85%
            DataCopy  | 60 MB/s               | 100%

            Fast DCP Consumer (Cheshire Cat + http://review.couchbase.org/c/kv_engine/+/134989)

            Test      | Throughput            | DCP Consumer thread CPU Util
            Rebalance | 80 MB/s (was 55 MB/s) | 55%
            DataCopy  | 150 MB/s              | 90%

             

            Comments:

            • The performance at Rebalance improves marginally. I see CPU underutilization in memcached at the destination, which suggests that some component further back in the stack is slowing us down. Memcached at the source has been tested to be capable of backfilling/sending at ~175 MB/s on the same test/env, so the finger points at the ns_server proxy.
            • The performance at DataCopy improves considerably. As already mentioned, the only difference from Rebalance here is that replication goes over our ClusterTest proxy rather than the ns_server proxy. We don't saturate CPU utilization at the destination, but we achieve a much higher value than what is seen at Rebalance.

            From what we see here, while the DCP Consumer is the first bottleneck that we hit, even with small improvements to it we would quickly hit the ns_server proxy limit. As such, we would also need to address the ns_server bottleneck for the end user to see any significant improvement.
            Note that the linux-perf profiling of the DCP Consumer doesn't spot any evident suboptimal code-path, so for now only minor improvements seem possible in memcached. Linux perf data is attached (fill.perf.script, ready for visualization on Speedscope).

             

            Dave Finlay, it would be interesting to hear ns_server's opinion on / validation of the results described here.

            Steps for reproducing the Rebalance test:

            • checkout couchbase/master + cherry-pick http://review.couchbase.org/c/kv_engine/+/134989
            • export COUCHBASE_NUM_VBUCKETS=4 && ./cluster_run -n 2 --start-index=10
            • ./couchbase-cli cluster-init --cluster=localhost:9010 --cluster-username=admin --cluster-password=admin1 --services=data --cluster-ramsize=20480
            • ./couchbase-cli bucket-create -c localhost:9010 -u admin -p admin1 --bucket=example --bucket-type=couchbase --bucket-ramsize=20480 --bucket-replica=1 --bucket-eviction-policy=fullEviction --enable-flush=1 --wait
            • cbc-pillowfight --spec="couchbase://127.0.0.1:12020/example" --username=admin --password=admin1 --batch-size=1000 --num-threads=4 --set-pct=100 --min-size=1024 --max-size=1024 --random-body --populate-only --num-items=2000000
            • ./couchbase-cli server-add --cluster=http://127.0.0.1:9010 --username=admin --password=admin1 --server-add=127.0.0.1:9011 --server-add-username=admin --server-add-password=admin1
            • time ./couchbase-cli rebalance -c localhost:9010 -u admin -p admin1

            The Rebalance above simply serves to execute 4 vbucket-copies from n_0 to n_1, which is exactly what we reproduce in our DataCopy test with the same default cluster/bucket configuration.
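
            As a rough cross-check of the data volume: the pillowfight step loads about 2,000,000 items x 1 KB ≈ 2 GB of value data, which is consistent with the ~2 GB streamed by the mainstream Rebalance mentioned in the update above (keys and DCP framing add some overhead, so this is only an approximation).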

            Thank you,
            Paolo

            paolo.cocchi Paolo Cocchi added a comment -

            The latest patchset at http://review.couchbase.org/c/kv_engine/+/134989 fixes the re-stream/rollback issues that I mentioned in my previous comments.
            That ensures that the "fast consumer" Rebalance streams exactly the same amount of data as the mainstream Rebalance.
            As expected, the test result (ie, throughput in MB/s) doesn't change, as the buggy test was just streaming more data for a longer runtime.
            I have updated the charts at https://docs.google.com/spreadsheets/d/1i0sbcQIKXhveiW2qRrOrcZvv0nSex4nXReYD9KG-zT4.


            build-team Couchbase Build Team added a comment -

            Build couchbase-server-7.0.0-3035 contains kv_engine commit 227d541 with commit message:
            MB-36370: Optimize the cluster_testapp replication proxy

            People

              paolo.cocchi Paolo Cocchi
              sarath Sarath Lakshman


                Gerrit Reviews

                  There are 4 open Gerrit changes
