Couchbase Server / MB-49170

Replica item count lagging active in Magma insert test


Details

    Description

      Note: Initially opened for performance variation between runs; that has been addressed, but this ticket now tracks the issue where the replica item count does not reach the active item count.

      In Magma insert-only tests, we see high performance variation.

      http://172.23.123.237/#/timeline/Linux/hidd/S0/all

      In the latest runs with build 7.1.0-1558, the throughput changed from 122K to 226K.

      http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=rhea_710-1558_access_key_prefix_e3f5&&label=high_insert_rate&snapshot=rhea_710-1558_access_key_prefix_3994&label=low_insert_rate

      In the run with higher throughput, the replica sync rate can't catch up.

        There are more sync write flushes after a certain point.

        Sarath Lakshman

      Please take a look. Is there a way we can change checkpoint settings? It looks like the runs can go into different modes (or code paths), even with the same build.
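For the replica-lag symptom itself, a test harness needs a way to decide whether the replica has caught up. A minimal polling sketch (the `fetch_counts` callback is hypothetical; a real harness would wrap cbstats and read `curr_items` / `vb_replica_curr_items`):

```python
import time

def wait_for_replica_catchup(fetch_counts, timeout_s=600, poll_s=5):
    """Poll (active, replica) item counts until the replica reaches the active.

    fetch_counts: callable returning (active_items, replica_items); in a real
    test this would query the cluster's stats (assumption, not part of the
    actual perf framework). Returns True if the replica caught up within
    timeout_s, else False.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        active, replica = fetch_counts()
        if replica >= active:
            return True
        time.sleep(poll_s)
    return False
```

Running this after the access phase (rather than only sampling throughput during it) would distinguish a replica that eventually converges from one that is permanently behind.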

      Attachments

        1. 2c9be7495d498bf4ae151733781c8069.png
          34 kB
          Dave Rigby
        2. fc98c640dcf022cd0036bb64ca36e284.png
          36 kB
          Dave Rigby
        3. MB-49170_build-1729.png
          485 kB
          Paolo Cocchi
        4. Screen Shot 2021-10-26 at 11.17.30 AM.png
          126 kB
          Bo-Chun Wang
        5. Screen Shot 2021-10-26 at 11.40.04 AM.png
          194 kB
          Bo-Chun Wang
        6. Screen Shot 2021-10-26 at 11.40.28 AM.png
          192 kB
          Bo-Chun Wang
        7. Screen Shot 2021-10-26 at 11.46.41 AM.png
          190 kB
          Bo-Chun Wang
        8. Screen Shot 2021-10-28 at 6.14.34 PM.png
          125 kB
          Bo-Chun Wang
        9. Screen Shot 2021-11-09 at 1.57.13 PM.png
          63 kB
          Bo-Chun Wang
        10. Screen Shot 2021-11-09 at 2.05.11 PM.png
          118 kB
          Bo-Chun Wang
        11. Screenshot 2021-11-09 at 2.32.41 PM.png
          12 kB
          Sarath Lakshman
        12. Screenshot 2021-11-09 at 2.35.58 PM.png
          131 kB
          Sarath Lakshman
        13. Screenshot 2021-11-10 at 15.30.33.png
          90 kB
          Dave Rigby
        14. Screenshot 2021-11-10 at 15.41.23.png
          57 kB
          Dave Rigby
        15. Screenshot 2021-11-10 at 15.46.40.png
          378 kB
          Dave Rigby

        Issue Links


          Activity

            bo-chun.wang Bo-Chun Wang added a comment - - edited

            Currently, our throughput tests measure how many client requests the cluster can handle, and I think that number is important. The same concept applies to latency tests: we measure how fast client requests can be completed. Latency is measured when a request completes, not when replication is done. I agree DCP replication performance is important, but we should measure the two things separately.

            1. Client request processing performance (the throughput we measure)
            2. DCP replication performance

            If the client request process performance is 10% slower and DCP replication performance is 20% faster, I don't think it means the overall system performance is better.

            Rebalance tests are one method we use to measure DCP replication performance. If there are any ideas about how we should measure DCP replication performance, please let me know. For this particular test (insert only), another thing we can try is to do a run without replication. 


            sarath Sarath Lakshman added a comment -

            We ideally want to measure the sustainable write throughput (which includes replication not lagging). This would require the test framework to slow down and adjust the request rate enough to balance replication. Reducing the number of clients may help slightly here. If we wait for replication to complete before measuring, the throughput number itself will not be relevant, but it can be used for catching regressions.
            Since this is a sanity test specifically for measuring the storage engine insert rate, changing the test to a no-replica config should also be okay.
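The "slow down to balance replication" idea above amounts to a closed-loop rate controller in the workload generator. A sketch of one possible policy (all thresholds and factors are illustrative, not values from the actual perf framework):

```python
def adjust_request_rate(current_rate, replica_lag_items, target_lag=10_000,
                        backoff=0.9, recover=1.05, max_rate=250_000):
    """One control step for a load generator's target ops/sec.

    Back off multiplicatively while the replica lags by more than
    target_lag items, and recover slowly (capped at max_rate) once
    replication has caught up. The steady-state rate this converges to
    approximates the sustainable write throughput Sarath describes.
    """
    if replica_lag_items > target_lag:
        return current_rate * backoff
    return min(current_rate * recover, max_rate)
```

A multiplicative-decrease / gentle-increase shape (as in TCP congestion control) keeps the rate probing upward without letting the replication backlog grow without bound.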
            drigby Dave Rigby added a comment -

            I agree with both of you - we want to see how fast KV-Engine can process client requests (and under what latency), but we also want to measure how fast KV-Engine can handle the full lifetime of a request (processing off network, writing to local disk; replicating to remote cluster and writing to disk there).

            As Bo-Chun suggested, the first type of test doesn't really care about replication; so we could simply run without replicas; measuring the throughput where we just process the request on the active node and write it to disk.

            The second type of test is pretty much the test we have here (I believe); however we do want to make sure the performance of the system can be sustained; and we are not building up backlogs which only recover when the workload stops running - i.e. if the test ran for a longer period we would see a drop in throughput long-term.

            Note that KV-Engine will already apply backpressure to the client(s) if persistence / replication cannot keep up (and Bucket memory has been saturated). This back-pressure should be better behaved in Neo due to the changes made recently by Paolo Cocchi and James Harrison - namely giving outstanding mutations waiting to be written to disk / replicated (the CheckpointManager) their own sub-quota of the overall Bucket quota. We are still tuning this, but currently the CheckpointManager gets a maximum of 30% of the overall Bucket quota in Neo; whereas previously it could in theory consume the entire Bucket quota (leaving essentially zero memory for caching recently accessed items, disk backfills for other DCP clients, etc.).
            I mention this because if we have perf tests whose operation rate KV-Engine cannot sustain, but for the (short) duration they run there was sufficient Bucket quota in 7.0.0 to cache dirty / queued-for-replication items, we could observe lower throughput under Neo as we would start applying back-pressure to clients "sooner". I would argue that this isn't really a regression; the test simply didn't run for long enough in the first place, as a similar drop in throughput would be seen in 7.0.0 if the test duration were extended.
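The "backpressure kicks in sooner" effect can be illustrated with simple quota arithmetic. The 30% sub-quota comes from the comment above; the bucket quota and backlog growth rate below are made-up numbers purely for illustration:

```python
def seconds_until_backpressure(bucket_quota_bytes, checkpoint_quota_fraction,
                               net_backlog_bytes_per_sec):
    """Time a workload can queue dirty / unreplicated items before the
    checkpoint memory ceiling is hit and clients start seeing
    backpressure. net_backlog_bytes_per_sec is the rate at which queued
    data outpaces flushing/replication (illustrative model only).
    """
    ceiling = bucket_quota_bytes * checkpoint_quota_fraction
    return ceiling / net_backlog_bytes_per_sec

quota = 10 * 1024**3          # 10 GiB bucket quota (assumed)
backlog_rate = 50 * 1024**2   # 50 MiB/s net backlog growth (assumed)

# Pre-Neo: checkpoints could in theory consume the whole Bucket quota.
t_700 = seconds_until_backpressure(quota, 1.0, backlog_rate)
# Neo: CheckpointManager capped at 30% of the Bucket quota.
t_neo = seconds_until_backpressure(quota, 0.30, backlog_rate)
```

Under these assumed numbers the Neo run hits the ceiling in well under a third of the time, so a short test can look like a regression while a longer run would converge to the same sustained rate in both versions.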

            paolo.cocchi Paolo Cocchi added a comment -

            Just a small addition. I do get Bo's PoV on isolating "client request processing performance" tests from "DCP replication performance". But the problem is that frontend throughput is directly affected by replication being in place (for a number of reasons, e.g. the frontend and replication race on Checkpoints). That is easily observable even on a very simple cluster_run. So I'm not sure how representative a 0-replica ingestion test would be.


            ritam.sharma Ritam Sharma added a comment -

            Closing all Duplicates, Not a Bug, Incomplete, Duplicate

            People

              bo-chun.wang Bo-Chun Wang
              Votes: 0
              Watchers: 6