Couchbase Server / MB-49170

Replica item count lagging active in Magma insert test


Details

    Description

Note: Initially opened for performance variation between runs; that has since been addressed, but this ticket now tracks the issue where the replica item count does not reach the active item count.

In Magma insert-only tests, we see high performance variation.

      http://172.23.123.237/#/timeline/Linux/hidd/S0/all

      In the latest runs with build 7.1.0-1558, the throughput changed from 122K to 226K.

      http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=rhea_710-1558_access_key_prefix_e3f5&&label=high_insert_rate&snapshot=rhea_710-1558_access_key_prefix_3994&label=low_insert_rate

In the run with higher throughput, the replica sync rate can't catch up.

        There are more sync write flushes after a certain point.

        Sarath Lakshman

Please take a look. Is there a way we can change checkpoint settings? It looks like the runs can end up in different modes (or code paths), even with the same build.

      Attachments

        Issue Links


          Activity

            sarath Sarath Lakshman added a comment - - edited

            Could you bisect this to the last good build where the numbers are stable?

It is not obvious whether any magma or kv change led to the inconsistent throughput.

The NumSyncFlushes increase is a side effect of large writes in active buckets and a larger write queue when the replica hasn't caught up.

            bo-chun.wang Bo-Chun Wang added a comment -

            I did the bisection. It looks like the issue started in build 1529.

            Build 1528: http://perf.jenkins.couchbase.com/job/rhea-5node2/1563/ 

            Build 1529: http://perf.jenkins.couchbase.com/job/rhea-5node2/1568/ 

             


sarath Sarath Lakshman added a comment -

There is no magma change between these builds. For kv-engine, there isn't any relevant change. Bo-Chun Wang Could you rerun the builds and bisect further?

            bo-chun.wang Bo-Chun Wang added a comment -

            I have run both runs two times. I'm not able to bisect further.

             

             

             

Build        Throughput   Job
7.1.0-1528   288,031      http://perf.jenkins.couchbase.com/job/rhea-5node2/1526/
7.1.0-1528   307,404      http://perf.jenkins.couchbase.com/job/rhea-5node2/1563/
7.1.0-1529   120,653      http://perf.jenkins.couchbase.com/job/rhea-5node2/1566/
7.1.0-1529   120,633      http://perf.jenkins.couchbase.com/job/rhea-5node2/1568/
            sarath Sarath Lakshman added a comment - - edited

The write/read-write throughput has improved due to block cache-related changes in the recent builds.

            We observe that replication is lagging behind significantly now.

            http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=rhea_710-1650_access_key_prefix_bd43

Daniel Owen It would be good if somebody from the KV team could take a look at why replication is lagging behind.

            drigby Dave Rigby added a comment -

            Sarath Lakshman Not sure I follow the statement that replication is lagging behind "now" - from the above snapshot (http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=rhea_710-1650_access_key_prefix_bd43) I see the following DCP replication queue graph:

i.e. there's a reasonably steady number of items waiting to be replicated from 600s onwards - between 0 and 600,000 at any one time.

Looking at the disk queue (i.e. how many items are sitting in CheckpointManager waiting to be persisted to disk), we see the following:

            i.e. the number of outstanding items is hovering around 3-4 million; items just haven't been persisted yet.

Note also that earlier runs of this test probably suffer from MB-47318, which results in problems with DCP replication - this affects builds 7.1.0-1571 to 7.1.0-1650 (thanks Dan).

We probably want a Linux perf profile to see what the various threads are doing, if we want to dig further into why the disk queue is so large.

            sarath Sarath Lakshman added a comment - - edited

We started observing replica_curr_items (410M) much lower than curr_items (800M) over the last couple of months, since we removed the cursor_dropping_upper_mark=100 workaround. (For the last couple of builds before 7.1.0-1620, we observed the replica and active curr_items mostly matching. That could be due to slow writes in magma resulting in overall writes running at slower rates.)

But in general, the test is close to saturating the SSD I/O bandwidth. The write queue growth is due to the I/O saturation. The active current items count is increasing quickly while the replica is catching up slowly. Maybe this is expected behavior for an unthrottled write workload, since we do not have an incoming mutation throttler for the active vbucket while the replica vbucket is lagging behind the active vbucket curr_items.

            owend Daniel Owen added a comment -

            Hi Bo-Chun Wang,

Discussed with Dave Rigby and agreed that it would be good to get a rerun on a newer build that does not suffer from the DCP replication regression (i.e. 7.1.0-1650+).
It would be good to get 3 runs on the same build to gain confidence in the result.

            thanks

            bo-chun.wang Bo-Chun Wang added a comment -

            Daniel Owen

            I have re-run the test with build 7.1.0-1650, and the results are consistent. 

            http://showfast.sc.couchbase.com/#/runs/kv_throughput_250M_insert_only_uniform_rhea_5node2/7.1.0-1650

            The latest result is better, too.

            owend Daniel Owen added a comment -

            Hi Bo-Chun Wang,

Thanks for setting off the runs. Given it's a throughput test, I assume the higher the value the better.
Also, given the results are consistent, I will mark this as resolved.

            sarath Sarath Lakshman added a comment - - edited

            Daniel Owen Is the replica curr_items lagging behind active curr_items expected behavior?
            The value is higher because replica is lagging behind and active is able to process more items/sec.

            drigby Dave Rigby added a comment -

Looking at the prometheus metrics via Grafana, it looks like the replica checkpoints are at the checkpoint high watermark (the Checkpoint quota, not shown, is 30% of the bucket quota, or 1.93GB; the checkpoint high watermark is 90% of the checkpoint quota, or 1.74GB):

            (Node .27 is essentially the same)

i.e. the replica checkpoints are consuming essentially all of the Checkpoint memory quota. Typically this memory is freed either by deleting entire checkpoints once they are closed and unreferenced, or by expelling items which are clean (have been persisted) if the entire checkpoint isn't yet unreferenced (thirdly by cursor-dropping, but that only helps if the replication cursors are the ones holding up memory usage).

Note that checkpoint memory freeing (either entire checkpoint removal or item expelling) has also stopped:

            Full dashboard for reference:

            drigby Dave Rigby added a comment -

Re-opening this: while the perf variation seems gone, there is an issue with the active & replica item counts...

            drigby Dave Rigby added a comment - - edited

            Suspect that the high replica checkpoint memory usage might be related to MB-49262 - as instead of replica checkpoint memory dropping to the checkpoint low watermark, it's hovering around the high...

            paolo.cocchi Paolo Cocchi added a comment - - edited

            Checkpoint memory state at the time of cbcollect, node .26:

             ep_checkpoint_memory:                                  1718899382
             ep_checkpoint_memory_ratio:                            0.3
             ep_checkpoint_memory_recovery_lower_mark:              0.6
             ep_checkpoint_memory_recovery_upper_mark:              0.9
             ep_checkpoint_memory_unreferenced:                     0
             ep_max_size:                                           6442450944
            

• cm_quota: 1932735283.2 (ep_max_size * ep_checkpoint_memory_ratio)
• recovery_upper_mark_bytes: 1739461754.88 (cm_quota * recovery_upper_mark)

            (checkpoint_memory < recovery_upper_mark_bytes) so from the checkpoint PoV we aren't expected to recover anything in that state.
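A minimal sketch (Python, for illustration only) of that check, plugging in the stat values quoted above for node .26:

 # Checkpoint-memory recovery check, using the stats from node .26 above.
 ep_max_size = 6442450944            # bucket quota, bytes
 ep_checkpoint_memory = 1718899382   # current checkpoint memory usage, bytes
 ep_checkpoint_memory_ratio = 0.3
 recovery_upper_mark = 0.9           # ep_checkpoint_memory_recovery_upper_mark

 cm_quota = ep_max_size * ep_checkpoint_memory_ratio          # 1932735283.2 (~1.93 GB)
 recovery_upper_mark_bytes = cm_quota * recovery_upper_mark   # 1739461754.88 (~1.74 GB)

 # Recovery (checkpoint removal / item expel / cursor dropping) only kicks in once
 # checkpoint memory exceeds the upper mark, which is not the case here.
 print(ep_checkpoint_memory < recovery_upper_mark_bytes)      # True -> no recovery expected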

            But, the run isn't stuck, replication is proceeding.

            paolo.cocchi Paolo Cocchi added a comment - - edited

            Update

            Sarath Lakshman

            Daniel Owen Is the replica curr_items lagging behind active curr_items expected behavior?
            The value is higher because replica is lagging behind and active is able to process more items/sec.

            That all depends on the particular test.

If we consider a load over a single frontend connection, then replication (which is handled on 1 connection per replication link) is expected to keep up with it.

But in general, load tests (and this one is no exception) perform a bulk load over multiple frontend connections to reach high throughput.
Imagine a 2-node cluster with 1 replica. You might load the active items over 10 frontend connections, but replication still happens on a single connection, so the replication throughput is expected to be much lower than the frontend throughput.
In that case it's pretty normal to see replica items lagging behind active items during the bulk load (see the sketch below).
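A toy sketch (Python) of that point; the per-connection rates below are assumed numbers for illustration, not measurements from this test:

 frontend_connections = 10      # assumed: clients loading active vbuckets
 ops_per_connection = 20_000    # assumed items/sec per frontend connection
 replication_rate = 60_000      # assumed items/sec for the single DCP replication stream

 frontend_rate = frontend_connections * ops_per_connection   # 200,000 items/sec into actives

 duration_s = 600
 active_items  = frontend_rate * duration_s                        # 120,000,000
 replica_items = min(active_items, replication_rate * duration_s)  #  36,000,000

 print(active_items, replica_items)  # replica curr_items lags active during the bulk load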

            I'm repeating the test on a build that contains the fix for MB-49262 for assessing the current state.


paolo.cocchi Paolo Cocchi added a comment -

Sarath Lakshman
I ran http://perf.jenkins.couchbase.com/job/rhea-5node2/1679/console on build 1729 (which contains the fix for MB-49262).
That run seems smooth. I confirm that replication is still in progress when the test is shut down, but that seems to be the only reason for replication not completing:

From the run console, the shutdown timing matches the time at which we stop getting live stats from the cluster:

            16:18:38 2021-11-22T07:18:40 [INFO] Terminating Celery workers
            16:18:40 2021-11-22T07:18:40 [INFO] Killing the following process: celery
            16:18:40 2021-11-22T07:18:40 [INFO] Running cbcollect_info with redaction
            16:18:40 2021-11-22T07:18:40 [INFO] Running cbcollect_info with redaction
            16:18:40 2021-11-22T07:18:40 [INFO] Running cbcollect_info with redaction
            16:18:40 2021-11-22T07:18:40 [INFO] Running cbcollect_info with redaction
            16:18:40 2021-11-22T07:18:40 [INFO] Running cbcollect_info with redaction
            16:18:40 2021-11-22T07:23:11 [INFO] Resetting kernel memory settings
            16:23:11 2021-11-22T07:23:11 [INFO] Changing GRUB configuration
            16:23:11 2021-11-22T07:23:20 [INFO] Rebooting the node                         <--  !!
            


paolo.cocchi Paolo Cocchi added a comment -

Sarath Lakshman I'm resolving this as "normal behaviour" as per the previous comments.
            Please see MB-36370 for details on the latest DCP performance assessment.
            Thanks

            drigby Dave Rigby added a comment -

            Thanks Paolo Cocchi for looking into this.

            Bo-Chun Wang Note that Paolo observed the following in the last test run:

            I confirm that replication is still in progress when the test is shut down, but that seems the only reason for replication not completing:

I think we need to look into this - I would expect that the test should wait until replication has completed. Otherwise we are potentially tracking misleading metrics if the work to replicate fully is not included in the measurements.

As a concrete example, say we make some change in KV-Engine which means we can process client requests (and persist them to disk) 10% faster, but at the same time we slow down DCP replication by 20%. That might give a score on this test which is 10% better; however, in reality we have regressed the system overall.
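A quick back-of-the-envelope version of that example in Python; the baseline rates are assumed, purely for illustration:

 baseline_client_rate = 100_000       # items/sec: process off network + persist on active
 baseline_replication_rate = 100_000  # items/sec: replicate + persist on replica

 new_client_rate = baseline_client_rate * 1.10            # client path 10% faster
 new_replication_rate = baseline_replication_rate * 0.80  # DCP replication 20% slower

 # If the test only measures frontend throughput, the change looks 10% better...
 print(new_client_rate / baseline_client_rate)            # 1.1

 # ...but the sustainable end-to-end rate is bounded by the slowest stage,
 # so overall the system has regressed.
 print(min(new_client_rate, new_replication_rate) /
       min(baseline_client_rate, baseline_replication_rate))  # 0.8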


sarath Sarath Lakshman added a comment -

Thanks Paolo Cocchi, Dave Rigby. I think we should update the test to wait for replication to finish before measuring throughput, so that we report a realistic number.

            bo-chun.wang Bo-Chun Wang added a comment - - edited

Currently, our throughput tests measure how many client requests the cluster can handle, and I think that number is important. The same concept applies to latency tests: we measure how fast client requests can be completed. The latency is measured when the requests are completed, not when replication is done. I agree DCP replication performance is important, but we should measure the two things separately.

            1. client request process performance (the throughput we measure)
            2. dcp replication performance

            If the client request process performance is 10% slower and DCP replication performance is 20% faster, I don't think it means the overall system performance is better.

            Rebalance tests are one method we use to measure DCP replication performance. If there are any ideas about how we should measure DCP replication performance, please let me know. For this particular test (insert only), another thing we can try is to do a run without replication. 


sarath Sarath Lakshman added a comment -

We ideally want to measure the sustainable write throughput (which includes replication not lagging). This will require the test framework to slow down and adjust the request rate sufficiently to keep replication balanced. Maybe reducing the number of clients may help slightly here. If we wait for replication to complete before taking the measurement, the throughput number itself will not be relevant, but it can be used as a number for catching regressions.
Since this is a sanity test intended specifically to measure the storage engine insert rate, changing the test to a no-replica config should also be okay.

            drigby Dave Rigby added a comment -

            I agree with both of you - we want to see how fast KV-Engine can process client requests (and under what latency), but we also want to measure how fast KV-Engine can handle the full lifetime of a request (processing off network, writing to local disk; replicating to remote cluster and writing to disk there).

            As Bo-Chun suggested, the first type of test doesn't really care about replication; so we could simply run without replicas; measuring the throughput where we just process the request on the active node and write it to disk.

            The second type of test is pretty much the test we have here (I believe); however we do want to make sure the performance of the system can be sustained; and we are not building up backlogs which only recover when the workload stops running - i.e. if the test ran for a longer period we would see a drop in throughput long-term.

            Note that KV-Engine will already apply backpressure to the client(s) if persistence / replication cannot keep up (and Bucket memory has been saturated). This back pressure should be better behaved in Neo due to the changes done by Paolo Cocchi and James Harrison recently - namely giving outstanding mutations to be written to disk / replicated (CheckpointManager) their own sub-quota of the overall Bucket quota. We are still tuning this, but currently the CheckpointManager gets a maximum of 30% of the overall Bucket quota in Neo; whereas previously it could in theory consume the entire Bucket quota (leaving essentially zero memory for caching recently accessed items / disk backfills for other DCP clients etc).
I mention this because if we have perf tests whose operation rate KV-Engine cannot sustain, but for the (short) duration they run there was sufficient Bucket Quota in 7.0.0 to cache dirty / queued-for-replication items, we could observe lower throughput under Neo as we would start applying back-pressure to clients "sooner". I would argue that this isn't really a regression; the test simply didn't run for long enough in the first place, as a similar drop in throughput would be seen in 7.0.0 if the test duration was extended.
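For scale, a rough Python sketch of the difference, assuming this test's bucket quota (ep_max_size = 6442450944 from the stats earlier in the thread) and the 30% CheckpointManager cap mentioned above:

 bucket_quota = 6442450944                    # ep_max_size for this test, bytes

 neo_checkpoint_quota = bucket_quota * 0.30   # Neo: CheckpointManager capped at ~1.93 GB

 # Pre-Neo (7.0.0): checkpoints could in theory grow to consume the whole bucket quota,
 # leaving little memory for caching recently accessed items or disk backfills.
 pre_neo_checkpoint_ceiling = bucket_quota    # ~6.44 GB

 print(neo_checkpoint_quota, pre_neo_checkpoint_ceiling)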

            paolo.cocchi Paolo Cocchi added a comment -

Just a small addition. I do get Bo's PoV on isolating "client request process performance" tests from "dcp replication performance". But the problem is that the frontend throughput is directly affected by replication being in place (for a number of reasons, e.g. frontend and replication race on Checkpoints). That is easily observable even on a very simple cluster_run. So I'm not sure how representative a 0-replica ingestion test would be.


ritam.sharma Ritam Sharma added a comment -

Closing all Duplicates, Not a Bug, Incomplete, Duplicate

