
Improve XDCR performance with Magma


Details

    Description

      I re-ran two existing XDCR tests with Magma. Compared to Couchstore, Magma performance is about 50% lower. I am opening this ticket to track XDCR+Magma performance improvements. All runs were on build 7.1.0-1401.

       

      Avg. initial XDCR rate (items/sec), 1 -> 1 (2 source nozzles, 4 target nozzles), 1 bucket x 100M x 1KB

       

       

      Avg. initial XDCR rate (items/sec), 5 -> 5 (2 source nozzles, 4 target nozzles), 1 bucket x 250M x 1KB

       


          Activity

            bo-chun.wang Bo-Chun Wang created issue -
            jliang John Liang added a comment - - edited

            It is 100% resident ratio. XDCR works off the DCP and memcached APIs, so it is unlikely to be an XDCR issue, though.

            jliang John Liang made changes -
            Field Original Value New Value
            Assignee John Liang [ jliang ] Neil Huang [ neil.huang ]
            srinath.duvuru Srinath Duvuru made changes -
            Issue Type Improvement [ 4 ] Bug [ 1 ]
            jliang John Liang made changes -
            Assignee Neil Huang [ neil.huang ] Lilei Chen [ lilei.chen ]
            wayne Wayne Siu made changes -
            Assignee Lilei Chen [ lilei.chen ] Bo-Chun Wang [ bo-chun.wang ]
            wayne Wayne Siu added a comment -

            Bo-Chun Wang

            Can you check the performance on DCP?  Thanks.

            bo-chun.wang Bo-Chun Wang made changes -
            bo-chun.wang Bo-Chun Wang made changes -
            bo-chun.wang Bo-Chun Wang made changes -
            bo-chun.wang Bo-Chun Wang made changes -
            bo-chun.wang Bo-Chun Wang made changes -
            bo-chun.wang Bo-Chun Wang added a comment - - edited

            I took a look at both tests, and I see similar behavior.

            Avg. initial XDCR rate (items/sec), 1 -> 1 (2 source nozzles, 4 target nozzles), 1 bucket x 100M x 1KB

            Storage      XDCR rate    Job
            Couchstore   141381       http://perf.jenkins.couchbase.com/job/titan/12218/
            Magma        79794        http://perf.jenkins.couchbase.com/job/titan/12214/

            Source: http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=titan_c1_710-1401_init_xdcr_7224&label=couchstore&snapshot=titan_c1_710-1401_init_xdcr_fcee&label=magma

            Destination: http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=titan_c2_710-1401_init_xdcr_4ba7&label=couchstore&snapshot=titan_c2_710-1401_init_xdcr_cf1f&label=magma

            I see Couchstore has a higher DCP drain rate at the source (c1).

            Both runs have a 100% resident ratio. However, I see the Magma run is reading data from disk while the Couchstore run isn't, which results in higher disk utilization in the Magma run.

            At the destination (c2), I see the Magma run has bg wait time while the Couchstore run doesn't.

            I'm assigning the ticket to the KV team so they can take a look at it.

            bo-chun.wang Bo-Chun Wang made changes -
            Assignee Bo-Chun Wang [ bo-chun.wang ] Daniel Owen [ owend ]
            wayne Wayne Siu made changes -
            Labels magma xdcr magma performance xdcr
            jliang John Liang made changes -
            Component/s couchbase-bucket [ 10173 ]
            owend Daniel Owen made changes -
            Rank Ranked higher
            owend Daniel Owen made changes -
            Component/s XDCR [ 10136 ]
            owend Daniel Owen made changes -
            Status Open [ 1 ] In Progress [ 3 ]
            owend Daniel Owen made changes -
            Sprint KV 2021-Dec [ 1906 ]
            owend Daniel Owen made changes -
            Rank Ranked lower
            owend Daniel Owen added a comment - - edited

            Focusing on the Avg. initial XDCR rate (items/sec), 1 -> 1 (2 source nozzles, 4 target nozzles), 1 bucket x 100M x 1KB test, which is single-node.

            Source (c1) - .105
            Destination (c2) - .100

            On the destination, the only WARNINGs we see are 25 'Slow runtime' messages for 'Destroying closed unreferenced checkpoints'. All but 3 are < 1 second. The exceptions are:

            2021-10-07T19:00:02.419289-07:00 WARNING (No Engine) Slow runtime for 'Destroying closed unreferenced checkpoints' on thread NonIoPool1: 1079 ms
            2021-10-07T19:00:07.453254-07:00 WARNING (No Engine) Slow runtime for 'Destroying closed unreferenced checkpoints' on thread NonIoPool1: 5034 ms
            2021-10-07T19:00:35.328182-07:00 WARNING (No Engine) Slow runtime for 'Destroying closed unreferenced checkpoints' on thread NonIoPool1: 5100 ms
            

            Therefore the focus is on the source side.

            owend Daniel Owen added a comment - - edited

            Couchstore (node .105)
            1024 backfills scheduled at T00:47:49
            backfill complete - T00:57:23 to T00:59:24

            Couchstore backfills take between 10 and 12 minutes

            Magma (node .105)
            1024 backfills scheduled at T18:44:24
            backfill complete - T19:01:25 to T19:05:03

            Magma backfills take between 17 and 21 minutes

            owend Daniel Owen made changes -
            Attachment magma-105.png [ 171480 ]
            owend Daniel Owen added a comment - - edited

            Focusing on Magma run.
            Memory from the KV perspective for the source node (see attached magma-105.png):

            owend Daniel Owen added a comment - - edited

            However, in memcached.log we repeatedly see the following message, 30K times in total:

            2021-10-07T18:44:25.932084-07:00 WARNING (bucket-1) MagmaKVStore::scan lookup->callback vb:144 key:<ud>cid:0x0:9e6655-000004112907</ud> returned cb::engine_errc::no_memory
            ...
            2021-10-07T19:05:03.447289-07:00 WARNING (bucket-1) MagmaKVStore::scan lookup->callback vb:417 key:<ud>cid:0x0:9e6655-000096367441</ud> returned cb::engine_errc::no_memory
            

            Update: after speaking to Ben Huddleston, these messages can be ignored; it just means the DCP buffer is full.

            The logging has been addressed in https://review.couchbase.org/c/kv_engine/+/166762 - Thanks Ben Huddleston

            owend Daniel Owen made changes -
            Epic Link MB-30659 [ 88207 ]
            Is this a Regression? Yes [ 10450 ]
            owend Daniel Owen made changes -
            Attachment magma-105-sentitems.png [ 171828 ]
            owend Daniel Owen added a comment -

            Node .105 on the Magma run

            owend Daniel Owen made changes -
            Attachment couchstore-105-sentitems.png [ 171834 ]
            owend Daniel Owen added a comment -

            Node .105 on the couchstore run

            owend Daniel Owen made changes -
            Attachment magma-105-backfills.png [ 171835 ]
            owend Daniel Owen made changes -
            Attachment couchstore-105-backfills.png [ 171836 ]
            owend Daniel Owen added a comment -

            A final observation: on the Magma run on node .105 we see a couple of very slow runtimes.

            2021-10-07T18:41:34.374270-07:00 WARNING (No Engine) Slow runtime for 'Destroying closed unreferenced checkpoints' on thread NonIoPool4: 88 s
            2021-10-07T18:43:50.265101-07:00 WARNING (No Engine) Slow runtime for 'Destroying closed unreferenced checkpoints' on thread NonIoPool1: 136 s
            

            After discussing with James Harrison, we looked at the task runtimes and see the slowest CheckpointDestroyerTask[NonIO] is:

                 327ms - 5505ms : (100.0000%)   1
            

            So it may be an issue with the reporting of slow runtimes; however, it warrants further investigation.

            But in summary, from the investigation so far it is reasonable to conclude that the slowdown is due to backfills taking nearly 2x longer with Magma.

            owend Daniel Owen made changes -
            Status In Progress [ 3 ] Open [ 1 ]
            owend Daniel Owen made changes -
            Triage Triaged [ 10350 ]
            owend Daniel Owen made changes -
            Sprint KV 2021-Dec [ 1906 ]
            owend Daniel Owen made changes -
            Rank Ranked higher
            owend Daniel Owen made changes -
            Component/s couchbase-bucket [ 10173 ]
            Component/s storage-engine [ 10175 ]
            owend Daniel Owen made changes -
            Assignee Daniel Owen [ owend ] Srinath Duvuru [ srinath.duvuru ]
            srinath.duvuru Srinath Duvuru made changes -
            Assignee Srinath Duvuru [ srinath.duvuru ] Sarath Lakshman [ sarath ]
            james.harrison James Harrison made changes -
            Link This issue is duplicated by MB-48569 [ MB-48569 ]

            sarath Sarath Lakshman added a comment -

            Bo-Chun Wang Can we rerun the test with all graphs enabled (all the ones we generally run for magma perf tests)?

            Do we have an XDCR magma test with a lower residence ratio and larger data density?
            Are we seeing similar degradation in those tests?

            For 100% in-memory tests, given the current magma design, degradation may be expected.
            For magma we store the key and value together in the seqIndex. Even if we can fetch the value from the kv-engine in-memory cache, the value read will happen from disk. So couchstore incurs only the cost of reading keys, while magma has to read both key and value. For lower-resident buckets, magma's I/O cost for fetching values should be lower than couchstore's.

            jliang John Liang added a comment -

            Sarath Lakshman The degradation is 50%. If it is 100% in-memory, why does it need to fetch the key and value from disk?


            sarath Sarath Lakshman added a comment -

            Even though all key-values are available in the kv-engine cache, the seqIndex has to be used to read in bySeqno order.

            The degradation is a problem we need to think through more. The in-memory lookup/skip value read is not something we thought through in the design. The degradation could happen due to the layout of kv-pairs and index stored on disk.

            For the magma seqIndex, we pack 4KB worth of kv pairs into an sstable data block, and index blocks point to the data blocks. When we do a bySeqno iteration, we have to read all the data blocks even if we do not use the values. The seqIndex iterator has to return the key and metadata, and since we store key, meta, and value contiguously, the read I/O is unavoidable. For couchstore, values are stored separately, so it can optionally skip the extra read I/O if we do not want to read the value. We may have to think about some index design changes to overcome this problem, but it is likely a difficult problem to solve.
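
            To make the layout difference concrete, here is a purely illustrative C++ sketch; the structures and names below are toy stand-ins (not the actual magma or couchstore on-disk formats), showing why a bySeqno scan over packed key/meta/value blocks touches value bytes that a separate-value layout can skip:

            // Purely illustrative toy structures, not the actual magma or couchstore code.
            #include <cstdint>
            #include <iostream>
            #include <string>
            #include <vector>

            // Magma-style layout: key, metadata and value live contiguously in a ~4KB
            // sstable data block, so a bySeqno iteration touches the value bytes even
            // when the caller only needs key + metadata.
            struct PackedRecord {
                uint64_t seqno;
                std::string key;
                std::string meta;
                std::string value;  // read from disk together with key/meta
            };

            // Couchstore-style layout: the by-seqno index stores key + metadata plus a
            // pointer to the value, so a metadata-only scan can skip the value I/O.
            struct SeqIndexEntry {
                uint64_t seqno;
                std::string key;
                std::string meta;
                uint64_t valueOffset;  // value fetched separately, only if requested
            };

            int main() {
                std::vector<PackedRecord> magmaBlock = {
                    {1, "k1", "m1", std::string(1024, 'v')},
                    {2, "k2", "m2", std::string(1024, 'v')},
                };
                size_t bytesTouched = 0;
                for (const auto& r : magmaBlock) {
                    // Even a "keys and metadata only" backfill pays for the value bytes.
                    bytesTouched += r.key.size() + r.meta.size() + r.value.size();
                }
                std::cout << "packed-layout scan touched " << bytesTouched << " bytes\n";

                std::vector<SeqIndexEntry> couchIndex = {{1, "k1", "m1", 0}, {2, "k2", "m2", 4096}};
                bytesTouched = 0;
                for (const auto& e : couchIndex) {
                    bytesTouched += e.key.size() + e.meta.size();  // value I/O skipped
                }
                std::cout << "separate-value scan touched " << bytesTouched << " bytes\n";
            }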

            jliang John Liang added a comment -

            Got it.


            bo-chun.wang Bo-Chun Wang added a comment -

            I will do non-DGM/DGM runs with kvstore stats enabled.

            bo-chun.wang Bo-Chun Wang added a comment - - edited

            I have re-run the tests and collected kvstore stats. All runs used build 7.1.0-1885. Couchstore has better performance in both tests. Note that there is a regression in XDCR tests (MB-50016), so the numbers are lower than the previous ones.

             

            Avg. initial XDCR rate (items/sec), 5 -> 5 (2 source nozzles, 4 target nozzles), 1 bucket x 1G x 1KB, DGM

            Storage      XDCR rate    Job
            Magma        236860       http://perf.jenkins.couchbase.com/job/titan-xdcr-dev/67/
            Couchstore   469036       http://perf.jenkins.couchbase.com/job/titan-xdcr-dev/70/

             

            Avg. initial XDCR rate (items/sec), 5 -> 5 (2 source nozzles, 4 target nozzles), 1 bucket x 250M x 1KB

            Storage      XDCR rate    Job
            Magma        227256       http://perf.jenkins.couchbase.com/job/titan-xdcr-dev/69/
            Couchstore   358389       http://perf.jenkins.couchbase.com/job/titan-xdcr-dev/68/


            sarath Sarath Lakshman added a comment -

            Bo-Chun Wang Can we run a variant of the experiment, 1 bucket x 1G x 1KB, DGM, with the source bucket as couchstore and the destination bucket as magma?

            bo-chun.wang Bo-Chun Wang added a comment - - edited

            I finished a run. The source bucket is using couchstore, and the destination bucket is using magma. The result is similar to the run using magma for both buckets.

            Avg. initial XDCR rate (items/sec), 5 -> 5 (2 source nozzles, 4 target nozzles), 1 bucket x 1G x 1KB, DGM

            Storage                    XDCR rate    Job
            Magma -> Magma             236860       http://perf.jenkins.couchbase.com/job/titan-xdcr-dev/67/
            Couchstore -> Couchstore   469036       http://perf.jenkins.couchbase.com/job/titan-xdcr-dev/70/
            Couchstore -> Magma        269379       http://perf.jenkins.couchbase.com/job/titan-xdcr-dev/71/
            Magma -> Couchstore        494571       http://perf.jenkins.couchbase.com/job/titan-xdcr-dev/72/


            sarath Sarath Lakshman added a comment -

            Thanks Bo-Chun Wang. Can we do a run with source=magma and dest=couchstore as well?

            bo-chun.wang Bo-Chun Wang added a comment -

            I have added the new result to the table above.

            jliang John Liang added a comment -

            Bo-Chun Wang Is there a similar test for optimistic replication? If so, can we also do a run with couchstore-magma on optimistic replication? Thanks.

            bo-chun.wang Bo-Chun Wang made changes -
            bo-chun.wang Bo-Chun Wang added a comment -

            We don't have DGM tests for optimistic replication. I will re-run this non-DGM test with couchstore-magma.

            jliang John Liang added a comment -

            For normal replication, it will perform a read (before the write) on every mutation. For optimistic replication, it will only perform the write. So this is just to see if there is any difference.
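
            For illustration, a minimal C++ sketch of the two replication modes at the target; the Target class, its methods, and the revId/CAS comparison are simplified hypothetical stand-ins, not the real XDCR or kv_engine code:

            // Purely illustrative sketch of the two XDCR write paths at the target.
            #include <cstdint>
            #include <iostream>
            #include <map>
            #include <optional>
            #include <string>
            #include <utility>

            struct Meta { uint64_t cas; uint64_t revId; };

            class Target {
            public:
                // In the real system this may need a disk lookup (bgFetch) when the
                // item's metadata isn't resident in memory.
                std::optional<Meta> getMeta(const std::string& key) const {
                    auto it = docs.find(key);
                    if (it == docs.end()) return std::nullopt;
                    return it->second.second;
                }
                void setWithMeta(const std::string& key, const std::string& value, Meta m) {
                    docs[key] = {value, m};
                }
            private:
                std::map<std::string, std::pair<std::string, Meta>> docs;
            };

            // Normal replication: read the target's metadata, write only if the source wins.
            void replicateNormal(Target& t, const std::string& key,
                                 const std::string& value, Meta src) {
                auto dst = t.getMeta(key);  // read-before-write on every mutation
                if (!dst || src.revId > dst->revId ||
                    (src.revId == dst->revId && src.cas > dst->cas)) {
                    t.setWithMeta(key, value, src);
                }
            }

            // Optimistic replication: skip the read and send the write straight away;
            // the target still resolves conflicts inside setWithMeta.
            void replicateOptimistic(Target& t, const std::string& key,
                                     const std::string& value, Meta src) {
                t.setWithMeta(key, value, src);
            }

            int main() {
                Target t;
                replicateNormal(t, "doc1", "v1", {100, 1});
                replicateOptimistic(t, "doc2", "v1", {100, 1});
                std::cout << (t.getMeta("doc1") && t.getMeta("doc2") ? "ok" : "fail") << "\n";
            }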

            sarath Sarath Lakshman made changes -
            sarath Sarath Lakshman made changes -

            sarath Sarath Lakshman added a comment -

            Comparison between magma and couchstore as the destination:
            http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=titan_c2_710-1885_init_xdcr_2773&label=magma_dst&snapshot=titan_c2_710-1885_init_xdcr_389b&label=couch_dst

            The following plot gives a good explanation of why the couchstore destination is fast:

            There is plenty of free memory, and the couchstore data files are 100% cached. Hence, reads performed during btree writes do not incur any I/O.
            Magma uses direct I/O during writes and hence requires an I/O the first time a block is read. But I suspect some ineffectiveness in magma aggressively taking advantage of the page cache: even after it runs for the entire duration, the amount of data cached is very low. I will investigate this further.
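
            A minimal Linux-only sketch of the difference (an assumed standalone example, not magma code): a buffered write leaves the block in the page cache, while an O_DIRECT write bypasses it, so the first read of that block is real device I/O:

            #include <fcntl.h>
            #include <unistd.h>
            #include <cstdlib>
            #include <cstring>
            #include <iostream>

            int main() {
                constexpr size_t blockSize = 4096;  // O_DIRECT needs an aligned buffer and size
                void* buf = nullptr;
                if (posix_memalign(&buf, blockSize, blockSize) != 0) return 1;
                std::memset(buf, 'x', blockSize);

                // Buffered write: the data lands in the page cache as well as, eventually, on disk.
                int fd = open("buffered.dat", O_CREAT | O_WRONLY | O_TRUNC, 0644);
                if (fd >= 0) { (void)write(fd, buf, blockSize); close(fd); }

                // Direct write: the page cache is bypassed, so a later read of this block is
                // real I/O until something (e.g. a buffered read) pulls it into the cache.
                fd = open("direct.dat", O_CREAT | O_WRONLY | O_TRUNC | O_DIRECT, 0644);
                if (fd >= 0) { (void)write(fd, buf, blockSize); close(fd); }
                else std::cerr << "O_DIRECT not supported on this filesystem\n";

                free(buf);
                return 0;
            }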

            srinath.duvuru Srinath Duvuru made changes -
            Issue Type Bug [ 1 ] Task [ 3 ]

            bo-chun.wang Bo-Chun Wang added a comment -

            Avg. initial XDCR rate (items/sec), 1 -> 1 (2 source nozzles, 4 target nozzles), 1 bucket x 100M x 1KB, Optimistic

            Storage      XDCR rate    Job
            Couchstore   149460       http://perf.jenkins.couchbase.com/job/titan-xdcr-dev/73/
            Magma        46902        http://perf.jenkins.couchbase.com/job/titan-xdcr-dev/74/

            jliang John Liang added a comment -

            Sarath Lakshman If the traffic is sequential in order (for both seq and doc key), it won't require a lot of page caching, right? Also, note that it is a backfill, so there is no data at the target cluster to begin with.

            sarath Sarath Lakshman added a comment - - edited

            The source cluster (dcp/backfill) is doing well as it is a sequential read. For magma, the problem is at the destination. Every document write operation requires a disk lookup to maintain the count. In the case of couchstore, 100% of the disk blocks are cached in the page cache as there is plenty of memory. For magma, the read I/Os are slowing down the writes.

            In this case, since all operations are inserts, we may not be doing disk lookups, as the bloom filter helps there. But the compactions are incurring read I/Os.

            sarath Sarath Lakshman made changes -

            sarath Sarath Lakshman added a comment -

            For the magma destination, the write queue is not building up. That indicates there aren't enough mutations coming to the storage engine at a higher rate.


            sarath Sarath Lakshman added a comment -

            For magma, I noticed bg fetches happening on the destination cluster, but for couchstore there are no bg fetches happening.
            This appears to be related to the bloom filter available in kv-engine for couchstore. For the set_with_meta / get_meta operation, couchstore returns not-found immediately by checking the bloom filter. In the case of magma, it queues a bg fetch to find out that an item does not exist (internally the bg fetch results in checking the magma bloom filter). The extra bg fetches are resulting in lower XDCR throughput on the destination cluster. This cluster is running value-only eviction.

            Daniel Owen For value-only eviction, we can avoid any bg fetch for reading doc metadata, right?
            Looking at the code, we do a value-eviction check for the get API, but not in all cases for the setWithMeta and getMeta APIs.
            https://github.com/couchbase/kv_engine/blob/master/engines/ep/src/vbucket.cc#L2926
            https://github.com/couchbase/kv_engine/blob/master/engines/ep/src/vbucket.cc#L2008
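
            As a rough illustration of the decision path being described, here is a hedged C++ sketch; the VBucket structure, the function name, and the delete-only filter are hypothetical simplifications (a plain set stands in for the bloom filter), not the actual kv_engine code:

            #include <iostream>
            #include <string>
            #include <unordered_set>

            struct VBucket {
                std::unordered_set<std::string> residentKeys;      // alive items in the hash table
                std::unordered_set<std::string> deletedKeyFilter;  // couchstore-style delete-only "bloom filter"
                bool haveDeleteFilter;                             // magma: no such filter is maintained
            };

            enum class Outcome { FoundInMemory, NotFoundNoIo, NeedsBgFetch };

            // Value eviction: alive items keep their metadata in the hash table, so the
            // only reason to go to disk is a possible tombstone for the key.
            Outcome getMetaForSetWithMeta(const VBucket& vb, const std::string& key) {
                if (vb.residentKeys.count(key)) {
                    return Outcome::FoundInMemory;
                }
                if (vb.haveDeleteFilter && !vb.deletedKeyFilter.count(key)) {
                    // Couchstore path: the filter says "no tombstone", answer without I/O.
                    return Outcome::NotFoundNoIo;
                }
                // Magma path (no ep-engine delete filter): queue a bgFetch to check disk.
                return Outcome::NeedsBgFetch;
            }

            int main() {
                VBucket couch{{}, {}, true};
                VBucket magma{{}, {}, false};
                std::cout << static_cast<int>(getMetaForSetWithMeta(couch, "new-key")) << " "
                          << static_cast<int>(getMetaForSetWithMeta(magma, "new-key")) << "\n"; // prints "1 2"
            }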

            owend Daniel Owen added a comment -

            Hi Sarath Lakshman,

            Many thanks for your analysis.
            I agree that for value-only eviction we should not require a bg fetch, as we should just be able to examine the hash table.

            Feel free to assign back to me.
            Changing the component from storage_engine to couchbase_bucket.

            owend Daniel Owen made changes -
            Component/s storage-engine [ 10175 ]
            Component/s couchbase-bucket [ 10173 ]
            owend Daniel Owen added a comment -

            I synced-up with Dave Rigby

            We only keep alive items in the hash table in general (deleted items can be present temporarily if someone requests a deleted doc's metadata).
            So in the getMeta case Sarath mentions, even with value eviction, if the item isn't resident in the HT we must go to disk (which can potentially be skipped if a bloom filter tells us there's no such tombstone for that key). See for example https://github.com/couchbase/kv_engine/blob/f9016f1b4acc2dfd1ef911e8a7424fefd95fd0f1/engines/ep/src/vbucket.cc#L2911
            where we return whether it is deleted or not (potentially after a bgfetch, when we call getMeta a second time).

            Sarath Lakshman do you agree that a delete-only bloom filter in ep-engine would be of value for magma value eviction?

            thanks


            sarath Sarath Lakshman added a comment -

            Thanks Daniel Owen.

            If I understand correctly, to avoid a bg fetch on non-existent keys for value-only eviction, we need to keep a bloom filter to address the special case of deleted docs. Does the deleted doc mean a tombstone document?

            In this specific XDCR test case, setWithMeta is the one triggering bgFetch.

            For couchstore, we rebuild the bloom filter every time a full compaction happens. For magma, when the logically deleted doc is removed, we would have to remove it from the bloom filter as well, but a bloom filter does not support a remove operation. Since magma does not have periodic full compaction, we may not be able to rebuild the bloom filter in KV-Engine.

            Magma internally maintains a bloom filter per sstable for the key-existence check. We could expose this through a magma KeyMayExist API that only checks the in-memory bloom filters without any I/O. Essentially, when we queue a bgFetch, it checks against this bloom filter to respond not-found. I wonder whether directly exposing this API and avoiding the bgFetch queueing code path would help improve the throughput.
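
            The KeyMayExist API discussed here is a proposal; the sketch below only illustrates where such a check could sit relative to the bgFetch path. The class, method names, and signatures are hypothetical (a set stands in for the per-sstable bloom filters), not the real magma API:

            #include <iostream>
            #include <string>
            #include <unordered_set>

            class MagmaLike {
            public:
                // Proposed in-memory-only check: consults memtables and per-sstable bloom
                // filters, never issues disk I/O. May return true for keys that don't exist
                // (bloom filters give false positives), but never false for keys that do.
                bool keyMayExist(const std::string& key) const {
                    return maybePresent.count(key) > 0;  // toy stand-in for the bloom filters
                }
                std::unordered_set<std::string> maybePresent;
            };

            enum class MetaResult { NotFoundNoIo, QueueBgFetch };

            // Frontend-thread handling of getMeta/setWithMeta for a non-resident key.
            MetaResult lookupMeta(const MagmaLike& store, const std::string& key) {
                if (!store.keyMayExist(key)) {
                    // Definitely not on disk: answer not-found without queueing a bgFetch.
                    return MetaResult::NotFoundNoIo;
                }
                // Might exist (or a false positive): fall back to the normal bgFetch path.
                return MetaResult::QueueBgFetch;
            }

            int main() {
                MagmaLike store;
                store.maybePresent.insert("existing-key");
                std::cout << static_cast<int>(lookupMeta(store, "brand-new-key")) << " "   // 0: skip bgFetch
                          << static_cast<int>(lookupMeta(store, "existing-key")) << "\n";  // 1: bgFetch
            }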

            drigby Dave Rigby added a comment -

            Sarath Lakshman Does the Magma bloom filter track logically deleted (tombstones) keys, or just alive keys?

            If it's the latter, then I don't think exposing an API to query it directly would necessarily make much difference - the issue with SetWithMeta (and GetMeta) is that even if the item being compared has been deleted, we need to compare CAS (or revId), as the deleted item could still be the "winning" mutation. As such, it's not valid to check the bloom filter to see if an alive item exists, as we also need the deleted ones.
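
            A small illustrative sketch of why deleted items still matter here; the Meta struct and the revId-then-CAS rule below are simplified assumptions, not the exact conflict-resolution code:

            #include <cstdint>
            #include <iostream>

            struct Meta { uint64_t revId; uint64_t cas; bool deleted; };

            // The incoming mutation only wins if it is newer than whatever is stored,
            // regardless of whether the stored item is a tombstone.
            bool incomingWins(const Meta& incoming, const Meta& stored) {
                if (incoming.revId != stored.revId) return incoming.revId > stored.revId;
                return incoming.cas > stored.cas;
            }

            int main() {
                Meta incoming{5, 1000, false};
                Meta tombstone{7, 900, true};  // deleted, but with a higher revId
                // An alive-only filter would report "no item" and let the write through,
                // silently overriding the newer tombstone.
                std::cout << (incomingWins(incoming, tombstone) ? "accept" : "reject") << "\n"; // reject
            }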


            sarath Sarath Lakshman added a comment -

            Got it. So the logically deleted document is also a party in conflict resolution/CAS.

            The magma bloom filter tracks the existence of a doc key item (it can be either a tombstone or an alive key) - not just the alive keys.
            The KeyMayExist API will have to do a lookup in the memtable (we do not maintain a bloom filter for the memtable), then a binary search in the sstable lists in each level. This is certainly more time-consuming than the kv-engine full-key-range bloom filter. These checks happen inside magma::Get() during every bgFetch to identify the sstable to read. Adding this API will only help in skipping the bgFetch code path for the non-existent key check. We can give it a try.

            drigby Dave Rigby added a comment -

            Ok, so it sounds like Magma's BloomFilter behaves like the ep-engine Bloom Filter in full-eviction mode, assuming the resident ratio is below bfilter_residency_threshold (which defaults to 10%):

            • see BloomFilterCallback::callback():

                          if (store.getItemEvictionPolicy() == EvictionPolicy::Value) {
                              /**
                               * VALUE-ONLY EVICTION POLICY
                               * Consider deleted items only.
                               */
                              if (isDeleted) {
                                  vb->addToTempFilter(key);
                              }
                          } else {
                              /**
                               * FULL EVICTION POLICY
                               * If vbucket's resident ratio is found to be less than
                               * the residency threshold, consider all items, otherwise
                               * consider deleted and non-resident items only.
                               */
                              bool residentRatioLessThanThreshold =
                                      vb->isResidentRatioUnderThreshold(
                                              store.getBfiltersResidencyThreshold());
                              if (residentRatioLessThanThreshold) {
                                  vb->addToTempFilter(key);
                              } else {
                                  if (isDeleted || !store.isMetaDataResident(vb, key)) {
                                      vb->addToTempFilter(key);
                                  }
                              }
                          }
              

            As you point out, there isn't the opportunity with Magma to rebuild the ep-engine level Bloom filters during full compaction, so it's not feasible to re-enable the ep-engine value-eviction delete-only Bloom filter when Magma is being used.

            One slightly surprising thing is the slowdown seen when performing the Magma bgFetch for the deleted item - assuming the key being looked up truly doesn't exist (which I'm assuming based on the workload being called "initial load"), then I would expect that Magma should be able to return that information quickly, if it is just checking memtables and (pinned IIRC?) Bloom filter block reads for each SST file. Yes there is the overhead of queuing and managing the BGFetch task, but I wouldn't expect that should be too large.

            If you have the logs to hand, you could look at the bg_wait and bg_load histograms on the destination cluster - those are the times each BGFetcher task was waiting to run, and then how long it ran for. Assuming we won't be able to reduce the BGLoad time much with a new API (it'll be doing essentially the same work, just without the task management), we should see how BGLoad compares to it.

            sarath Sarath Lakshman added a comment - - edited

            Looking at the stats, get_meta is where the most time goes compared to couchstore (avg 2us vs 45us), with the contribution from bg_wait + bg_load being avg 15us + 13us. I agree: unless we are able to add a much more lightweight KeyMayExist to magma which avoids some of the allocations and expensive work in magma::Get as well as the magma-kvstore allocations, we won't see much difference. We could potentially save only the bg_wait time.

             
            Couchstore:GET_META
            [  0.00 -   0.00]us (0.0000%)          38|
            [  0.00 -   1.00]us (10.0000%)   94874274| ############################################
            [  1.00 -   1.00]us (20.0000%)          0|
            [  1.00 -   1.00]us (30.0000%)          0|
            [  1.00 -   1.00]us (40.0000%)          0|
            [  1.00 -   2.00]us (50.0000%)   73288014| #################################
            [  2.00 -   2.00]us (55.0000%)          0|
            [  2.00 -   2.00]us (60.0000%)          0|
            [  2.00 -   2.00]us (65.0000%)          0|
            [  2.00 -   2.00]us (70.0000%)          0|
            [  2.00 -   2.00]us (75.0000%)          0|
            [  2.00 -   2.00]us (77.5000%)          0|
            [  2.00 -   2.00]us (80.0000%)          0|
            [  2.00 -   2.00]us (82.5000%)          0|
            [  2.00 -   3.00]us (85.0000%)   18531107| ########
            [  3.00 -   3.00]us (87.5000%)          0|
            [  3.00 -   3.00]us (88.7500%)          0|
            [  3.00 -   3.00]us (90.0000%)          0|
             
            Couchstore:SET_WITH_META
            [  0.00 -   7.00]us (0.0000%)        1804|
            [  7.00 -  14.00]us (10.0000%)   32829907| ############################################
            [ 14.00 -  15.00]us (20.0000%)   18193265| ########################
            [ 15.00 -  16.00]us (30.0000%)   12680040| ################
            [ 16.00 -  18.00]us (40.0000%)   22889054| ##############################
            [ 18.00 -  20.00]us (50.0000%)   19768797| ##########################
            [ 20.00 -  21.00]us (55.0000%)    8738984| ###########
            [ 21.00 -  22.00]us (60.0000%)    8702046| ###########
            [ 22.00 -  23.00]us (65.0000%)    8219478| ###########
            [ 23.00 -  25.00]us (70.0000%)   13427851| #################
            [ 25.00 -  27.00]us (75.0000%)    9817161| #############
            [ 27.00 -  28.00]us (77.5000%)    4026131| #####
            [ 28.00 -  29.00]us (80.0000%)    3600142| ####
            [ 29.00 -  31.00]us (82.5000%)    6114320| ########
            [ 31.00 -  33.00]us (85.0000%)    4856487| ######
            [ 33.00 -  35.00]us (87.5000%)    3776872| #####
            [ 35.00 -  37.00]us (88.7500%)    2916561| ###
            [ 37.00 -  39.00]us (90.0000%)    2327134| ###
             
            Magma:GET_META
            [  0.00 -  15.00]us (0.0000%)           1|
            [ 15.00 -  33.00]us (10.0000%)   24933018| ###################################
            [ 33.00 -  37.00]us (20.0000%)   27659672| #######################################
            [ 37.00 -  39.00]us (30.0000%)   14992550| #####################
            [ 39.00 -  41.00]us (40.0000%)   15372522| ######################
            [ 41.00 -  45.00]us (50.0000%)   30580630| ############################################
            [ 45.00 -  45.00]us (55.0000%)          0|
            [ 45.00 -  47.00]us (60.0000%)   13801443| ###################
            [ 47.00 -  49.00]us (65.0000%)   12080239| #################
            [ 49.00 -  51.00]us (70.0000%)   10275817| ##############
            [ 51.00 -  53.00]us (75.0000%)    8639969| ############
            [ 53.00 -  53.00]us (77.5000%)          0|
            [ 53.00 -  55.00]us (80.0000%)    7182416| ##########
            [ 55.00 -  55.00]us (82.5000%)          0|
            [ 55.00 -  57.00]us (85.0000%)    5895673| ########
            [ 57.00 -  59.00]us (87.5000%)    4790305| ######
            [ 59.00 -  61.00]us (88.7500%)    3884094| #####
            [ 61.00 -  63.00]us (90.0000%)    3155043| ####
             
            bg_load (200195320 total)
                   0us -    3us : (  0.0001%)      167
                   3us -    8us : ( 15.4150%) 30859959 #####
                   8us -    9us : ( 26.2742%) 21739563 ###
                   9us -   10us : ( 37.6359%) 22745641 ###
                  10us -   11us : ( 47.5433%) 19834146 ###
                  11us -   12us : ( 55.4577%) 15844259 ##
                  12us -   12us : ( 55.4577%)        0
                  12us -   13us : ( 61.9125%) 12922259 ##
                  13us -   14us : ( 67.6509%) 11487933 ##
                  14us -   15us : ( 73.0881%) 10885056 #
                  15us -   16us : ( 78.2455%) 10324826 #
                  16us -   16us : ( 78.2455%)        0
                  16us -   17us : ( 82.9068%)  9331629 #
                  17us -   17us : ( 82.9068%)        0
                  17us -   18us : ( 86.8507%)  7895522 #
                  18us -   19us : ( 89.9693%)  6243306 #
                  19us -   19us : ( 89.9693%)        0
                  19us -   20us : ( 92.3060%)  4678051
             
             bg_wait (200195320 total)
                   0us -    1us : (  0.0000%)        9
                   1us -    9us : ( 13.9306%) 27888376 ####
                   9us -   10us : ( 22.4575%) 17070440 ##
                  10us -   11us : ( 32.3701%) 19844622 ###
                  11us -   12us : ( 43.3733%) 22027818 ###
                  12us -   13us : ( 53.8355%) 20944885 ###
                  13us -   14us : ( 63.2961%) 18939702 ###
                  14us -   14us : ( 63.2961%)        0
                  14us -   15us : ( 71.1109%) 15644865 ##
                  15us -   15us : ( 71.1109%)        0
                  15us -   16us : ( 77.1788%) 12147619 ##
                  16us -   17us : ( 81.7857%)  9222777 #
                  17us -   17us : ( 81.7857%)        0
                  17us -   18us : ( 85.2620%)  6959391 #
                  18us -   18us : ( 85.2620%)        0
                  18us -   19us : ( 87.8841%)  5249372
                  19us -   20us : ( 89.8834%)  4002531
                  20us -   21us : ( 91.4266%)  3089343
            
            

            couchstore: http://perf.jenkins.couchbase.com/job/titan-xdcr-dev/70/console
            magma: http://perf.jenkins.couchbase.com/job/titan-xdcr-dev/82/console

            sarath Sarath Lakshman made changes -
            Component/s storage-engine [ 10175 ]
            bo-chun.wang Bo-Chun Wang made changes -
            Issue Type Task [ 3 ] Improvement [ 4 ]
            srinath.duvuru Srinath Duvuru made changes -
            Fix Version/s Neo [ 17615 ]
            Fix Version/s Morpheus [ 17651 ]
            sarath Sarath Lakshman added a comment - - edited

            By increasing the target nozzles to 8, we are able to generate 404758/sec XDCR throughput with magma. Note that the baseline for magma with target nozzles=4 is 278766/sec.
            http://perf.jenkins.couchbase.com/job/titan-xdcr-dev/95/parameters/

            Couchstore cluster is able to generate 469036/sec with target nozzles=4. With 8 nozzles Couchstore throughput should be significantly higher than magma.

            Shivani Gupta We can use an increased nozzle count (with CPU and other additional XDCR resources) as a workaround for now.

            drigby Dave Rigby added a comment -

            What performance do we see with couchstore and 8 nozzles?


            sarath Sarath Lakshman added a comment -

            Dave Rigby I didn't try a couchstore run with 8. But with target nozzles=16, Couchstore is at 708045/sec and magma at 544253/sec.

            drigby Dave Rigby added a comment -

            To expand on the above - if we are trying to measure the maximum throughput which can be achieved via XDCR with different storage backends, then the test should probably be calibrated to ensure that the bottleneck is KV-engine.

            As such, if 4 nozzles is insufficient to max out kv-engine (for magma), then it should be set at a higher value than 4 in general (for any storage backend) - possibly even higher until we see no further throughput improvements.

            drigby Dave Rigby added a comment -

            Sarath Lakshman thanks (I think our updates crossed).

            Based on those numbers for 16 nozzles, I think we should change the test to at least 16 nozzles so we are pushing kv-engine harder. Possibly we still want a 4-nozzle test if that's useful for XDCR, but it seems like we won't as easily detect regressions at the current nozzles=4.

            As an aside, is nozzles=4 still a good default? Is that something we already tell customers to increase?

            sarath Sarath Lakshman added a comment - - edited

            I have a prototype exposing the magma bloom filter through a Magma::KeyMayExist API and using that API to enable the KeyMayExist check for magma in the frontend threads. Throughput improved from 278766/s to 424707/s.

            We see the p50 GET_META latency drop from 45us to 10us with this change, which skips the bgFetch code path. This is roughly equivalent to the bg_load latency (12us) we observed in the prior run (saving the bgFetch cost). We may want to investigate whether the 35us for the bg fetch path is reasonable. I don't know if it is related to more frontend threads vs fewer bg fetch threads, and hence the frontend waiting for bg fetch threads. This needs to be validated.

            The following data is collected for "GET_META"
            [  0.00 -   3.00]us (0.0000%)        4532|
            [  3.00 -   6.00]us (10.0000%)   33703673| ############################################
            [  6.00 -   7.00]us (20.0000%)   22812220| #############################
            [  7.00 -   8.00]us (30.0000%)   18972811| ########################
            [  8.00 -   9.00]us (40.0000%)   15037567| ###################
            [  9.00 -  10.00]us (50.0000%)   12711962| ################
            [ 10.00 -  11.00]us (55.0000%)   11717598| ###############
            [ 11.00 -  12.00]us (60.0000%)   11238608| ##############
            [ 12.00 -  13.00]us (65.0000%)   10593831| #############
            [ 13.00 -  14.00]us (70.0000%)    9490448| ############
            [ 14.00 -  15.00]us (75.0000%)    8026767| ##########
            [ 15.00 -  16.00]us (77.5000%)    6497436| ########
            [ 16.00 -  16.00]us (80.0000%)          0|
            [ 16.00 -  17.00]us (82.5000%)    5172050| ######
            [ 17.00 -  19.00]us (85.0000%)    7659343| #########
            [ 19.00 -  20.00]us (87.5000%)    3021360| ###
            [ 20.00 -  21.00]us (88.7500%)    2690605| ###
            [ 21.00 -  22.00]us (90.0000%)    2422527| ###
            [ 22.00 -  23.00]us (91.2500%)    2178659| ##
            [ 23.00 -  24.00]us (92.5000%)    1916079| ##
            [ 24.00 -  26.00]us (93.7500%)    2995509| ###
            [ 26.00 -  27.00]us (94.3750%)    1136144| #
            [ 27.00 -  28.00]us (95.0000%)     936203| #
            [ 28.00 -  29.00]us (95.6250%)     772154| #
            [ 29.00 -  31.00]us (96.2500%)    1178621| #
            [ 31.00 -  35.00]us (96.8750%)    1460445| #
            [ 35.00 -  37.00]us (97.1875%)     457187|
            [ 37.00 -  41.00]us (97.5000%)     573703|
            [ 41.00 -  49.00]us (97.8125%)     475481|
            [ 49.00 -  67.00]us (98.1250%)     604837|
            [ 67.00 -  79.00]us (98.4375%)     669727|
            [ 79.00 -  87.00]us (98.5938%)     495717|
            [ 87.00 -  91.00]us (98.7500%)     232754|
            

            http://perf.jenkins.couchbase.com/job/titan-xdcr-dev/92/console

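            For illustration, a minimal sketch of the idea behind the prototype: a storage-level "key may exist" probe (e.g. a bloom-filter lookup) lets the frontend answer a metadata read for a definitely-absent key immediately, instead of queueing a background fetch. The sketch is in Go with stand-in types; the names and signatures are assumptions for illustration, not the actual Magma or kv-engine interfaces.

            // keymayexist_sketch.go - illustrative stand-ins only, not the real Magma/kv-engine APIs.
            package main

            import "fmt"

            // fakeStore stands in for the storage engine and exposes a bloom-filter-style probe.
            type fakeStore struct {
                onDisk map[string]string // key -> metadata persisted on disk
            }

            // keyMayExist mimics bloom-filter semantics: false means "definitely absent",
            // true means "possibly present" (a real filter may return false positives).
            func (s *fakeStore) keyMayExist(key string) bool {
                _, ok := s.onDisk[key]
                return ok
            }

            // diskFetch models the expensive background-fetch path.
            func (s *fakeStore) diskFetch(key string) (string, bool) {
                meta, ok := s.onDisk[key]
                return meta, ok
            }

            // getMeta is the frontend path: consult the filter first and only fall back
            // to the slow fetch when the key may actually be on disk.
            func getMeta(s *fakeStore, key string) (string, bool) {
                if !s.keyMayExist(key) {
                    return "", false // definitely absent: skip the bgFetch round trip
                }
                return s.diskFetch(key) // possibly present: pay the disk-read cost
            }

            func main() {
                s := &fakeStore{onDisk: map[string]string{"existing-doc": "meta-for-existing-doc"}}
                for _, k := range []string{"existing-doc", "new-doc"} {
                    meta, found := getMeta(s, k)
                    fmt.Printf("%s: found=%v meta=%q\n", k, found, meta)
                }
            }

            For the non-resident GET_META case this is where the latency saving comes from: the negative answer is produced on the frontend thread, so the request never waits on the background fetcher.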

            sarath Sarath Lakshman added a comment -

            Dave Rigby The goal here is not to measure the highest throughput for XDCR with magma/couchstore. Compared to the current standard XDCR test (target nozzles=4), magma XDCR throughput is lower due to higher GET_META latency. We are investigating whether we can achieve a number similar to couchstore's.
            drigby Dave Rigby added a comment - - edited

            If we have a throughput test in which kv-engine is being measured, I would assert that it should be calibrated so that kv-engine is essentially pegged one way or another. Otherwise, if, say, the kv-engine cost of SetWithMeta went up by 10% (we suddenly did something which took 10% more CPU), one would not necessarily notice: CPU would go up by 10% but throughput would be sustained.

            If the test is instead measuring the throughput of XDCR itself, then it should be the bottleneck - which given the speedup we get increasing nozzles, that doesn’t seem to be the case either.

            Essentially a max throughput test should be pegging at least one of the components under test; and that doesn’t appear to be the case here.

            jliang John Liang added a comment -

            > is nozzles=4 still a good default? Is that something we already tell customers to increase?

            We don't have a sizing formula for nozzles. For example, the number of nozzles can depend on the number of CPUs in the target cluster. If the default is too high, it could lead to CPU saturation or temporary failures, so I would rather keep it as it is. If the source cluster has a mutation rate higher than 80K per node, the customer can increase the number of nozzles one at a time until the desired throughput is reached.

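            As a practical note, the nozzle counts are per-replication settings that can be adjusted at runtime through the cluster REST API. The sketch below (in Go) shows roughly how that could be done; the replication ID, host, and credentials are placeholders, and the /settings/replications/ endpoint and the sourceNozzlePerNode/targetNozzlePerNode parameter names should be verified against the documentation for the release in use.

            // set_nozzles.go - rough sketch of raising XDCR nozzle counts for one replication.
            // Endpoint and parameter names are assumptions to be checked against the docs.
            package main

            import (
                "fmt"
                "net/http"
                "net/url"
                "strings"
            )

            func main() {
                // Placeholder replication ID: <remote cluster UUID>/<source bucket>/<target bucket>.
                replicationID := "REMOTE-CLUSTER-UUID/bucket-1/bucket-1"
                endpoint := "http://127.0.0.1:8091/settings/replications/" + url.PathEscape(replicationID)

                form := url.Values{}
                form.Set("targetNozzlePerNode", "8") // raise one step at a time and re-measure
                form.Set("sourceNozzlePerNode", "4")

                req, err := http.NewRequest(http.MethodPost, endpoint, strings.NewReader(form.Encode()))
                if err != nil {
                    panic(err)
                }
                req.SetBasicAuth("Administrator", "password") // placeholder credentials
                req.Header.Set("Content-Type", "application/x-www-form-urlencoded")

                resp, err := http.DefaultClient.Do(req)
                if err != nil {
                    panic(err)
                }
                defer resp.Body.Close()
                fmt.Println("status:", resp.Status)
            }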

            build-team Couchbase Build Team added a comment -

            Build couchbase-server-7.1.0-2168 contains magma commit 8680726 with commit message:
            MB-48834 magma: Introduce KeyMayExist API
            drigby Dave Rigby made changes -
            Epic Link MB-30659 [ 88207 ] MB-51282 [ 184825 ]

            People

              sarath Sarath Lakshman
              bo-chun.wang Bo-Chun Wang


                Gerrit Reviews

                  There is 1 open Gerrit change
