Details
- Type: Improvement
- Status: Open
- Priority: Major
- Resolution: Unresolved
- Version: 7.1.0
Description
I re-ran two existing XDCR tests with Magma. Compared to Couchstore, Magma performance is about 50% lower. I'm opening this ticket to track XDCR+Magma performance improvement. All runs used build 7.1.0-1401.
Avg. initial XDCR rate (items/sec), 1 -> 1 (2 source nozzles, 4 target nozzles), 1 bucket x 100M x 1KB
Storage | XDCR rate | Job |
---|---|---|
Couchstore | 141381 | http://perf.jenkins.couchbase.com/job/titan/12218/ |
Magma | 79794 | http://perf.jenkins.couchbase.com/job/titan/12214/ |
Avg. initial XDCR rate (items/sec), 5 -> 5 (2 source nozzles, 4 target nozzles), 1 bucket x 250M x 1KB
Storage | XDCR rate | Job |
---|---|---|
Couchstore | 619398 | http://perf.jenkins.couchbase.com/job/titan/12217/ |
Magma | 343931 | http://perf.jenkins.couchbase.com/job/titan/12215/ |
Attachments
- magma-105.png (43 kB)
Issue Links
- is duplicated by MB-48569 [Magma, 30TB, 1% DGM]: Indexer drain rate is extremely slow (Closed)
Gerrit Reviews
For Gerrit Dashboard: MB-48834
# | Subject | Branch | Project | Status | CR | V |
---|---|---|---|---|---|---|
168943,2 | MB-48834 util/file: Add support for sync_file_range | master | magma | NEW | 0 | +1 |
168903,9 | MB-48834 magma: Introduce KeyMayExist API | master | magma | MERGED | +2 | +1 |
Activity
I took a look at both tests, and I see similar behavior.
Avg. initial XDCR rate (items/sec), 1 -> 1 (2 source nozzles, 4 target nozzles), 1 bucket x 100M x 1KB
Storage | XDCR rate | Job |
---|---|---|
Couchstore | 141381 | http://perf.jenkins.couchbase.com/job/titan/12218/ |
Magma | 79794 | http://perf.jenkins.couchbase.com/job/titan/12214/ |
I see Couchstore has a higher DCP drain rate at the source (c1).
Both runs have a 100% resident ratio. However, the Magma run is reading data from disk while the Couchstore run isn't, which results in higher disk utilization in the Magma run.
At the destination (c2), the Magma run has bg wait time and the Couchstore run doesn't.
I'm assigning the ticket to the KV team so they can take a look at it.
Focusing on the "Avg. initial XDCR rate (items/sec), 1 -> 1 (2 source nozzles, 4 target nozzles), 1 bucket x 100M x 1KB" test, which is a single-node test:
Source (c1) - .105
Destination (c2) - .100
On the destination, the only WARNINGs we see are 25 Slow messages for 'Destroying closed unreferenced checkpoints'. All but 3 are < 1 second. The exceptions are:
2021-10-07T19:00:02.419289-07:00 WARNING (No Engine) Slow runtime for 'Destroying closed unreferenced checkpoints' on thread NonIoPool1: 1079 ms
2021-10-07T19:00:07.453254-07:00 WARNING (No Engine) Slow runtime for 'Destroying closed unreferenced checkpoints' on thread NonIoPool1: 5034 ms
2021-10-07T19:00:35.328182-07:00 WARNING (No Engine) Slow runtime for 'Destroying closed unreferenced checkpoints' on thread NonIoPool1: 5100 ms
Therefore the focus is on the source side.
Couchstore (node .105)
1024 backfills scheduled at T00:47:49
backfill complete - T00:57:23 to T00:59:24
Couchstore backfills take between 10 and 12 minutes
Magma (node .105)
1024 backfills scheduled at T18:44:24
backfill complete - T19:01:25 to T19:05:03
Magma backfills take between 17 and 21 minutes
Focusing on the Magma run.
Memory from the KV perspective for the source node is shown in the attached graph.
However, in memcached.log we repeatedly see the following message, a total of 30K times:
2021-10-07T18:44:25.932084-07:00 WARNING (bucket-1) MagmaKVStore::scan lookup->callback vb:144 key:<ud>cid:0x0:9e6655-000004112907</ud> returned cb::engine_errc::no_memory
...
2021-10-07T19:05:03.447289-07:00 WARNING (bucket-1) MagmaKVStore::scan lookup->callback vb:417 key:<ud>cid:0x0:9e6655-000096367441</ud> returned cb::engine_errc::no_memory
Update: after speaking to Ben Huddleston - these messages can be ignored; they just mean the DCP buffer is full.
The logging has been addressed in https://review.couchbase.org/c/kv_engine/+/166762 - Thanks Ben Huddleston
A final observation is that on the magma source (.105) we see a couple of very slow runtimes:
2021-10-07T18:41:34.374270-07:00 WARNING (No Engine) Slow runtime for 'Destroying closed unreferenced checkpoints' on thread NonIoPool4: 88 s
2021-10-07T18:43:50.265101-07:00 WARNING (No Engine) Slow runtime for 'Destroying closed unreferenced checkpoints' on thread NonIoPool1: 136 s
After discussing with James Harrison, we looked at the task runtimes and see the slowest CheckpointDestroyerTask[NonIO] is:
327ms - 5505ms : (100.0000%) 1
So it may be an issue with the reporting of Slow runtimes - however it warrants further investigation.
In summary, from the investigation so far it is reasonable to conclude that the slowdown is due to backfills taking nearly 2x longer with Magma.
Bo-Chun Wang Can we rerun the test with all graphs enabled (all the ones we generally run for magma perf tests)?
Do we have an XDCR magma test with a lower residence ratio and larger data density?
Are we seeing similar degradation in those tests?
For 100% in-memory tests, given the current magma design, degradation may be expected.
For magma we store key and value together in the seqIndex. Even if we can fetch the value from the kv-engine in-memory cache, the value read will happen from disk. So couchstore incurs only the cost of reading keys, while magma has to read both key and value. For lower-resident buckets, magma's I/O cost for fetching values should be lower than couchstore's.
Sarath Lakshman The degradation is 50%. If it is 100% in-memory, why does it need to fetch key and value from disk?
Even though all key-values are available in the kv-engine cache, the seqIndex has to be used to read in bySeqno order.
The degradation is a problem we need to think through more. The in-memory lookup / skipping the value read is not something we considered in the design. The degradation could happen due to the layout of kv-pairs and the index stored on disk.
For the magma seqIndex, we pack 4KB worth of kv-pairs into an sstable data block; index blocks then point to the data blocks. When we do a bySeqno iteration, we have to read all the data blocks even if we do not use the values. The seqIndex iterator has to return key and metadata, and since we store key, meta and value contiguously, the read I/O is unavoidable. For couchstore, values are stored separately, hence it can optionally skip the extra read I/O if we do not want the value. We may have to think about some index design changes to overcome this problem, but it is likely a difficult problem to solve.
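To make the layout difference concrete, here is a minimal, hypothetical sketch (not magma's or couchstore's actual on-disk format) of why a key+meta-only bySeqno scan still pays for values when they are packed inline in each data block:

// Sketch only: two possible in-block layouts for a seq-index entry.
#include <cstdint>
#include <string>
#include <vector>

struct PackedEntry {          // "magma-style": everything inline in the 4KB block
    std::uint64_t seqno;
    std::string key;
    std::string meta;
    std::string value;        // loaded from disk even if the caller ignores it
};

struct KeyOnlyEntry {         // "couchstore-style" seq index: value lives elsewhere
    std::uint64_t seqno;
    std::string key;
    std::string meta;
    std::uint64_t valueOffset;  // follow this pointer only when the value is needed
};

// A key+meta-only backfill over the packed layout still pays the I/O of loading
// the whole data block (values included); over the key-only layout it does not.
template <class Entry, class Fn>
void scanBySeqno(const std::vector<Entry>& block, Fn&& cb) {
    for (const auto& e : block) {
        cb(e.seqno, e.key, e.meta);   // value (if present) was read but unused
    }
}

The point of the sketch is only that the I/O cost is decided by the on-disk packing, not by whether the caller asks for the value.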
I have re-run the tests and collected kvstats. All runs used build 7.1.0-1885. Couchstore has better performance in both tests. Note that there is a regression in XDCR tests (MB-50016), so the numbers are lower than the previous ones.
Avg. initial XDCR rate (items/sec), 5 -> 5 (2 source nozzles, 4 target nozzles), 1 bucket x 1G x 1KB, DGM
Storage | XDCR rate | Job |
---|---|---|
Magma | 236860 | http://perf.jenkins.couchbase.com/job/titan-xdcr-dev/67/ |
Couchstore | 469036 | http://perf.jenkins.couchbase.com/job/titan-xdcr-dev/70/ |
Avg. initial XDCR rate (items/sec), 5 -> 5 (2 source nozzles, 4 target nozzles), 1 bucket x 250M x 1KB
Storage | XDCR rate | Job |
---|---|---|
Magma | 227256 | http://perf.jenkins.couchbase.com/job/titan-xdcr-dev/69/ |
Couchstore | 358389 | http://perf.jenkins.couchbase.com/job/titan-xdcr-dev/68/ |
Bo-Chun Wang Can we run a variant of the experiment (1 bucket x 1G x 1KB, DGM) with the source bucket as couchstore and the destination bucket as magma?
I finished a run. The source bucket is using couchstore, and the destination bucket is using magma. The result is similar to the run using magma for both buckets.
Avg. initial XDCR rate (items/sec), 5 -> 5 (2 source nozzles, 4 target nozzles), 1 bucket x 1G x 1KB, DGM
Storage | XDCR rate | Job |
---|---|---|
Magma -> Magma | 236860 | http://perf.jenkins.couchbase.com/job/titan-xdcr-dev/67/ |
Couchstore -> Couchstore | 469036 | http://perf.jenkins.couchbase.com/job/titan-xdcr-dev/70/ |
Couchstore -> Magma | 269379 | http://perf.jenkins.couchbase.com/job/titan-xdcr-dev/71/ |
Magma -> Couchstore | 494571 | http://perf.jenkins.couchbase.com/job/titan-xdcr-dev/72/ |
Thanks Bo-Chun Wang. Can we do a run with source=magma and dest=couchstore as well?
Bo-Chun Wang Is there a similar test for optimistic replication? If so, can we also do a run with couchstore-magma on optimistic replication? Thanks.
We don't have DGM tests for optimistic replication. I will re-run this non-DGM test with couchstore-magma.
For normal replication, XDCR performs a read (before the write) on every mutation. For optimistic replication, it only performs the write. So this is just to see if there is any difference; a sketch of the two paths is below.
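For illustration, a hedged sketch of the two target-write paths being compared; the names and types are invented for the example, not the actual goxdcr/kv_engine code:

#include <cstdint>
#include <map>
#include <optional>
#include <string>
#include <utility>

struct Meta { std::uint64_t revId = 0; std::uint64_t cas = 0; };
struct Mutation { std::string key; std::string value; Meta meta; };

struct Target {
    std::map<std::string, std::pair<Meta, std::string>> docs;

    std::optional<Meta> getMeta(const std::string& key) const {  // the GET_META path
        auto it = docs.find(key);
        if (it == docs.end()) {
            return std::nullopt;
        }
        return it->second.first;
    }

    void setWithMeta(const Mutation& m) {  // the SET_WITH_META path
        docs[m.key] = {m.meta, m.value};
    }
};

enum class ReplicationMode { Normal, Optimistic };

// Normal replication does a read (getMeta) before every write so the source side
// can resolve the conflict; optimistic replication goes straight to setWithMeta
// and lets the target resolve it.
void replicate(ReplicationMode mode, const Mutation& m, Target& target) {
    if (mode == ReplicationMode::Normal) {
        if (auto existing = target.getMeta(m.key)) {
            if (existing->revId >= m.meta.revId) {
                return;  // target copy wins; skip the write
            }
        }
    }
    target.setWithMeta(m);
}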
Comparison between magma and couchstore as destination
http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=titan_c2_710-1885_init_xdcr_2773&label=magma_dst&snapshot=titan_c2_710-1885_init_xdcr_389b&label=couch_dst
The following plot gives a good explanation of why the couchstore destination is fast.
There is plenty of free memory and the couchstore data files are 100% cached. Hence, reads performed during btree writes do not incur any I/O.
Magma uses direct I/O during writes, and hence it requires an I/O the first time a block is read. But I suspect magma is not taking advantage of the page cache as aggressively as it could: even after the test runs for the entire duration, the amount of data cached is very low. I will investigate this further.
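As a minimal illustration of the direct-I/O point (plain POSIX on Linux, not magma code): an O_DIRECT write bypasses the kernel page cache, so the first later read of that block is a real disk read, whereas a buffered write leaves the block cached.

#ifndef _GNU_SOURCE
#define _GNU_SOURCE   // O_DIRECT is a Linux extension
#endif
#include <fcntl.h>
#include <unistd.h>
#include <cstddef>
#include <cstdlib>
#include <cstring>

int main() {
    constexpr std::size_t kBlock = 4096;  // O_DIRECT needs block-aligned buffers/sizes
    void* buf = nullptr;
    if (posix_memalign(&buf, kBlock, kBlock) != 0) {
        return 1;
    }
    std::memset(buf, 'x', kBlock);

    // Direct I/O: data goes straight to the device, bypassing the page cache,
    // so a subsequent read of this block hits the disk.
    int direct = open("direct.dat", O_CREAT | O_WRONLY | O_DIRECT, 0644);
    if (direct >= 0) {
        ssize_t rc = write(direct, buf, kBlock);
        (void)rc;
        close(direct);
    }

    // Buffered I/O: the same block stays in the kernel page cache, so a later
    // read can be served from memory.
    int buffered = open("buffered.dat", O_CREAT | O_WRONLY, 0644);
    if (buffered >= 0) {
        ssize_t rc = write(buffered, buf, kBlock);
        (void)rc;
        close(buffered);
    }

    free(buf);
    return 0;
}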
Avg. initial XDCR rate (items/sec), 1 -> 1 (2 source nozzles, 4 target nozzles), 1 bucket x 100M x 1KB, Optimistic
Storage | XDCR rate | Job |
---|---|---|
Couchstore | 149460 | http://perf.jenkins.couchbase.com/job/titan-xdcr-dev/73/ |
Magma | 46902 | http://perf.jenkins.couchbase.com/job/titan-xdcr-dev/74/ |
Sarath Lakshman If the traffic is sequential in order (for both seqno and doc key), it won't require a lot of page caching, right? Also, note that it is a backfill, so there is no data at the target cluster to begin with.
The source cluster (dcp/backfill) is doing well as it is a sequential read. For magma, the problem is at the destination. Every document write operation requires a disk lookup to maintain the count. In the case of couchstore, 100% of the disk blocks are cached in the page cache as there is plenty of memory. For magma, the read I/Os are slowing down the writes.
In this case, since all operations are inserts, we may not be doing disk lookups as the bloom filter helps there. But the compactions are incurring read I/Os.
For the magma destination, the write queue is not building up. That indicates mutations aren't arriving at the storage engine at a high enough rate.
For magma, I noticed bg fetches happening on the destination cluster. But for couchstore, there are no bg fetches happening.
This appears to be related to the bloom filter available in kv-engine for couchstore. For a set_with_meta / get_meta operation, couchstore returns not-found immediately by checking the bloom filter. In the case of magma, it queues a bg fetch to find out that an item does not exist (internally, the bg fetch results in checking the magma bloom filter). The extra bg fetches result in lower XDCR throughput on the destination cluster. This cluster is running value-only eviction. (A rough sketch of the two paths follows the code links below.)
Daniel Owen For value-only eviction, we can avoid any bg fetch for reading doc metadata, right?
Looking at the code, we do a value-eviction check for the get API, but not in all cases for the setWithMeta and getMeta APIs.
https://github.com/couchbase/kv_engine/blob/master/engines/ep/src/vbucket.cc#L2926
https://github.com/couchbase/kv_engine/blob/master/engines/ep/src/vbucket.cc#L2008
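As mentioned above, here is a rough sketch of the two miss paths under value eviction. It is illustrative only (the real logic lives in the vbucket.cc links above); the function and type names are invented for the example:

#include <functional>
#include <string>

enum class Backend { Couchstore, Magma };

struct MetaLookupOutcome {
    bool completedOnFrontend;  // answered on the frontend thread, no bg fetch
    bool bgFetchQueued;        // a background fetch was scheduled
};

// For a key that misses the in-memory hash table under value eviction:
// with couchstore, the ep-engine bloom filter (rebuilt at compaction, covering
// tombstones) can answer "definitely not on disk" on the frontend thread; with
// magma there is no such filter here, so a bgFetch is queued and a background
// thread asks the storage engine.
MetaLookupOutcome getMetaOnHashTableMiss(
        Backend backend,
        const std::string& key,
        const std::function<bool(const std::string&)>& bloomMayExist,
        const std::function<void(const std::string&)>& queueBgFetch) {
    if (backend == Backend::Couchstore && !bloomMayExist(key)) {
        return {true, false};  // definitely not on disk: reply not-found immediately
    }
    queueBgFetch(key);         // magma path (or the bloom filter said "maybe")
    return {false, true};
}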
Hi Sarath Lakshman,
Many thanks for your analysis
I agree that for value-only eviction we should not require a bg fetch, as we should just be able to examine the hash table.
Feel free to assign back to me.
Changing the component from storage_engine to couchbase_bucket.
I synced up with Dave Rigby.
In general we only keep alive items in the hash table (deleted items can be present temporarily if someone requests a deleted doc's metadata).
So in the getMeta case Sarath mentions, even with value eviction, if the item isn't resident in the HT we must go to disk (which can potentially be skipped if the bloom filter tells us there's no such tombstone for that key). See for example https://github.com/couchbase/kv_engine/blob/f9016f1b4acc2dfd1ef911e8a7424fefd95fd0f1/engines/ep/src/vbucket.cc#L2911
where we return whether it is deleted or not (potentially after a bgfetch, when we call getMeta a second time).
Sarath Lakshman do you agree that it's worth having a delete-only bloom filter in ep-engine for magma value eviction?
thanks
Thanks Daniel Owen.
If I understand correctly, to avoid a bg fetch on non-existent keys with value-only eviction, we need to keep a bloom filter to address the special case of deleted docs. Does "deleted doc" mean a tombstone document?
In this specific XDCR test case, setWithMeta is the operation triggering the bgFetch.
For couchstore, we rebuild the bloom filter every time a full compaction happens. For magma, when a logically deleted doc is removed we would have to remove it from the bloom filter as well, but a bloom filter does not support a remove operation. Since magma does not have periodic full compaction, we may not be able to rebuild the bloom filter in KV-Engine.
Magma internally maintains a bloom filter per sstable for the key-existence check. We could expose this through a Magma KeyMayExist API that only checks the in-memory bloom filters without any I/O. Essentially, before we queue a bgFetch, we would check against this bloom filter to respond not-found. I wonder whether directly exposing this API and avoiding the bgFetch queueing code path would help improve the throughput.
Sarath Lakshman Does the Magma bloom filter track logically deleted (tombstones) keys, or just alive keys?
If it's the latter, then I don't think exposing an API to query it directly would necessarily make much difference - the issue with SetWithMeta (and GetMeta) is that even if the item being compared has been deleted, we need to compare CAS (or revId), as the deleted item could still be the "winning" mutation. As such, it's not valid to check the bloom filter only to see if an alive item exists, as we also need deleted ones.
Got it. So the logically deleted document is also a party to conflict resolution / the CAS comparison.
The magma bloom filter tracks the existence of a doc key (it can be either a tombstone or an alive key) - not just the alive keys.
The KeyMayExist API will have to do a lookup in the memtable (we do not maintain a bloom filter for the memtable), then a binary search in the sstable lists at each level. This is certainly more time-consuming than the kv-engine full-key-range bloom filter. These checks already happen inside magma::Get() during every bgFetch to identify the sstable to read. Adding this API will only help in skipping the bgFetch code path for the non-existent-key check. We can give it a try.
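A rough sketch, under stated assumptions, of what such an in-memory-only existence check could look like; the types and members are hypothetical, not magma's actual implementation:

#include <cstddef>
#include <functional>
#include <string>
#include <unordered_set>
#include <vector>

// Stand-in for a per-sstable bloom filter: "true" means the key *may* exist,
// "false" means it definitely does not.
struct BloomFilter {
    std::unordered_set<std::size_t> bits;
    bool mayContain(const std::string& key) const {
        return bits.count(std::hash<std::string>{}(key)) != 0;
    }
};

struct SSTable { BloomFilter bloom; };
struct Level { std::vector<SSTable> tables; };

struct LsmSketch {
    std::unordered_set<std::string> memtable;  // recent keys; no bloom filter here
    std::vector<Level> levels;                 // LSM levels, newest first

    // In-memory-only check, no disk I/O. In a real LSM store the candidate
    // sstable per level would be found by searching key ranges rather than
    // scanning every table, as described above.
    bool keyMayExist(const std::string& key) const {
        if (memtable.count(key) != 0) {
            return true;                       // present in memory
        }
        for (const auto& level : levels) {
            for (const auto& sst : level.tables) {
                if (sst.bloom.mayContain(key)) {
                    return true;               // may exist on disk: do the bgFetch
                }
            }
        }
        return false;                          // definitely absent: skip the bgFetch
    }
};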
Ok, so it sounds like Magma's bloom filter behaves like the ep-engine bloom filter in full-eviction mode, assuming the resident ratio is below bfilter_residency_threshold (which defaults to 10%):
- see BloomFilterCallback::callback():
if (store.getItemEvictionPolicy() == EvictionPolicy::Value) {
    /**
     * VALUE-ONLY EVICTION POLICY
     * Consider deleted items only.
     */
    if (isDeleted) {
        vb->addToTempFilter(key);
    }
} else {
    /**
     * FULL EVICTION POLICY
     * If vbucket's resident ratio is found to be less than
     * the residency threshold, consider all items, otherwise
     * consider deleted and non-resident items only.
     */
    bool residentRatioLessThanThreshold =
            vb->isResidentRatioUnderThreshold(
                    store.getBfiltersResidencyThreshold());
    if (residentRatioLessThanThreshold) {
        vb->addToTempFilter(key);
    } else {
        if (isDeleted || !store.isMetaDataResident(vb, key)) {
            vb->addToTempFilter(key);
        }
    }
}
As you point out, there isn't the opportunity with Magma to rebuild the ep-engine level Bloom filters during full compaction, so it's not feasible to re-enable the ep-engine value-eviction delete-only Bloom filter when Magma is being used.
One slightly surprising thing is the slowdown seen when performing the Magma bgFetch for the deleted item. Assuming the key being looked up truly doesn't exist (which I'm assuming based on the workload being called "initial load"), I would expect that Magma should be able to return that information quickly if it is just checking memtables and (pinned, IIRC?) bloom filter blocks for each SST file. Yes, there is the overhead of queuing and managing the BGFetch task, but I wouldn't expect that to be too large.
If you have the logs to hand, you could look at the bg_wait and bg_load histograms on the destination cluster - those are the times each BGFetcher task was waiting to run, and then how long it ran for. Assuming we won't be able to reduce the BGLoad time much with a new API (it'll be doing essentially the same work, just without the task management), we should see how BGLoad compares to it.
Looking at the stats, get_meta takes the most time compared to couchstore (avg 2us for couchstore vs 45us for magma), with the contribution from bg_wait + bg_load being avg 15us + 13us. I agree: unless we are able to add a much more lightweight KeyMayExist to magma which avoids some of the allocations and expensive work in magma::Get, as well as the magma-kvstore allocations, we won't see much difference. We could potentially save only the bg_wait time.
Couchstore:GET_META
[ 0.00 - 0.00]us (0.0000%) 38|
[ 0.00 - 1.00]us (10.0000%) 94874274| ############################################
[ 1.00 - 1.00]us (20.0000%) 0|
[ 1.00 - 1.00]us (30.0000%) 0|
[ 1.00 - 1.00]us (40.0000%) 0|
[ 1.00 - 2.00]us (50.0000%) 73288014| #################################
[ 2.00 - 2.00]us (55.0000%) 0|
[ 2.00 - 2.00]us (60.0000%) 0|
[ 2.00 - 2.00]us (65.0000%) 0|
[ 2.00 - 2.00]us (70.0000%) 0|
[ 2.00 - 2.00]us (75.0000%) 0|
[ 2.00 - 2.00]us (77.5000%) 0|
[ 2.00 - 2.00]us (80.0000%) 0|
[ 2.00 - 2.00]us (82.5000%) 0|
[ 2.00 - 3.00]us (85.0000%) 18531107| ########
[ 3.00 - 3.00]us (87.5000%) 0|
[ 3.00 - 3.00]us (88.7500%) 0|
[ 3.00 - 3.00]us (90.0000%) 0|
Couchstore:SET_WITH_META
[ 0.00 - 7.00]us (0.0000%) 1804|
[ 7.00 - 14.00]us (10.0000%) 32829907| ############################################
[ 14.00 - 15.00]us (20.0000%) 18193265| ########################
[ 15.00 - 16.00]us (30.0000%) 12680040| ################
[ 16.00 - 18.00]us (40.0000%) 22889054| ##############################
[ 18.00 - 20.00]us (50.0000%) 19768797| ##########################
[ 20.00 - 21.00]us (55.0000%) 8738984| ###########
[ 21.00 - 22.00]us (60.0000%) 8702046| ###########
[ 22.00 - 23.00]us (65.0000%) 8219478| ###########
[ 23.00 - 25.00]us (70.0000%) 13427851| #################
[ 25.00 - 27.00]us (75.0000%) 9817161| #############
[ 27.00 - 28.00]us (77.5000%) 4026131| #####
[ 28.00 - 29.00]us (80.0000%) 3600142| ####
[ 29.00 - 31.00]us (82.5000%) 6114320| ########
[ 31.00 - 33.00]us (85.0000%) 4856487| ######
[ 33.00 - 35.00]us (87.5000%) 3776872| #####
[ 35.00 - 37.00]us (88.7500%) 2916561| ###
[ 37.00 - 39.00]us (90.0000%) 2327134| ###
Magma:GET_META
[ 0.00 - 15.00]us (0.0000%) 1|
[ 15.00 - 33.00]us (10.0000%) 24933018| ###################################
[ 33.00 - 37.00]us (20.0000%) 27659672| #######################################
[ 37.00 - 39.00]us (30.0000%) 14992550| #####################
[ 39.00 - 41.00]us (40.0000%) 15372522| ######################
[ 41.00 - 45.00]us (50.0000%) 30580630| ############################################
[ 45.00 - 45.00]us (55.0000%) 0|
[ 45.00 - 47.00]us (60.0000%) 13801443| ###################
[ 47.00 - 49.00]us (65.0000%) 12080239| #################
[ 49.00 - 51.00]us (70.0000%) 10275817| ##############
[ 51.00 - 53.00]us (75.0000%) 8639969| ############
[ 53.00 - 53.00]us (77.5000%) 0|
[ 53.00 - 55.00]us (80.0000%) 7182416| ##########
[ 55.00 - 55.00]us (82.5000%) 0|
[ 55.00 - 57.00]us (85.0000%) 5895673| ########
[ 57.00 - 59.00]us (87.5000%) 4790305| ######
[ 59.00 - 61.00]us (88.7500%) 3884094| #####
[ 61.00 - 63.00]us (90.0000%) 3155043| ####
bg_load (200195320 total)
0us - 3us : ( 0.0001%) 167
3us - 8us : ( 15.4150%) 30859959 #####
8us - 9us : ( 26.2742%) 21739563 ###
9us - 10us : ( 37.6359%) 22745641 ###
10us - 11us : ( 47.5433%) 19834146 ###
11us - 12us : ( 55.4577%) 15844259 ##
12us - 12us : ( 55.4577%) 0
12us - 13us : ( 61.9125%) 12922259 ##
13us - 14us : ( 67.6509%) 11487933 ##
14us - 15us : ( 73.0881%) 10885056 #
15us - 16us : ( 78.2455%) 10324826 #
16us - 16us : ( 78.2455%) 0
16us - 17us : ( 82.9068%) 9331629 #
17us - 17us : ( 82.9068%) 0
17us - 18us : ( 86.8507%) 7895522 #
18us - 19us : ( 89.9693%) 6243306 #
19us - 19us : ( 89.9693%) 0
19us - 20us : ( 92.3060%) 4678051
bg_wait (200195320 total)
0us - 1us : ( 0.0000%) 9
1us - 9us : ( 13.9306%) 27888376 ####
9us - 10us : ( 22.4575%) 17070440 ##
10us - 11us : ( 32.3701%) 19844622 ###
11us - 12us : ( 43.3733%) 22027818 ###
12us - 13us : ( 53.8355%) 20944885 ###
13us - 14us : ( 63.2961%) 18939702 ###
14us - 14us : ( 63.2961%) 0
14us - 15us : ( 71.1109%) 15644865 ##
15us - 15us : ( 71.1109%) 0
15us - 16us : ( 77.1788%) 12147619 ##
16us - 17us : ( 81.7857%) 9222777 #
17us - 17us : ( 81.7857%) 0
17us - 18us : ( 85.2620%) 6959391 #
18us - 18us : ( 85.2620%) 0
18us - 19us : ( 87.8841%) 5249372
19us - 20us : ( 89.8834%) 4002531
20us - 21us : ( 91.4266%) 3089343
couchstore: http://perf.jenkins.couchbase.com/job/titan-xdcr-dev/70/console
magma: http://perf.jenkins.couchbase.com/job/titan-xdcr-dev/82/console
By increasing the target nozzles to 8, we are able to generate 404758/sec XDCR throughput with magma. Note that the baseline for magma with 4 target nozzles is 278766/sec.
http://perf.jenkins.couchbase.com/job/titan-xdcr-dev/95/parameters/
The Couchstore cluster is able to generate 469036/sec with target nozzles=4. With 8 nozzles, Couchstore throughput should be significantly higher than magma's.
Shivani Gupta We can use an increased nozzle count (with the additional CPU and other XDCR resources that implies) as a workaround for now.
Dave Rigby I didn't try a couchstore run with 8 nozzles. But with target nozzles=16, Couchstore is at 708045/sec and magma at 544253/sec.
To expand on the above - if we are trying to measure the maximum throughput which can be achieved via XDCR with different storage backends, then the test should probably be calibrated to ensure that the bottleneck is KV-engine.
As such, if 4 nozzles is insufficient to max out kv-engine (for magma), then it should be set at a higher value than 4 in general (for any storage backend) - possibly even higher until we see no further throughput improvements.
Sarath Lakshman thanks (I think our updates crossed).
Based on those numbers for 16 nozzles, I think we should change the test to at least 16 nozzles so we are pushing kv-engine harder. Possibly we still want a 4-nozzle test if that's useful for XDCR, but it seems like we won't as easily detect regressions at the current nozzles=4.
As an aside, is nozzles=4 still a good default? Is that something we already tell customers to increase?
I have a prototype exposing the magma bloom filter through a Magma::KeyMayExist API, and it uses that API to enable the KeyMayExist check for magma in the frontend threads. Throughput improved from 278766/s to 424707/s.
We see p50 GET_META latency drop from 45us to 10us with this change, which skips the bgFetch code path. This is roughly equivalent to the bg_load latency (12us) we observed in the prior run (i.e. we save the bgFetch overhead). We may want to investigate whether the 35us for the bg fetch path is reasonable. I don't know if it is related to having more frontend threads than bg fetch threads, and hence frontend threads waiting on bg fetch threads. This needs to be validated.
The following data is collected for "GET_META":
[ 0.00 - 3.00]us (0.0000%) 4532|
[ 3.00 - 6.00]us (10.0000%) 33703673| ############################################
[ 6.00 - 7.00]us (20.0000%) 22812220| #############################
[ 7.00 - 8.00]us (30.0000%) 18972811| ########################
[ 8.00 - 9.00]us (40.0000%) 15037567| ###################
[ 9.00 - 10.00]us (50.0000%) 12711962| ################
[ 10.00 - 11.00]us (55.0000%) 11717598| ###############
[ 11.00 - 12.00]us (60.0000%) 11238608| ##############
[ 12.00 - 13.00]us (65.0000%) 10593831| #############
[ 13.00 - 14.00]us (70.0000%) 9490448| ############
[ 14.00 - 15.00]us (75.0000%) 8026767| ##########
[ 15.00 - 16.00]us (77.5000%) 6497436| ########
[ 16.00 - 16.00]us (80.0000%) 0|
[ 16.00 - 17.00]us (82.5000%) 5172050| ######
[ 17.00 - 19.00]us (85.0000%) 7659343| #########
[ 19.00 - 20.00]us (87.5000%) 3021360| ###
[ 20.00 - 21.00]us (88.7500%) 2690605| ###
[ 21.00 - 22.00]us (90.0000%) 2422527| ###
[ 22.00 - 23.00]us (91.2500%) 2178659| ##
[ 23.00 - 24.00]us (92.5000%) 1916079| ##
[ 24.00 - 26.00]us (93.7500%) 2995509| ###
[ 26.00 - 27.00]us (94.3750%) 1136144| #
[ 27.00 - 28.00]us (95.0000%) 936203| #
[ 28.00 - 29.00]us (95.6250%) 772154| #
[ 29.00 - 31.00]us (96.2500%) 1178621| #
[ 31.00 - 35.00]us (96.8750%) 1460445| #
[ 35.00 - 37.00]us (97.1875%) 457187|
[ 37.00 - 41.00]us (97.5000%) 573703|
[ 41.00 - 49.00]us (97.8125%) 475481|
[ 49.00 - 67.00]us (98.1250%) 604837|
[ 67.00 - 79.00]us (98.4375%) 669727|
[ 79.00 - 87.00]us (98.5938%) 495717|
[ 87.00 - 91.00]us (98.7500%) 232754|
http://perf.jenkins.couchbase.com/job/titan-xdcr-dev/92/console
Dave Rigby The goal here is not to measure the highest throughput for XDCR with magma/couchstore. Compared to the current standard XDCR test (target nozzles=4), magma XDCR throughput is lower due to higher GET_META latency. We are investigating whether we can achieve a similar number to couchstore.
If we have a throughput test in which kv-engine is being measured, I would assert that it should be calibrated so kv-engine is essentially pegged one way or another. Otherwise, if say the kv-engine cost for SetWithMeta went up by 10% (we suddenly did something which took 10% more CPU), then one would not necessarily notice: CPU would go up by 10% but throughput would be sustained.
If the test is instead measuring the throughput of XDCR itself, then XDCR should be the bottleneck - which, given the speedup we get by increasing nozzles, doesn't seem to be the case either.
Essentially, a max-throughput test should be pegging at least one of the components under test, and that doesn't appear to be the case here.
> is nozzles=4 still a good default? Is that something we already tell customers to increase?
We don't have a sizing formula for nozzles. For example, the number of nozzles can depend on the number of CPUs at the target cluster. If the default is too high, it could lead to CPU saturation or temp failures, so I would rather keep it as it is. If the source cluster has a mutation rate higher than 80K per node, the customer can increase the number of nozzles one at a time until it reaches the desired throughput.
Build couchbase-server-7.1.0-2168 contains magma commit 8680726 with commit message:
MB-48834 magma: Introduce KeyMayExist API
It is a 100% resident ratio. XDCR works off the DCP and memcached APIs, so it is unlikely to be an XDCR issue.