Test the performance of cbdatarecovery on magma buckets

Description

The recovery of Magma buckets was enabled by this change: https://couchbasecloud.atlassian.net/browse/MB-49475.

We should test the memory and CPU usage when recovering a Magma bucket.

I have already tested this on my local machine with a Magma bucket that contains 1 shard and another that contains 8 shards. The test results are attached to this ticket, along with the script used to gather the data. NOTE: the script `mem_count_local_mac.bash` has been edited to work specifically on macOS; if on Linux, use `mem_count_linux.bash`.
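The attached scripts are not reproduced in the ticket itself. As a rough idea of the kind of sampling they perform, a minimal sketch of a Linux memory/CPU collection loop (assuming we poll `ps` for the `cbdatarecovery` process; this may not match the attached scripts exactly) could look like this:

```bash
#!/usr/bin/env bash
# Minimal sketch of a memory/CPU sampling loop (not the attached script).
# Assumes Linux `ps`; the process name, interval, and output file are placeholders.

PROCESS_NAME="${1:-cbdatarecovery}"
INTERVAL_SECONDS="${2:-1}"
OUTFILE="${3:-usage.csv}"

echo "timestamp,rss_kb,cpu_percent" > "$OUTFILE"
while true; do
    # -C selects processes by command name; rss is resident memory in KB, %cpu is CPU usage.
    ps -C "$PROCESS_NAME" -o rss=,%cpu= | while read -r rss cpu; do
        echo "$(date -u +%Y-%m-%dT%H:%M:%SZ),${rss},${cpu}" >> "$OUTFILE"
    done
    sleep "$INTERVAL_SECONDS"
done
```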

Further testing needs to be done with clusters that have around 100GB of data on them. The testing steps are as follows (a command-level sketch is included after the list):

  1. Spin up Couchbase Server on an AWS instance.

  2. Create a Magma bucket and, using `cbc-pillowfight`, generate 100GB of data in that bucket.

  3. Run the script for collecting memory and CPU data (attached) and then run `cbdatarecovery`.

  4. Collect the data.
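A rough command-level sketch of steps 2 and 3 follows. HOST, USER, PASS, and the bucket name are placeholders; the item count and size are chosen to total roughly 100GB; and the exact `cbdatarecovery` arguments depend on the environment, so they are left as a placeholder:

```bash
# Step 2: populate the Magma bucket with ~100GB of data using cbc-pillowfight
# (100M items of ~1KB each; host, bucket name, and credentials are placeholders).
cbc-pillowfight -U couchbase://HOST/magma-bucket -u USER -P PASS \
    --populate-only -I 100000000 -m 1024 -M 1024 -t 8

# Step 3: start the memory/CPU collection script (attached to this ticket),
# then run the recovery. Check `cbdatarecovery --help` for the flags required
# for your data directory and target cluster.
./mem_count_linux.bash &
cbdatarecovery <arguments for your data directory and target cluster>
```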

We should do this for a Magma bucket with 1 shard and with 8 shards, then compare the results with the local run. We should determine whether the amount of data on Magma affects the performance of `cbdatarecovery` and, if it does, look for ways to avoid or mitigate the slowdown.

Components

Affects versions

Fix versions

Labels

Environment

None

Release Notes Description

None

Attachments

12

Activity


Safian Ali January 9, 2024 at 12:08 PM

Created a follow-up ticket to handle this once the underlying issue is fixed.

Apaar Gupta January 8, 2024 at 5:20 PM
Edited

This is a known issue with SetWithMeta in Magma, first seen with XDCR in MB-48834. It is caused by kv_engine not querying Magma's bloomfilters during the get performed to retrieve meta during SetWithMeta, which results in kv_engine performing bg_fetches. Couchstore does not have this issue, since its bloomfilters are maintained by kv_engine, which avoids IO if the document does not exist.

An API was implemented for kv_engine to query Magma's bloomfilter, which resulted in GET_META latency dropping from 45us to 10us. This API has to be used by kv_engine to avoid the costly fetch.

I am not sure of the status of the improvement; pinging for an update.

Safian Ali January 8, 2024 at 4:13 PM
Edited

Assigned this ticket to storage-engine as this appears to be a Magma issue.

 

I’m working on testing the performance of cbdatarecovery with Magma. I’ve found that using SetWithMeta against a Magma bucket is much slower than using Set against Magma, or than either operation against Couchstore. See the graphs attached to this ticket; the table below summarises the results. All the tests below were done with a 10GB random data set created using pillowfight (10M items of 1KB each). In all cases, an empty bucket was created to restore to (i.e. no conflict resolution).

 

| Set method used | Storage engine on the cluster | Time taken | Logs start timestamp |
|---|---|---|---|
| Set | Couchstore | 5m 12s | 2024-01-08T12:11:34+00:00 |
| Set | Magma | 5m 17s | 2024-01-08T12:20:18+00:00 |
| SetWithMeta | Couchstore | 5m 21s | 2024-01-02T12:13:44+00:00 |
| SetWithMeta | Magma | 23m 44s | 2024-01-02T12:24:22+00:00 |

Is this a known issue? Thanks

Safian Ali January 8, 2024 at 1:32 PM

The difference in performance between the backup tools and pillowfight seems to be caused by backup using SetWithMeta instead of Set. When the backup code is modified to always use Set (here), Magma and Couchstore performance is the same. Further evidence of this can be seen in “set_with_meta_latency_couchstore_vs_magma.png”, which has the Couchstore test latency on the left and Magma on the right (yellow is the 50th percentile, green is the 90th percentile) when using SetWithMeta. Comparing with “set_latency_couchstore_vs_magma.png”, you can see that SetWithMeta latency is much higher than Set latency when using Magma.

Safian Ali January 3, 2024 at 6:38 PM

The following methods show a significant slowdown (~5x) when the storage engine is Magma instead of Couchstore:

  • cbbackupmgr restore

  • cbbackupmgr generate

  • cbdatarecovery

However, cbc-pillowfight shows the same level of performance regardless of the storage engine used. Either something is wrong with the backup code, or there is something about gocbcore which makes it slower with Magma (pillowfight uses the C SDK). I'm skeptical of the former as AFAICT the backup code works the same way when sending docs to the cluster regardless of what storage engine is used. 
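For reference, a minimal way to reproduce the tool-level comparison might look like the following sketch (the archive and repository paths, host, bucket name, and credentials are placeholders; the exact flags used for the original runs are not recorded here):

```bash
# Time a restore of an existing backup repository into the target bucket.
time cbbackupmgr restore -a /data/backups -r my-repo \
    -c couchbase://HOST -u USER -p PASS

# Time an equivalent cbc-pillowfight load against the same bucket for comparison
# (10M items of ~1KB each, roughly 10GB).
time cbc-pillowfight -U couchbase://HOST/magma-bucket -u USER -P PASS \
    --populate-only -I 10000000 -m 1024 -M 1024
```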

More ideas:

  • Make a basic perf testing program akin to pillowfight that can be used to stress-test a cluster. I already tried this with gocb, but the ops/sec was too low to see any difference. Might have to try with gocbcore, as is done in the backup code.

  • Compare memory profiles with different storage engines - where is the slowdown happening?

Done

Details

Assignee

Reporter

Story Points

Priority

Created October 12, 2023 at 3:04 PM
Updated September 10, 2024 at 11:24 PM
Resolved January 9, 2024 at 12:08 PM