Avg. ingestion rate of 999 collections drops from 195K items/s to 57K items/s on build 7.6.0-1516

Description

Avg. ingestion rate (items/sec), 4 nodes, BigFUN 20M users (320M docs), 999 indexes, SSD, s=1 c=999

Affects versions

Fix versions

Environment

None

Link to Log File, atop/blg, CBCollectInfo, Core dump

None

Release Notes Description

None

Activity


Mohammad Zaeem October 31, 2023 at 2:20 PM

Resolving by reverting dcp_backfill_byte_limit back to 20MB, which makes the regression no longer visible. Closing this ticket and creating a new task to investigate why changing dcp_backfill_byte_limit to 2MB causes this regression.

Dave Rigby October 25, 2023 at 2:39 PM

As per the analysis above, CBAS appears to be making one StreamRequest per vbucket per collection, for a total of 999 * 1024 = 1,022,976 ActiveStreams across the 3 KV nodes.

While not necessarily directly related to the regression here, that is an extremely large number of ActiveStream objects for KV to manage. For a start each one consumes ~1KB of heap; so that's over 1GB of Bucket quota being consumed just to manage all those connections; not to mention the work needed to context-switch between them. Additionally, when backfilling, KV will end up scanning the same per-vBucket index 999 times, once for each collection. A user would be likely to see much better DCP throughput (and lower resource usage on the KV nodes) if just a single Stream was used for all collections.

I recall there was some OSO discussion related to the fact that one can only use OSO for a single collection - however

As such, could you review your logic for how you set up streams for multiple collections, and see if you can consolidate the number of streams when a particularly large number of collections is being requested?
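
To put rough numbers on this, here is a minimal sketch (not CBAS code; the multi-collection filter shape and the collection IDs are assumptions for illustration) comparing one stream per vbucket per collection against a single consolidated stream per vbucket:

# Illustrative sketch only (not CBAS code): compare the stream count and rough
# heap cost of one stream per vbucket per collection versus a single stream per
# vbucket carrying all collections. Filter shape / collection IDs are assumed.
import json

NUM_COLLECTIONS = 999
NUM_VBUCKETS = 1024
HEAP_PER_ACTIVE_STREAM = 1024  # bytes; the "~1KB of heap" estimate above

# Observed behaviour: one StreamRequest per vbucket per collection.
per_collection_streams = NUM_COLLECTIONS * NUM_VBUCKETS              # 1,022,976
per_collection_heap_gb = per_collection_streams * HEAP_PER_ACTIVE_STREAM / 1024 ** 3

# Suggested alternative: one stream per vbucket with a multi-collection filter,
# e.g. {"collections": ["8", "9", ...]} (hypothetical hex collection IDs).
multi_filter = json.dumps(
    {"collections": [format(cid, "x") for cid in range(8, 8 + NUM_COLLECTIONS)]})
consolidated_streams = NUM_VBUCKETS                                   # 1,024
consolidated_heap_mb = consolidated_streams * HEAP_PER_ACTIVE_STREAM / 1024 ** 2

print(f"per-collection: {per_collection_streams:,} streams, ~{per_collection_heap_gb:.2f} GB")
print(f"consolidated:   {consolidated_streams:,} streams, ~{consolidated_heap_mb:.2f} MB")
print("example filter:", multi_filter[:60] + "...")

Under the ~1KB-per-ActiveStream estimate above, consolidation would drop the stream-metadata footprint from roughly 1GB to roughly 1MB per bucket.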

Paolo Cocchi October 6, 2023 at 5:42 AM

Extra information on the 999c test.
I've managed to pull a partial 'mcstat dcp' from a live 999c run, and this is what I got:

% less mcstat_dcp.txt | grep "passthrough" | wc -l
367640

% less mcstat_dcp.txt | grep -E "cbas.*passthrough" | tr -s ' ' | cut -d ':' -f 2,3,4,5,6 | sort | uniq -c
  15984 cbas:Local:bucket-1:22fe0ad7a3d6122d8c77227ea2077d08:24sid
  15984 cbas:Local:bucket-1:22fe0ad7a3d6122d8c77227ea2077d08:25sid
  15984 cbas:Local:bucket-1:22fe0ad7a3d6122d8c77227ea2077d08:26sid
  15984 cbas:Local:bucket-1:22fe0ad7a3d6122d8c77227ea2077d08:27sid
  15984 cbas:Local:bucket-1:22fe0ad7a3d6122d8c77227ea2077d08:28sid
  15984 cbas:Local:bucket-1:22fe0ad7a3d6122d8c77227ea2077d08:30sid
  15984 cbas:Local:bucket-1:22fe0ad7a3d6122d8c77227ea2077d08:31sid
  15984 cbas:Local:bucket-1:841ea5fa34b955b8f20493ee8fc7c5d4:16sid
  15984 cbas:Local:bucket-1:841ea5fa34b955b8f20493ee8fc7c5d4:17sid
  15984 cbas:Local:bucket-1:841ea5fa34b955b8f20493ee8fc7c5d4:18sid
  15984 cbas:Local:bucket-1:841ea5fa34b955b8f20493ee8fc7c5d4:21sid
  15984 cbas:Local:bucket-1:841ea5fa34b955b8f20493ee8fc7c5d4:22sid
  15984 cbas:Local:bucket-1:841ea5fa34b955b8f20493ee8fc7c5d4:23sid
  15984 cbas:Local:bucket-1:958844694828f72ee559643efd888fad:11sid
  15480 cbas:Local:bucket-1:958844694828f72ee559643efd888fad:15sid
  15984 cbas:Local:bucket-1:fd37c01e9de4a6ee01073890375733ae:0sid
  15984 cbas:Local:bucket-1:fd37c01e9de4a6ee01073890375733ae:1sid
  15984 cbas:Local:bucket-1:fd37c01e9de4a6ee01073890375733ae:2sid
  15984 cbas:Local:bucket-1:fd37c01e9de4a6ee01073890375733ae:3sid
  15984 cbas:Local:bucket-1:fd37c01e9de4a6ee01073890375733ae:4sid
  15984 cbas:Local:bucket-1:fd37c01e9de4a6ee01073890375733ae:5sid
  15984 cbas:Local:bucket-1:fd37c01e9de4a6ee01073890375733ae:6sid
  15984 cbas:Local:bucket-1:fd37c01e9de4a6ee01073890375733ae:7sid

So cbas is opening a bunch of connections. Then they open ~16k streams per connection.
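
For what it's worth, a back-of-envelope check of those counts (the interpretation below is an assumption, not something confirmed from the logs):

# Back-of-envelope check of the counts above. 15,984 streams per
# connection/stream-id pair is consistent with 999 collections x 16 vbuckets,
# i.e. the 1,024 vbuckets partitioned across 64 connection/stream-id pairs.
collections = 999
vbuckets = 1024
streams_per_sid = 15984

vbuckets_per_sid = streams_per_sid // collections   # 16
sids_total = vbuckets // vbuckets_per_sid            # 64
total_streams = collections * vbuckets               # 1,022,976 (matches the figure above)

print(vbuckets_per_sid, sids_total, f"{total_streams:,}")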

CB robot October 6, 2023 at 3:24 AM

Build capella-analytics-1.0.0-1040 contains kv_engine commit c3a30e8 with commit message:
MB-58742: ConnAggStatBuilder handles "eq_dcpq:<conn_type>" name format

Mohammad Zaeem October 5, 2023 at 1:40 PM
Edited

Here are the results obtained from running dcpdrain tests locally:

Test Environment:
Running on a local single-node cluster with 10 million documents and ~4% resident ratio.

The following command was used to load documents onto the cluster:

cbc-pillowfight --spec="couchbase://127.0.0.1:12000/default" --username=Administrator --password=asdasd --set-pct=100 --min-size=1024 --max-size=1024 --random-body --populate-only --num-items=10000000

 

Results:

All runs used the same dcpdrain command, with --num-connections=10 added for the 10-connection runs:

./dcpdrain -h localhost:12000 -b default -u Administrator -P asdasd --buffer-size 44739461 -a 0.2

Test                                               Ingestion Rate (items/s)
dcp_backfill_byte_limit=20MB                       57803
dcp_backfill_byte_limit=20MB                       58823
dcp_backfill_byte_limit=2MB                        57471
dcp_backfill_byte_limit=2MB                        58479
dcp_backfill_byte_limit=20MB, num-connections=10   142857
dcp_backfill_byte_limit=20MB, num-connections=10   140845
dcp_backfill_byte_limit=2MB, num-connections=10    161290
dcp_backfill_byte_limit=2MB, num-connections=10    149253

 

Further emphasising the point made above: there is no regression with dcp_backfill_byte_limit=2MB when the number of collections is small. We can also see that, averaged across these runs, performance is slightly better overall with dcp_backfill_byte_limit=2MB (mainly in the 10-connection runs).
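
A quick check of the averages, using only the measured rates from the table above:

# Quick check of the averages from the table above (measured rates only).
rates = {
    "20MB, 1 connection":   [57803, 58823],
    "2MB, 1 connection":    [57471, 58479],
    "20MB, 10 connections": [142857, 140845],
    "2MB, 10 connections":  [161290, 149253],
}

for test, values in rates.items():
    print(f"{test:22s} avg = {sum(values) / len(values):>10,.1f} items/s")
# 20MB, 1 connection     avg =   58,313.0 items/s
# 2MB, 1 connection      avg =   57,975.0 items/s
# 20MB, 10 connections   avg =  141,851.0 items/s
# 2MB, 10 connections    avg =  155,271.5 items/s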

The next step is to test with a high number of collections and DCP streams locally to hopefully reproduce the regression.
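
A rough sketch of how that local setup could be scripted (the scope/collection names, REST port and counts below are illustrative assumptions; the dcpdrain invocation is the same one used for the measurements above):

# Rough sketch for reproducing locally (not the actual test harness): create
# many collections via the collection-management REST API, then run dcpdrain
# as in the table above. Names, counts and the REST port are assumptions.
import subprocess
import requests

HOST = "http://127.0.0.1:9000"          # cluster_run REST port (assumption)
BUCKET, SCOPE = "default", "_default"
AUTH = ("Administrator", "asdasd")

# Create 999 collections in the default scope.
for i in range(999):
    requests.post(
        f"{HOST}/pools/default/buckets/{BUCKET}/scopes/{SCOPE}/collections",
        data={"name": f"collection-{i}"},
        auth=AUTH,
    ).raise_for_status()

# Drain DCP with the same command used for the measurements above.
subprocess.run(
    ["./dcpdrain", "-h", "localhost:12000", "-b", BUCKET,
     "-u", AUTH[0], "-P", AUTH[1],
     "--buffer-size", "44739461", "-a", "0.2", "--num-connections=10"],
    check=True,
)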

Fixed

Details

Assignee

Bo-Chun Wang

Reporter

Is this a Regression?

Yes

Triage

Untriaged

Story Points

0

Sprint

Priority

Created September 19, 2023 at 9:39 PM
Updated October 7, 2024 at 6:45 PM
Resolved October 31, 2023 at 2:24 PM