Avg. ingestion rate of 999 collections drops from 195K items/s to 57K items/s on build 7.6.0-1516

Description

Avg. ingestion rate (items/sec), 4 nodes, BigFUN 20M users (320M docs), 999 indexes, SSD, s=1 c=999

Affects versions

Fix versions

Environment

None

Link to Log File, atop/blg, CBCollectInfo, Core dump

None

Release Notes Description

None

Activity


Mohammad Zaeem October 31, 2023 at 2:20 PM

Resolving by reverting dcp_backfill_byte_limit back to 20MB, which makes the regression no longer visible. Closing this ticket and creating a new task to investigate why changing dcp_backfill_byte_limit to 2MB causes this regression.

Dave Rigby October 25, 2023 at 2:39 PM

As per the analysis above, CBAS appears to be making one StreamRequest per vbucket per collection, for a total of 999 * 1024 = 1,022,976 ActiveStreams across the 3 KV nodes.

While not necessarily directly related to the regression here, that is an extremely large number of ActiveStream objects for KV to manage. For a start each one consumes ~1KB of heap; so that's over 1GB of Bucket quota being consumed just to manage all those connections; not to mention the work needed to context-switch between them. Additionally, when backfilling, KV will end up scanning the same per-vBucket index 999 times, once for each collection. A user would be likely to see much better DCP throughput (and lower resource usage on the KV nodes) if just a single Stream was used for all collections.

I recall there was some OSO discussion related to the fact that one can only use OSO for a single collection - however

As such, could you review your logic for how you set up streams for multiple collections, and see if you can consolidate the number of streams when a particularly large number of collections is being requested?
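
To put rough numbers on this, here is a minimal sketch (not CBAS code; the multi-collection filter shape and the collection IDs are assumptions for illustration) comparing one stream per vbucket per collection against a single consolidated stream per vbucket:

# Illustrative sketch only (not CBAS code): compare the stream count and rough
# heap cost of one stream per vbucket per collection versus a single stream per
# vbucket carrying all collections. Filter shape / collection IDs are assumed.
import json

NUM_COLLECTIONS = 999
NUM_VBUCKETS = 1024
HEAP_PER_ACTIVE_STREAM = 1024  # bytes; the "~1KB of heap" estimate above

# Observed behaviour: one StreamRequest per vbucket per collection.
per_collection_streams = NUM_COLLECTIONS * NUM_VBUCKETS              # 1,022,976
per_collection_heap_gb = per_collection_streams * HEAP_PER_ACTIVE_STREAM / 1024 ** 3

# Suggested alternative: one stream per vbucket with a multi-collection filter,
# e.g. {"collections": ["8", "9", ...]} (hypothetical hex collection IDs).
multi_filter = json.dumps(
    {"collections": [format(cid, "x") for cid in range(8, 8 + NUM_COLLECTIONS)]})
consolidated_streams = NUM_VBUCKETS                                   # 1,024
consolidated_heap_mb = consolidated_streams * HEAP_PER_ACTIVE_STREAM / 1024 ** 2

print(f"per-collection: {per_collection_streams:,} streams, ~{per_collection_heap_gb:.2f} GB")
print(f"consolidated:   {consolidated_streams:,} streams, ~{consolidated_heap_mb:.2f} MB")
print("example filter:", multi_filter[:60] + "...")

Under the ~1KB-per-ActiveStream estimate above, consolidation would drop the stream-metadata footprint from roughly 1GB to roughly 1MB per bucket.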

Paolo Cocchi October 6, 2023 at 5:42 AM

Extra information on the 999c test.
I've managed to pull a partial 'mcstat dcp' from a live 999c run, and this is what I got:

% less mcstat_dcp.txt | grep "passthrough" | wc -l
367640

% less mcstat_dcp.txt | grep -E "cbas.*passthrough" | tr -s ' ' | cut -d ':' -f 2,3,4,5,6 | sort | uniq -c
  15984 cbas:Local:bucket-1:22fe0ad7a3d6122d8c77227ea2077d08:24sid
  15984 cbas:Local:bucket-1:22fe0ad7a3d6122d8c77227ea2077d08:25sid
  15984 cbas:Local:bucket-1:22fe0ad7a3d6122d8c77227ea2077d08:26sid
  15984 cbas:Local:bucket-1:22fe0ad7a3d6122d8c77227ea2077d08:27sid
  15984 cbas:Local:bucket-1:22fe0ad7a3d6122d8c77227ea2077d08:28sid
  15984 cbas:Local:bucket-1:22fe0ad7a3d6122d8c77227ea2077d08:30sid
  15984 cbas:Local:bucket-1:22fe0ad7a3d6122d8c77227ea2077d08:31sid
  15984 cbas:Local:bucket-1:841ea5fa34b955b8f20493ee8fc7c5d4:16sid
  15984 cbas:Local:bucket-1:841ea5fa34b955b8f20493ee8fc7c5d4:17sid
  15984 cbas:Local:bucket-1:841ea5fa34b955b8f20493ee8fc7c5d4:18sid
  15984 cbas:Local:bucket-1:841ea5fa34b955b8f20493ee8fc7c5d4:21sid
  15984 cbas:Local:bucket-1:841ea5fa34b955b8f20493ee8fc7c5d4:22sid
  15984 cbas:Local:bucket-1:841ea5fa34b955b8f20493ee8fc7c5d4:23sid
  15984 cbas:Local:bucket-1:958844694828f72ee559643efd888fad:11sid
  15480 cbas:Local:bucket-1:958844694828f72ee559643efd888fad:15sid
  15984 cbas:Local:bucket-1:fd37c01e9de4a6ee01073890375733ae:0sid
  15984 cbas:Local:bucket-1:fd37c01e9de4a6ee01073890375733ae:1sid
  15984 cbas:Local:bucket-1:fd37c01e9de4a6ee01073890375733ae:2sid
  15984 cbas:Local:bucket-1:fd37c01e9de4a6ee01073890375733ae:3sid
  15984 cbas:Local:bucket-1:fd37c01e9de4a6ee01073890375733ae:4sid
  15984 cbas:Local:bucket-1:fd37c01e9de4a6ee01073890375733ae:5sid
  15984 cbas:Local:bucket-1:fd37c01e9de4a6ee01073890375733ae:6sid
  15984 cbas:Local:bucket-1:fd37c01e9de4a6ee01073890375733ae:7sid

So cbas is opening a bunch of connections. Then they open ~16k streams per connection.
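
For what it's worth, a back-of-envelope check of those counts (the interpretation below is an assumption, not something confirmed from the logs):

# Back-of-envelope check of the counts above. 15,984 streams per
# connection/stream-id pair is consistent with 999 collections x 16 vbuckets,
# i.e. the 1,024 vbuckets partitioned across 64 connection/stream-id pairs.
collections = 999
vbuckets = 1024
streams_per_sid = 15984

vbuckets_per_sid = streams_per_sid // collections   # 16
sids_total = vbuckets // vbuckets_per_sid            # 64
total_streams = collections * vbuckets               # 1,022,976 (matches the figure above)

print(vbuckets_per_sid, sids_total, f"{total_streams:,}")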

CB robot October 6, 2023 at 3:24 AM

Build capella-analytics-1.0.0-1040 contains kv_engine commit c3a30e8 with commit message:
MB-58742: ConnAggStatBuilder handles "eq_dcpq:<conn_type>" name format

Mohammad Zaeem October 5, 2023 at 1:40 PM
Edited

Here are the results obtained from running dcpdrain tests locally:

Test Environment:
Running on a local single-node cluster with 10 million documents and ~4% resident ratio.

The following command was used to load documents onto the cluster:

cbc-pillowfight --spec="couchbase://127.0.0.1:12000/default" --username=Administrator --password=asdasd --set-pct=100 --min-size=1024 --max-size=1024 --random-body --populate-only --num-items=10000000

 

Results:

All runs used the same dcpdrain command, with --num-connections=10 added for the 10-connection runs:

./dcpdrain -h localhost:12000 -b default -u Administrator -P asdasd --buffer-size 44739461 -a 0.2

Test                                               Ingestion Rate (items/s)
dcp_backfill_byte_limit=20MB                       57803
dcp_backfill_byte_limit=20MB                       58823
dcp_backfill_byte_limit=2MB                        57471
dcp_backfill_byte_limit=2MB                        58479
dcp_backfill_byte_limit=20MB, num-connections=10   142857
dcp_backfill_byte_limit=20MB, num-connections=10   140845
dcp_backfill_byte_limit=2MB, num-connections=10    161290
dcp_backfill_byte_limit=2MB, num-connections=10    149253

 

Further emphasising the point made above: there is no regression with dcp_backfill_byte_limit=2MB when the number of collections is small. We can also see that, averaged across these runs, performance is slightly better overall with dcp_backfill_byte_limit=2MB (mainly in the 10-connection runs).
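
A quick check of the averages, using only the measured rates from the table above:

# Quick check of the averages from the table above (measured rates only).
rates = {
    "20MB, 1 connection":   [57803, 58823],
    "2MB, 1 connection":    [57471, 58479],
    "20MB, 10 connections": [142857, 140845],
    "2MB, 10 connections":  [161290, 149253],
}

for test, values in rates.items():
    print(f"{test:22s} avg = {sum(values) / len(values):>10,.1f} items/s")
# 20MB, 1 connection     avg =   58,313.0 items/s
# 2MB, 1 connection      avg =   57,975.0 items/s
# 20MB, 10 connections   avg =  141,851.0 items/s
# 2MB, 10 connections    avg =  155,271.5 items/s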

The next step is to test with a high number of collections and DCP streams locally to hopefully reproduce the regression.
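
A rough sketch of how that local setup could be scripted (the scope/collection names, REST port and counts below are illustrative assumptions; the dcpdrain invocation is the same one used for the measurements above):

# Rough sketch for reproducing locally (not the actual test harness): create
# many collections via the collection-management REST API, then run dcpdrain
# as in the table above. Names, counts and the REST port are assumptions.
import subprocess
import requests

HOST = "http://127.0.0.1:9000"          # cluster_run REST port (assumption)
BUCKET, SCOPE = "default", "_default"
AUTH = ("Administrator", "asdasd")

# Create 999 collections in the default scope.
for i in range(999):
    requests.post(
        f"{HOST}/pools/default/buckets/{BUCKET}/scopes/{SCOPE}/collections",
        data={"name": f"collection-{i}"},
        auth=AUTH,
    ).raise_for_status()

# Drain DCP with the same command used for the measurements above.
subprocess.run(
    ["./dcpdrain", "-h", "localhost:12000", "-b", BUCKET,
     "-u", AUTH[0], "-P", AUTH[1],
     "--buffer-size", "44739461", "-a", "0.2", "--num-connections=10"],
    check=True,
)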

Fixed

Details

Assignee

Bo-Chun Wang

Reporter

Is this a Regression?

Yes

Triage

Untriaged

Story Points

0

Sprint

Priority

Created September 19, 2023 at 9:39 PM
Updated October 7, 2024 at 6:45 PM
Resolved October 31, 2023 at 2:24 PM