Avg. ingestion rate of 999 collections drops from 195K items/s to 57K items/s on build 7.6.0-1516
Description
Components
Affects versions
Fix versions
Environment
Link to Log File, atop/blg, CBCollectInfo, Core dump
Release Notes Description
Activity
Mohammad Zaeem October 31, 2023 at 2:20 PM
Resolving by reverting dcp_backfill_byte_limit back to 20MB so that the regression is no longer visible. Closing this ticket and creating a new task to investigate why changing dcp_backfill_byte_limit to 2MB causes this regression.
Dave Rigby October 25, 2023 at 2:39 PM
As per the analysis above, CBAS appears to be making one StreamRequest per vBucket per collection, for a total of 999 * 1024 = 1,022,976 ActiveStreams across the 3 KV nodes.
While not necessarily directly related to the regression here, that is an extremely large number of ActiveStream objects for KV to manage. For a start, each one consumes ~1KB of heap, so over 1GB of Bucket quota is consumed just to manage all those streams, not to mention the work needed to context-switch between them. Additionally, when backfilling, KV will end up scanning the same per-vBucket index 999 times, once for each collection. A user would likely see much better DCP throughput (and lower resource usage on the KV nodes) if a single stream were used for all collections.
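As a rough back-of-the-envelope illustration of those numbers (the ~1KB-per-stream figure is the approximation quoted above, not a measured value):

```python
# Rough footprint estimate for one ActiveStream per vBucket per collection.
collections = 999
vbuckets = 1024
bytes_per_stream = 1024  # assumption: ~1 KiB of heap per ActiveStream

total_streams = collections * vbuckets             # 1,022,976 streams across the cluster
heap_bytes = total_streams * bytes_per_stream      # ~1 GiB of Bucket quota

print(f"streams: {total_streams:,}")               # streams: 1,022,976
print(f"heap:    {heap_bytes / 1024**3:.2f} GiB")  # heap:    0.98 GiB
```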
I recall there was some OSO discussion related to the fact that one can only use OSO for a single collection; however:
(a) if you're streaming 999 collections from a single vBucket, then a single by-seqno scan will be much quicker than 999x OSO scans, and
(b) KV has since relaxed the constraint that only a single collection can be returned via OSO.
As such, could you review your logic for how you set up streams for multiple collections and see if you can consolidate them when a particularly large number of collections is being requested?
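For illustration only, a minimal sketch of what a consolidated stream-request filter could look like; the {"collections": [...]} JSON filter format is an assumption here, and the collection IDs are made up rather than taken from this ticket:

```python
import json

# Hypothetical hex-encoded collection IDs; real IDs come from the collections manifest.
collection_ids = ["8", "9", "a", "b"]

# One stream request covering several collections instead of one stream per collection.
stream_filter = json.dumps({"collections": collection_ids})

print(stream_filter)  # {"collections": ["8", "9", "a", "b"]}
```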
Paolo Cocchi October 6, 2023 at 5:42 AM
Extra information on the 999c test.
I've managed to pull a partial 'mcstat dcp' from a live 999c run and this is what I got:
So CBAS is opening a number of connections, and each connection then opens ~16k streams.
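As a rough sanity check of those figures (treating ~16k streams per connection as a cluster-wide average, which is an assumption):

```python
# Back-of-the-envelope check of the observation above; all figures approximate.
total_streams = 999 * 1024       # ~1.02M ActiveStreams across the 3 KV nodes
streams_per_connection = 16_000  # "~16k streams per connection" from the mcstat output

connections = total_streams / streams_per_connection
print(f"~{connections:.0f} DCP connections cluster-wide")  # ~64
```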
CB robot October 6, 2023 at 3:24 AM
Build capella-analytics-1.0.0-1040 contains kv_engine commit c3a30e8 with commit message:
: ConnAggStatBuilder handles "eq_dcpq:<conn_type>" name format
Mohammad Zaeem October 5, 2023 at 1:40 PM (Edited)
Here are the results obtained from running dcpdrain tests locally:
Test Environment:
Running on a local single node cluster with 10 million documents and ~4% resident ratio.
The following command was used to load documents onto the cluster:
Results:
Test | dcpdrain Command | Ingestion Rate (items/s)
---|---|---
dcp_backfill_byte_limit=20MB | | 57803
dcp_backfill_byte_limit=20MB | | 58823
dcp_backfill_byte_limit=2MB | | 57471
dcp_backfill_byte_limit=2MB | | 58479
dcp_backfill_byte_limit=20MB | | 142857
dcp_backfill_byte_limit=20MB | | 140845
dcp_backfill_byte_limit=2MB | | 161290
dcp_backfill_byte_limit=2MB | | 149253
Further emphasising the point made above: there is no regression with dcp_backfill_byte_limit=2MB when the number of collections is small. In fact, average performance is slightly improved with dcp_backfill_byte_limit=2MB across the various tests.
The next step is to test with a high number of collections and DCP streams locally to hopefully reproduce the regression.
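To help with that local repro, a minimal sketch of creating a large number of collections via the cluster-management REST API; the host, bucket/scope names, and credentials are assumptions, not taken from the test setup above:

```python
import requests  # third-party; pip install requests

# Hypothetical local single-node setup.
HOST = "http://127.0.0.1:8091"
BUCKET, SCOPE = "bucket-1", "scope-1"
AUTH = ("Administrator", "password")

# Create 999 collections in one scope, mirroring the s=1 c=999 perf configuration.
for i in range(999):
    resp = requests.post(
        f"{HOST}/pools/default/buckets/{BUCKET}/scopes/{SCOPE}/collections",
        data={"name": f"collection-{i}"},
        auth=AUTH,
    )
    resp.raise_for_status()
```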
Avg. ingestion rate (items/sec), 4 nodes, BigFUN 20M users (320M docs), 999 indexes, SSD, s=1 c=999

Build | Ingestion rate | Job
---|---|---
7.6.0-1483 | 195,230 | http://perf.jenkins.couchbase.com/job/oceanus/12072/
7.6.0-1516 | 58,413 | http://perf.jenkins.couchbase.com/job/oceanus/12117/
7.6.0-1516 | 57,183 | http://perf.jenkins.couchbase.com/job/oceanus/12120/