Avg. ingestion rate of 999 collections drops from 195K items/s to 57K items/s on build 7.6.0-1516

Description

Avg. ingestion rate (items/sec), 4 nodes, BigFUN 20M users (320M docs), 999 indexes, SSD, s=1 c=999

Components

Affects versions

Fix versions

Environment

None

Link to Log File, atop/blg, CBCollectInfo, Core dump

None

Release Notes Description

None

Activity

Mohammad Zaeem October 31, 2023 at 2:20 PM

Resolving, as reverting dcp_backfill_byte_limit to 20MB means the regression is no longer visible. Closing this ticket and creating a new task to investigate why changing dcp_backfill_byte_limit to 2MB causes this regression.

Dave Rigby October 25, 2023 at 2:39 PM

As per the analysis above, CBAS appears to be making one StreamRequest per vbucket per collection, for a total of 999 * 1024 = 1,022,976 ActiveStreams across the 3 KV nodes.

While not necessarily directly related to the regression here, that is an extremely large number of ActiveStream objects for KV to manage. For a start, each one consumes ~1KB of heap, so over 1GB of Bucket quota is being consumed just to manage all those streams, not to mention the work needed to context-switch between them. Additionally, when backfilling, KV will end up scanning the same per-vBucket index 999 times, once for each collection. A user would be likely to see much better DCP throughput (and lower resource usage on the KV nodes) if just a single Stream were used for all collections.
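To make the scale concrete, here is a back-of-the-envelope sketch of the stream count and heap cost quoted above. The 1024-byte figure is the "~1KB" estimate from this comment, not a measured value, and the helper name `heap_mib` is purely illustrative.

```python
# Rough estimate of ActiveStream overhead; numbers come from this ticket.
COLLECTIONS = 999
VBUCKETS = 1024
BYTES_PER_STREAM = 1024  # ASSUMPTION: "~1KB of heap" taken as exactly 1024 bytes


def heap_mib(streams, bytes_per_stream=BYTES_PER_STREAM):
    """Approximate heap used by `streams` ActiveStream objects, in MiB."""
    return streams * bytes_per_stream / (1024 * 1024)


# One stream per (vbucket, collection) pair, as CBAS appears to do today.
per_collection_streams = COLLECTIONS * VBUCKETS
# One stream per vbucket covering all collections.
single_stream_per_vb = VBUCKETS

print(f"per-collection: {per_collection_streams:,} streams, "
      f"{heap_mib(per_collection_streams):.0f} MiB")
print(f"consolidated:   {single_stream_per_vb:,} streams, "
      f"{heap_mib(single_stream_per_vb):.0f} MiB")
```

Under these assumptions the per-collection layout costs about three orders of magnitude more stream bookkeeping than one stream per vbucket.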

I recall there was some OSO discussion related to the fact that one can only use OSO for a single collection - however

  • (a) if you're streaming 999 collections from a single vBucket then a single by-seqno scan will be much quicker than 999x OSO scans,

  • (b) KV has since relaxed the constraint that only a single collection can be returned via OSO.

As such, could you review your logic for how you set up streams for multiple collections, and see if you can consolidate them when a particularly large number of collections is being requested?
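A minimal sketch of what that consolidation could look like on the client side follows. All function names are hypothetical and this is not CBAS code. The filter shape (a JSON object with a "collections" array of hex ID strings) is an assumption about the DCP stream-request format and should be checked against the kv_engine documentation; the collection IDs are made up for illustration.

```python
import json


def per_collection_requests(vbuckets, collection_ids):
    """Current behaviour: one StreamRequest per (vbucket, collection) pair."""
    return [(vb, [cid]) for vb in vbuckets for cid in collection_ids]


def consolidated_requests(vbuckets, collection_ids):
    """Proposed behaviour: one StreamRequest per vbucket for all collections."""
    cids = list(collection_ids)
    return [(vb, cids) for vb in vbuckets]


def filter_value(collection_ids):
    # ASSUMPTION: the stream-request filter lists collection IDs as hex strings.
    return json.dumps({"collections": [format(cid, "x") for cid in collection_ids]})


vbuckets = range(1024)
collections = range(8, 8 + 999)  # illustrative IDs only

before = per_collection_requests(vbuckets, collections)
after = consolidated_requests(vbuckets, collections)
print(len(before), "->", len(after), "stream requests")
print(filter_value([8, 9, 10]))
```

With one request per vbucket, a backfill can serve all requested collections from a single by-seqno scan instead of 999 separate scans.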

Paolo Cocchi October 6, 2023 at 5:42 AM

Extra information on the 999c test.
I've managed to pull a partial 'mcstat dcp' from a live 999c run and this is what I got:

So cbas is opening a bunch of connections. Then they open ~16k streams per connection.

CB robot October 6, 2023 at 3:24 AM

Build capella-analytics-1.0.0-1040 contains kv_engine commit c3a30e8 with commit message:
: ConnAggStatBuilder handles "eq_dcpq:<conn_type>" name format

Mohammad Zaeem October 5, 2023 at 1:40 PM

Here are the results obtained from running dcpdrain tests locally:

Test Environment:
Running on a local single node cluster with 10 million documents and ~4% resident ratio.

The following command was used to load documents onto the cluster:

 

Results:

| Test | dcpdrain Command | Ingestion Rate |
|---|---|---|
| dcp_backfill_byte_limit=20MB | | 57803 |
| dcp_backfill_byte_limit=20MB | | 58823 |
| dcp_backfill_byte_limit=2MB | | 57471 |
| dcp_backfill_byte_limit=2MB | | 58479 |
| dcp_backfill_byte_limit=20MB, number-connections=10 | | 142857 |
| dcp_backfill_byte_limit=20MB, number-connections=10 | | 140845 |
| dcp_backfill_byte_limit=2MB, number-connections=10 | | 161290 |
| dcp_backfill_byte_limit=2MB, number-connections=10 | | 149253 |

Further emphasising what was said earlier: there is no regression with dcp_backfill_byte_limit=2MB when the number of collections is small. With number-connections=10 the average rate is in fact higher at 2MB (about 155K vs 142K), while in the runs without number-connections set the two limits are within about 1% of each other.
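The comparison can be checked directly from the reported rates. The sketch below averages the two runs per configuration and prints the relative change from 20MB to 2MB. The "default" label for runs without number-connections is an assumption about how those runs were configured.

```python
from statistics import mean

# Ingestion rates copied from the results in this comment.
results = {
    ("20MB", "default"): [57803, 58823],
    ("2MB", "default"): [57471, 58479],
    ("20MB", "number-connections=10"): [142857, 140845],
    ("2MB", "number-connections=10"): [161290, 149253],
}

averages = {key: mean(rates) for key, rates in results.items()}


def pct_change(conn):
    """Relative change in average rate when moving from 20MB to 2MB."""
    base = averages[("20MB", conn)]
    return 100.0 * (averages[("2MB", conn)] - base) / base


for conn in ("default", "number-connections=10"):
    print(f"{conn}: 20MB avg={averages[('20MB', conn)]}, "
          f"2MB avg={averages[('2MB', conn)]}, change={pct_change(conn):+.1f}%")
```

With only two runs per configuration, these averages indicate a direction rather than a statistically firm result.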

The next step is to test locally with a high number of collections and DCP streams, to hopefully reproduce the regression.

Fixed
Details

Assignee

Bo-Chun Wang

Reporter

Is this a Regression?

Yes

Triage

Untriaged
