KV throughput against Trinity drops during rebalances

Description

KV performance against 7.6 while the cluster is rebalancing seems quite poor compared to previous server versions, and the runs often include a number of timed-out operations.

Here we're using FIT-SIT against Capella 7.6, doing a 3 -> 5 node scale:

https://performance-sdk.couchbase.com:8080/situationalSingle?situationalRunId=212f85d4-625b-4f89-b435-9642cd1e252e&runId=b61880ea-b696-4f44-a447-199f05b57a52

You can see a period of 1.5-2 minutes where the throughput drops sharply, basically to 0. There are also 72 KV errors in this time (the performer needs an improvement to record the actual time these errors happen, hence they do not show on the graph).

Compared to Capella 7.2 with the same test:

https://performance-sdk.couchbase.com:8080/situationalSingle?situationalRunId=9495c826-01d5-4af3-b4cf-88dbb716f4ab&runId=6d3fb2e0-26a1-46ce-8557-f06bc068e540

No errors, and very little throughput disturbance.

This seems to happen for any scale up or down, and we see similar things for 7.6 on SDKD, though performance does generally seem to be better with SDKD.

Note that in all test cases the SDK does eventually re-establish the same throughput as before the cluster change.
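
On the performer improvement mentioned above (recording when each KV error happens so it can be plotted alongside throughput), here is a minimal sketch of one possible approach. It is not the FIT performer's actual code: the `ErrorTimeline` class, its methods, and the error names used in the example are all hypothetical and chosen only for illustration. It counts errors per one-second bucket since the start of the run.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List, Optional
import time


@dataclass
class ErrorTimeline:
    """Hypothetical recorder: counts KV errors per one-second bucket."""
    start: float = field(default_factory=time.monotonic)
    buckets: Counter = field(default_factory=Counter)

    def record(self, error_name: str, now: Optional[float] = None) -> None:
        # Bucket index is the number of whole seconds since the run started.
        elapsed = (time.monotonic() if now is None else now) - self.start
        self.buckets[(int(elapsed), error_name)] += 1

    def series(self, error_name: str, duration_secs: int) -> List[int]:
        # One count per second, zero-filled so it can be overlaid
        # on a per-second throughput series of the same length.
        return [self.buckets[(s, error_name)] for s in range(duration_secs)]


# Example with explicit timestamps (error names are illustrative only).
timeline = ErrorTimeline(start=0.0)
timeline.record("Timeout", now=1.2)
timeline.record("Timeout", now=1.9)
timeline.record("NotMyVBucket", now=3.5)
print(timeline.series("Timeout", 5))  # [0, 2, 0, 0, 0]
```

With something along these lines, the 72 errors could be placed on the same time axis as the throughput drop instead of being reported only as a total.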

Environment

None

Gerrit Reviews

None

Release Notes Description

None

Attachments

Linked issues

is caused by

NCBC-3725

ConfigPushHandler doesn't degrade gracefully under massive config push spam

relates

NCBC-3734

NRE causes configs to be skipped while scaling up or down

NCBC-3732

ConfigPushHandler skipping all clustermap revisions

Activity


Jeffry Morris April 10, 2024 at 7:12 PM

Will Broadbelt March 22, 2024 at 11:04 AM

I've run with your patch on the 3 -> 5 scale on 7.6 Capella, and am getting NMVBs (NotMyVBucket): https://performance-sdk.couchbase.com:8080/situationalSingle?situationalRunId=948db831-0efd-425d-86f6-06b944fca8d4&runId=14dac56c-fc79-45c9-9fa4-95d2296d6a9e

And logs:

Richard Ponton March 22, 2024 at 6:19 AM

I have some scattershot fixes and extra logging in this change, if you care to test.

https://review.couchbase.org/c/couchbase-net-client/+/207566

I think it fixes the repeated NotMyVBucket problem, but not necessarily all of the timeouts. I have theories on fixing the rest, but haven't gotten to them yet.

Jeffry Morris March 21, 2024 at 2:44 AM
Edited

See my notes in - we have a fix .

Will Broadbelt March 20, 2024 at 3:51 PM

Generally the results look better:

7.6 .NET: https://performance-sdk.couchbase.com:8080/situationalRun?situationalRunId=31b001e9-a6a1-4bd2-aead-94bae44b9d8c
+ rerun of 3 to 5: https://performance-sdk.couchbase.com:8080/situationalRun?situationalRunId=47274c15-65bf-4ea6-907e-37144dc0c267

7.2 .NET: https://performance-sdk.couchbase.com:8080/situationalRun?situationalRunId=df50f5f6-2e5a-445a-8108-b213f344aedb
+ rerun of 3 to 4: https://performance-sdk.couchbase.com:8080/situationalRun?situationalRunId=befe8f47-e8a2-440a-b8dd-60293be88d6d

Some of these tests show errors, but when I subsequently re-ran them they had no issues, so the errors seem intermittent.
That said, the 7.2 3 to 4 scale is pretty bad in that first test. I haven't been able to reproduce it yet and get SDK logs.

I'm going to keep rerunning to see if I can get it to happen again with logs.

Duplicate


Details

Assignee

Will Broadbelt

Reporter

Will Broadbelt

Story Points

Fix versions

3.5.1

Priority

Test Blocker


Created March 12, 2024 at 10:58 AM

Updated October 25, 2024 at 2:10 PM

Resolved April 10, 2024 at 7:12 PM
