KV throughput against Trinity drops during rebalances

Description

KV performance against 7.6 while the cluster is rebalancing seems quite poor compared to previous server versions, and often comes with a number of timed-out operations.

Here we're using FIT-SIT against Capella 7.6, doing a 3 -> 5 node scale:

https://performance-sdk.couchbase.com:8080/situationalSingle?situationalRunId=212f85d4-625b-4f89-b435-9642cd1e252e&runId=b61880ea-b696-4f44-a447-199f05b57a52

You can see a period of 1.5-2 minutes where throughput drops sharply, basically to 0. There are also 72 KV errors in this period (the performer needs an improvement to record the actual time these errors happen, which is why they do not show on the graph).
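As a rough illustration of the performer improvement mentioned above, the sketch below timestamps every operation outcome and buckets it per second, so errors could be plotted on the same axis as throughput. It is only a sketch: `OpRecorder`, `record` and `series` are hypothetical names and are not part of FIT or any SDK.

```python
import time
from collections import Counter
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class OpRecorder:
    """Hypothetical helper: counts successes and errors per whole second of the run."""
    start: float = field(default_factory=time.monotonic)
    successes: Counter = field(default_factory=Counter)
    errors: Counter = field(default_factory=Counter)

    def record(self, ok: bool, at: Optional[float] = None) -> None:
        # `at` is a monotonic timestamp; if omitted, the current time is used.
        now = time.monotonic() if at is None else at
        bucket = int(now - self.start)
        (self.successes if ok else self.errors)[bucket] += 1

    def series(self) -> List[Tuple[int, int, int]]:
        # One (second, successes, errors) row per second, with empty seconds as zeros.
        if not self.successes and not self.errors:
            return []
        last = max([*self.successes, *self.errors])
        return [(s, self.successes[s], self.errors[s]) for s in range(last + 1)]
```

Recording the time at the point where the error is observed, rather than only keeping a total, is what would let the 72 errors be lined up against the throughput dip.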

Compared to Capella 7.2 with the same test:

https://performance-sdk.couchbase.com:8080/situationalSingle?situationalRunId=9495c826-01d5-4af3-b4cf-88dbb716f4ab&runId=6d3fb2e0-26a1-46ce-8557-f06bc068e540

No errors, and very little throughput disturbance.

This seems to happen for any scale up or down, and we see similar things for 7.6 on SDKD, though the performance does generally seem to be better with SDKD.

Note that in all test cases the SDK does eventually re-establish the same throughput as before the cluster change.

Environment

None

Gerrit Reviews

None

Release Notes Description

None

Attachments

3

Activity


Jeffry Morris April 10, 2024 at 7:12 PM

Will Broadbelt March 22, 2024 at 11:04 AM

Richard Ponton March 22, 2024 at 6:19 AM

I have some scattershot fixes and extra logging in this change, if you care to test.

https://review.couchbase.org/c/couchbase-net-client/+/207566

I think it fixes the repeated NotMyVBucket problem, but not necessarily all of the timeouts. I have theories on fixing the rest, but haven't gotten to them yet.
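For readers unfamiliar with the NotMyVBucket case, here is a toy model of the general idea: when a node reports that it no longer owns a vBucket, the client adopts a newer cluster map before retrying, instead of repeatedly sending to the same node. This is a sketch under stated assumptions and not the code in the review above; `ToyCluster`, `ClusterConfig`, `get_with_refresh`, the 8-vBucket map and the simplified CRC32 hash are all hypothetical.

```python
import zlib
from dataclasses import dataclass
from typing import List, Tuple

NUM_VBUCKETS = 8  # toy value chosen for the example only


@dataclass
class ClusterConfig:
    rev: int                # config revision; higher means newer
    vbucket_map: List[str]  # index = vBucket id, value = owning node name


class NotMyVBucket(Exception):
    """Raised by the toy server when the addressed node does not own the vBucket."""


def vbucket_for(key: str) -> int:
    # Simplified hash for illustration; not claimed to match the real key mapping.
    return zlib.crc32(key.encode()) % NUM_VBUCKETS


class ToyCluster:
    """Hypothetical stand-in for the server side; it always holds the latest map."""

    def __init__(self, config: ClusterConfig) -> None:
        self.config = config

    def get(self, node: str, key: str) -> str:
        vb = vbucket_for(key)
        if self.config.vbucket_map[vb] != node:
            raise NotMyVBucket(f"{node} does not own vBucket {vb}")
        return f"value-of-{key}"


def get_with_refresh(cluster: ToyCluster, cached: ClusterConfig, key: str,
                     max_attempts: int = 3) -> Tuple[str, ClusterConfig, int]:
    """Returns (value, config in use afterwards, number of attempts made)."""
    for attempt in range(1, max_attempts + 1):
        node = cached.vbucket_map[vbucket_for(key)]
        try:
            return cluster.get(node, key), cached, attempt
        except NotMyVBucket:
            # Adopt a newer map before retrying so the retry targets the new owner.
            if cluster.config.rev > cached.rev:
                cached = cluster.config
    raise TimeoutError(f"gave up on {key!r} after {max_attempts} attempts")
```

In this model a client that keeps its stale map sees NotMyVBucket on every attempt until it gives up, which is one way repeated NotMyVBucket responses can end in a timeout.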

Jeffry Morris March 21, 2024 at 2:44 AM
Edited

See my notes in - we have a fix .

Will Broadbelt March 20, 2024 at 3:51 PM

Generally the results look better:

 
7.6 .NET: https://performance-sdk.couchbase.com:8080/situationalRun?situationalRunId=31b001e9-a6a1-4bd2-aead-94bae44b9d8c
  + rerun of 3 to 5: https://performance-sdk.couchbase.com:8080/situationalRun?situationalRunId=47274c15-65bf-4ea6-907e-37144dc0c267

7.2 .NET: https://performance-sdk.couchbase.com:8080/situationalRun?situationalRunId=df50f5f6-2e5a-445a-8108-b213f344aedb
  + rerun of 3 to 4: https://performance-sdk.couchbase.com:8080/situationalRun?situationalRunId=befe8f47-e8a2-440a-b8dd-60293be88d6d
 
Some of the tests show errors, but when I subsequently re-ran those tests they had no issues, so the errors seem intermittent.
That said, the 7.2 3 to 4 scale is pretty bad in that first test. I haven't been able to recreate it yet and get SDK logs.

I'm going to keep rerunning to see if I can get it to happen again with logs.

Duplicate


Created March 12, 2024 at 10:58 AM
Updated October 25, 2024 at 2:10 PM
Resolved April 10, 2024 at 7:12 PM