Implement an in-flight operation limit to provide backpressure

Description

The current design allows a single connection from the connection pool to collect a large number of in-flight operations, rather than ensuring a more even spread of operations across connections in the pool.

This can be detrimental when mixing small GET operations with large GET operations, as the small operation may be blocked waiting on the large operation to pass over the network socket. If the small operation were sent on another connection it may complete sooner. A more even spread of operations doesn't guarantee this will be the case, but does make it more likely.

Additionally, when the connection pool is flooded with requests, for example by a batch operation, it tends to recover poorly. Many of the operations will time out before they can complete, but they have already been sent to the server, so the server still executes them and they block other operations that were queued later. This effectively guarantees that the later operations will also time out, even though they could have succeeded. If operations are instead queued on the client side, and never sent once they have timed out, the SDK recovers better for the later operations.
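
As a rough sketch of the idea (the types and names below are invented for illustration, not the SDK's actual implementation): a per-connection in-flight cap can be modelled as a semaphore that must be acquired before an operation is written to the socket, so an operation whose timeout fires while it is still waiting for a slot fails locally and is never sent to the server.

using System.Threading;
using System.Threading.Tasks;

// Illustrative sketch only; these types are placeholders, not SDK types.
public sealed class Operation
{
    public Task Completed { get; init; } = Task.CompletedTask; // stands in for the server response
}

public sealed class BoundedConnection
{
    private readonly SemaphoreSlim _inFlight;

    public BoundedConnection(int maxInFlightOperations) =>
        _inFlight = new SemaphoreSlim(maxInFlightOperations, maxInFlightOperations);

    public async Task SendAsync(Operation op, CancellationToken timeoutToken)
    {
        // Back-pressure: wait for an in-flight slot. If the operation's timeout
        // fires while it is still queued here, WaitAsync throws and the operation
        // fails locally without ever reaching the server.
        await _inFlight.WaitAsync(timeoutToken).ConfigureAwait(false);
        try
        {
            await WriteToSocketAsync(op, timeoutToken).ConfigureAwait(false);
            await op.Completed.ConfigureAwait(false); // hold the slot until the response arrives
        }
        finally
        {
            _inFlight.Release();
        }
    }

    private Task WriteToSocketAsync(Operation op, CancellationToken token) =>
        Task.CompletedTask; // placeholder for the real socket write
}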

Environment

None

Gerrit Reviews

None

Release Notes Description

None

Activity

Brant Burnett May 24, 2024 at 4:37 PM

Based on this I filed a new issue: https://couchbasecloud.atlassian.net/browse/NCBC-3792#icft=NCBC-3792

Michael Reiche May 22, 2024 at 10:39 PM

I checked the support ticket and apparently the suggested changes fixed their issue.

#1 - With these settings, I’ve been running our load test in a loop:

options.MaxKvConnections = 20; // default 5

options.Tuning.MaximumInFlightOperationsPerConnection = 2000; // default 8

I’ve not yet seen any timeouts or failures. The average ops/second is marginally lower, but it’s only a few percent different, so it could be entirely down to environmental differences at the time the test is run.
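
For context, both knobs live on the ClusterOptions used when connecting. A rough sketch, assuming a recent SDK version that exposes the properties quoted above (the exact names and defaults should be checked against the version in use):

using Couchbase;

var options = new ClusterOptions
{
    UserName = "user",
    Password = "password",
    MaxKvConnections = 20 // quoted default is 5
};
options.Tuning.MaximumInFlightOperationsPerConnection = 2000; // quoted default is 8

var cluster = await Cluster.ConnectAsync("couchbase://localhost", options);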

Michael Reiche May 22, 2024 at 10:16 PM

> And you can see the nastiness that would occur if the length of _statesInFlight is long here:

The number of iterations depends on the actual number of requests-in-flight, not the size of the array. But still, not good.

What I don't understand is why, even with MaximumInFlightOperationsPerConnection=2000, the customer was still reporting SendQueueFullExceptions. Unless the additional Trace logging was making everything that slow, or those loops of up to 2000 iterations were slow enough that it fell behind. The customer said this was on a test with 1200 concurrent operations. FYI - there are other use-cases with up to 100,000 concurrent operations. And the Java SDK can handle about 2000 concurrent operations per connection before its output buffer fills up.
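
For rough context, using the defaults quoted above (5 KV connections, 8 in-flight operations per connection), at most 5 × 8 = 40 operations can be on the wire at once, so a burst of 1,200 concurrent operations leaves roughly 1,160 waiting in the send queue, which is where SendQueueFullException comes from once that queue's own bound is reached. Raising the limit to 2,000 per connection lifts the in-flight cap to 10,000 with 5 connections (or 40,000 with 20), which is why continued SendQueueFullExceptions at 1,200 concurrent operations are surprising.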

Brant Burnett May 22, 2024 at 8:53 PM

Regarding raising the limit, while it may appear to disable https://couchbasecloud.atlassian.net/browse/NCBC-3489#icft=NCBC-3489, as written today it does so with a very significant performance impact, most likely far worse than the impact of throwing SendQueueFullException. Unfortunately, the implementation was based on the theory that this value might need some minor tweaking, but not disabling. The main issue is noted in the code here:

https://github.com/couchbase/couchbase-net-client/blob/d1015befea8ac1852daec678ec9808c038ea5c48/src/Couchbase/Core/IO/Connections/InFlightOperationSet.cs#L17-L21

And you can see the nastiness that would occur if the length of _statesInFlight is long here:

https://github.com/couchbase/couchbase-net-client/blob/d1015befea8ac1852daec678ec9808c038ea5c48/src/Couchbase/Core/IO/Connections/InFlightOperationSet.cs#L116-L124

And here:

https://github.com/couchbase/couchbase-net-client/blob/d1015befea8ac1852daec678ec9808c038ea5c48/src/Couchbase/Core/IO/Connections/InFlightOperationSet.cs#L144-L163

Thus my suggestion is that we look for other changes instead, such as adjusting the connection pool size combined with a minor tweak to the in-flight maximum. If we do need to raise the in-flight maximum significantly, I think we need to make a code change first. There are two possible changes:

  • Switch back to using a ConcurrentDictionary to manage in-flight operations. This entails more locking, hashing, etc, so for a small in-flight maximum it will be marginally less performant. But it won't have the giant cliff to fall off for large values.

  • Implement two variants of InFlightOperationSet and pick one based on the size of the in-flight maximum, using ConcurrentDictionary for any value above some threshold.
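
A minimal sketch of what the dictionary-backed variant might look like (illustrative only - not the actual InFlightOperationSet code, and the member names are invented): operations are keyed by opaque, so completing one is a hash lookup rather than a scan whose cost grows with the in-flight maximum.

using System.Collections.Concurrent;
using System.Threading.Tasks;

// Illustrative only: in-flight operations keyed by opaque.
public sealed class DictionaryInFlightOperationSet
{
    private readonly ConcurrentDictionary<uint, TaskCompletionSource<byte[]>> _inFlight = new();

    public Task<byte[]> Add(uint opaque)
    {
        var tcs = new TaskCompletionSource<byte[]>(TaskCreationOptions.RunContinuationsAsynchronously);
        _inFlight[opaque] = tcs;
        return tcs.Task;
    }

    public void Complete(uint opaque, byte[] response)
    {
        // Removal cost stays roughly constant regardless of the in-flight maximum,
        // at the price of hashing and extra synchronization per operation.
        if (_inFlight.TryRemove(opaque, out var tcs))
        {
            tcs.TrySetResult(response);
        }
    }

    public int Count => _inFlight.Count;
}

Under the second option, the SDK would choose between something like this and the existing array-based set when a connection is created, based on the configured maximum, with the threshold picked by benchmarking.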

Michael Reiche May 22, 2024 at 4:27 PM

While the change works as designed and is indeed valuable (I'm looking into doing the same for the Java SDK), it changes the behavior. The most obvious change is that it can result in SendQueueFullExceptions being thrown (and logged at debug) as part of the back-pressure. If a stack trace is generated as part of the exception being thrown (I assume it is), that is a significant amount of processing to add. Even if no stack trace is generated, just the exception being thrown - and subsequent operations being retried/managed by the application - is a change in the behavior.

> I’d also advise against making the limit that large.

I believe that setting the limit to a large value effectively disables the functionality of https://couchbasecloud.atlassian.net/browse/NCBC-3489#icft=NCBC-3489 - which is what the customer wants.

> It doesn’t necessarily limit total in flight, just per connection.

It limits the total in flight to MaximumInFlightOperationsPerConnection * number-of-connections. Subsequent operations back up on the SendQueue, which eventually becomes full, resulting in an exception being thrown and the application having to manage the operations that could not be put in the SendQueue.
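
As a rough illustration of that back-pressure chain (invented names, not the SDK's actual send-queue implementation), a bounded send queue rejects new operations outright once its capacity is reached, which is the behavior surfaced as SendQueueFullException:

using System;
using System.Threading;
using System.Threading.Channels;
using System.Threading.Tasks;

// Illustrative only: a bounded send queue that fails fast when full.
public sealed class BoundedSendQueue<T>
{
    private readonly Channel<T> _queue;

    public BoundedSendQueue(int capacity) =>
        _queue = Channel.CreateBounded<T>(new BoundedChannelOptions(capacity));

    public void Enqueue(T operation)
    {
        if (!_queue.Writer.TryWrite(operation))
        {
            // Analogous to the SendQueueFullException described above: the
            // operation is rejected on the client and never reaches the server.
            throw new InvalidOperationException("Send queue is full.");
        }
    }

    public ValueTask<T> DequeueAsync(CancellationToken token = default) =>
        _queue.Reader.ReadAsync(token);
}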

Resolution: Fixed

Created September 11, 2023 at 12:45 PM
Updated May 24, 2024 at 4:37 PM
Resolved October 11, 2023 at 12:08 AM