Achieve 300ms recovery time in Config Push

Description

  1. Initialize ConfigPushHandler via DI instead of passing it as a cross-cutting dependency through constructors.

  2. When a ClusterNode goes down, it should immediately cancel all in-flight requests on that node.  This could potentially let the system actually recover faster, as it doesn't have to wait for those requests to time out before retrying them against a new node.

  3. Fallback/failsafe - given the remote possibility that we failed to apply a received update, we need to periodically check that our active config version matches the desired config version, then re-fetch if we're out of date.

  4. Make sure 0xD is treated as a signal to the fallback that we're "dirty" and need to refresh. This covers the edge case where we've bootstrapped against a config-only node.
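
Items 3 and 4 can be sketched roughly as follows. This is an illustration in Python, not the SDK's code: ConfigFallback, ConfigVersion, fetch_config and the epoch/revision version fields are all hypothetical names and assumptions, and the only value taken from this ticket is the 0xD status. A timer would call check_once() at whatever interval is chosen.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional, Tuple

CONFIG_ONLY_STATUS = 0x0D  # the 0xD status from item 4


@dataclass(frozen=True, order=True)
class ConfigVersion:
    # Assumption: versions compare by (epoch, revision), in that order.
    epoch: int
    revision: int


class ConfigFallback:
    """Hypothetical failsafe that re-fetches when the active config is stale."""

    def __init__(self, fetch_config: Callable[[], Tuple[ConfigVersion, Any]]) -> None:
        self._fetch_config = fetch_config  # hypothetical fetch callback
        self.active: Optional[ConfigVersion] = None   # version we applied
        self.desired: Optional[ConfigVersion] = None  # version we were told about
        self.dirty = False

    def on_push_notification(self, version: ConfigVersion) -> None:
        # Remember the newest version the cluster has announced.
        if self.desired is None or version > self.desired:
            self.desired = version

    def on_config_applied(self, version: ConfigVersion) -> None:
        if self.active is None or version > self.active:
            self.active = version

    def on_response_status(self, status: int) -> None:
        # A config-only response marks us dirty regardless of versions.
        if status == CONFIG_ONLY_STATUS:
            self.dirty = True

    def needs_refresh(self) -> bool:
        if self.dirty:
            return True
        if self.desired is None:
            return False
        return self.active is None or self.active < self.desired

    def check_once(self) -> bool:
        """Run one fallback check; return True if a re-fetch happened."""
        if not self.needs_refresh():
            return False
        version, _payload = self._fetch_config()
        self.on_config_applied(version)
        self.dirty = False
        return True
```

If the fetched version is still older than the desired one, needs_refresh() stays true and the next periodic check fetches again.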

Environment

None

Gerrit Reviews

None

Release Notes Description

None

Activity

Richard Ponton November 9, 2023 at 4:44 PM

It turns out that the code as-is is performant, but the test code needs to be updated.

Jeffry Morris November 3, 2023 at 4:13 PM

Be aware that if the node "goes down", the circuit breaker will/should be tripped and ops will be routed to the RetryOrchestrator and retried until failure, timeout, or success. In a node swap/rebalance the circuit breaker may never trip, because the client receives a new config and adjusts quickly to the new topology. That being said, if you're seeing ops time out after the full 2.5s without a retry or the circuit breaker tripping, that is suspect.
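
For readers unfamiliar with the pattern, a generic circuit breaker can be sketched as below. This is a minimal Python illustration of the general technique, not the SDK's implementation; the class name, the consecutive-failure threshold and the sleep window are made-up assumptions.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Generic sketch: trips after N consecutive failures, probes after a pause."""

    def __init__(
        self,
        failure_threshold: int = 3,     # assumed value, for illustration only
        sleep_window: float = 1.0,      # assumed value, seconds
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._threshold = failure_threshold
        self._sleep_window = sleep_window
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    @property
    def is_open(self) -> bool:
        return self._opened_at is not None

    def allow_request(self) -> bool:
        # Closed: everything passes. Open: only allow a probe once the
        # sleep window has elapsed (the "half-open" state).
        if self._opened_at is None:
            return True
        return self._clock() - self._opened_at >= self._sleep_window

    def record_success(self) -> None:
        # Any success closes the breaker and resets the failure count.
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self._threshold:
            # Trip, or re-trip after a failed probe.
            self._opened_at = self._clock()
```

While the breaker is open, an attempt against that node fails immediately, so the surrounding retry logic can move on instead of waiting for a timeout.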

Richard Ponton November 3, 2023 at 7:45 AM

Right. We can't just trigger the cancellation token, because that would cancel the entire request. However, I was seeing failed operations take the full KV timeout to fail, which is unnecessary when the node is down.

If we could fast-cancel only the individual attempt, that would speed things up.
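
A rough sketch of that idea, written in Python with asyncio purely for illustration (it is not the SDK's code, and Node, NodeDown, attempt, mark_down and execute are hypothetical names): each attempt races against a per-node "down" signal, so only the attempt is abandoned while the overall request keeps retrying until it succeeds or its own timeout expires.

```python
import asyncio
from typing import List


class NodeDown(Exception):
    """Raised to the retry loop when an attempt is abandoned."""


class Node:
    # Create instances inside a running event loop (older Python versions
    # bind asyncio.Event to a loop at construction time).
    def __init__(self, name: str, latency: float) -> None:
        self.name = name
        self.latency = latency  # simulated response time in seconds
        self._down = asyncio.Event()

    @property
    def is_down(self) -> bool:
        return self._down.is_set()

    def mark_down(self) -> None:
        # Wakes every in-flight attempt on this node at once.
        self._down.set()

    async def _send(self, op: str) -> str:
        await asyncio.sleep(self.latency)  # stand-in for network I/O
        return f"{op} ok from {self.name}"

    async def attempt(self, op: str) -> str:
        send = asyncio.ensure_future(self._send(op))
        down = asyncio.ensure_future(self._down.wait())
        try:
            done, _ = await asyncio.wait(
                {send, down}, return_when=asyncio.FIRST_COMPLETED
            )
            if send in done:
                return send.result()
            raise NodeDown(self.name)  # fail this attempt only
        finally:
            send.cancel()
            down.cancel()


async def execute(op: str, nodes: List[Node], overall_timeout: float) -> str:
    async def retry_loop() -> str:
        while True:
            for node in nodes:
                if node.is_down:
                    continue
                try:
                    return await node.attempt(op)
                except NodeDown:
                    continue  # retry against the next node immediately
            await asyncio.sleep(0.01)  # brief pause when no node is usable

    # The request-level timeout is untouched by per-attempt cancellation.
    return await asyncio.wait_for(retry_loop(), overall_timeout)
```

A caller would use something like `await execute("get", nodes, overall_timeout=2.5)`; an attempt stuck on a node that goes down is dropped as soon as mark_down() is called, rather than after the full timeout.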

Jeffry Morris November 1, 2023 at 8:33 PM

When a ClusterNode goes down, it should immediately cancel all in-flight requests on that node. This could potentially let the system actually recover faster, as it doesn't have to wait for those requests to time out before retrying them against a new node.

They cannot just be canceled; they need to go through the retry loop until success, a non-recoverable failure, or timeout.

Fixed

Created October 31, 2023 at 2:33 AM
Updated December 1, 2023 at 9:43 PM
Resolved December 1, 2023 at 9:43 PM