Achieve 300ms recovery time in Config Push

Description

Initialize ConfigPushHandler via DI instead of cross-cutting dependency via constructor.
When a ClusterNode goes down, it should immediately cancel all in-flight requests on that node. This could potentially let the system actually recover faster, as it doesn't have to wait for those requests to time out before retrying them against a new node.
Fallback/failsafe - given the remote possibility that we failed to apply a received update, we need to periodically check that our active config version matches the desired config version, then re-fetch if we're out of date.
Make sure 0xD is a signal to the fallback that we're "dirty" and need to refresh, for the edge case where we've bootstrapped against a config-only node.
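
As a rough illustration of the fallback described above, here is a minimal sketch in Python (not the SDK's language, and not the actual implementation). The class `ConfigFallback`, its methods, and the `fetch_config` callback are all hypothetical names. How 0xD is surfaced is not specified in this ticket, so the "dirty" signal is modeled as a plain boolean flag.

```python
import asyncio
from typing import Awaitable, Callable


class ConfigFallback:
    """Hypothetical failsafe: re-fetch when the active config is stale or dirty."""

    def __init__(self, fetch_config: Callable[[], Awaitable[int]], active_version: int = 0):
        # fetch_config is an assumed callback that fetches and applies the
        # latest config, then returns the version it applied.
        self._fetch_config = fetch_config
        self.active_version = active_version
        self.desired_version = active_version
        self.dirty = False  # assumed stand-in for the "dirty" signal

    def on_version_announced(self, version: int) -> None:
        # Record the newest version we have been told about.
        self.desired_version = max(self.desired_version, version)

    def mark_dirty(self) -> None:
        self.dirty = True

    def needs_refresh(self) -> bool:
        return self.dirty or self.active_version < self.desired_version

    async def check_once(self) -> bool:
        """Re-fetch if out of date; return True when a re-fetch happened."""
        if not self.needs_refresh():
            return False
        self.active_version = await self._fetch_config()
        self.dirty = False
        return True

    async def run(self, interval_seconds: float) -> None:
        # Periodic loop; a failed fetch is simply retried on the next tick.
        while True:
            try:
                await self.check_once()
            except Exception:
                pass
            await asyncio.sleep(interval_seconds)
```

The check is deliberately cheap when nothing is wrong: it only fetches when the versions disagree or the dirty flag is set, so a short interval should not add load in the common case.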

Environment

None

Gerrit Reviews

None

Release Notes Description

None

Activity

Richard Ponton November 9, 2023 at 4:44 PM

It turns out that the code as-is is performant, but the test code needs to be updated.

Jeffry Morris November 3, 2023 at 4:13 PM

Be aware that if the node "goes down", the circuit breaker will/should be tripped and ops will be routed to the RetryOrchestrator and retried until failure, timeout, or success. In a node swap/rebalance, the circuit breaker may never trip because the client receives a new config and adjusts quickly to the new topology. That being said, if you're seeing ops time out after the full 2.5s without a retry or the circuit breaker tripping, that is suspect.

Richard Ponton November 3, 2023 at 7:45 AM

Right. We can't just trigger the cancellation token, because that would cancel the entire request. However, what I was seeing was that failures could take the full KV Timeout to fail, and that is unnecessary when the node is down.

If we could fast-cancel only the individual attempt, that would speed things up.
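
To make the idea concrete, here is a minimal sketch in Python (not the SDK's language) of cancelling only the current attempt while the overall request keeps retrying until its deadline. `Node`, `NodeDownError`, `attempt_on_node`, and `execute_with_retry` are hypothetical names for illustration; this is not how the RetryOrchestrator is actually implemented.

```python
import asyncio
import time


class NodeDownError(Exception):
    """Raised when an attempt is aborted because its node went down."""


class Node:
    def __init__(self, name: str):
        self.name = name
        self._down = asyncio.Event()

    def mark_down(self) -> None:
        self._down.set()

    @property
    def is_down(self) -> bool:
        return self._down.is_set()

    async def wait_down(self) -> None:
        await self._down.wait()


async def attempt_on_node(node, send):
    """Run one attempt; abort it as soon as the node is marked down."""
    if node.is_down:
        raise NodeDownError(node.name)
    op = asyncio.ensure_future(send(node))
    down = asyncio.ensure_future(node.wait_down())
    try:
        done, _ = await asyncio.wait({op, down}, return_when=asyncio.FIRST_COMPLETED)
        if op in done:
            return op.result()
        raise NodeDownError(node.name)
    finally:
        # Only this attempt is cancelled; the caller's request stays alive.
        op.cancel()
        down.cancel()


async def execute_with_retry(pick_node, send, timeout_seconds: float):
    """Retry until success, a non-retryable error, or the overall timeout."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            raise asyncio.TimeoutError()
        node = pick_node()
        try:
            return await asyncio.wait_for(attempt_on_node(node, send), remaining)
        except NodeDownError:
            await asyncio.sleep(0)  # yield; real code would apply a backoff
```

In this sketch the request-level cancellation and timeout are untouched. A node going down only ends the attempt bound to that node, so the retry can move to another node well before the full KV timeout elapses.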

Jeffry Morris November 1, 2023 at 8:33 PM

"When a ClusterNode goes down, it should immediately cancel all in-flight requests on that node. This could potentially let the system actually recover faster, as it doesn't have to wait for those requests to time out before retrying them against a new node."

They cannot just be canceled; they need to go through the retry loop until success, a non-recoverable failure, or timeout.

Fixed

Details

Assignee

Richard Ponton (Deactivated)

Reporter

Richard Ponton (Deactivated)

Story Points

Sprint

None

Fix versions

3.4.14

Priority

Major

Created October 31, 2023 at 2:33 AM

Updated December 1, 2023 at 9:43 PM

Resolved December 1, 2023 at 9:43 PM
