Rework WaitUntilReady

Description

Suggested release note:

The waitUntilReady method is now more aggressive about retrying failed pings. Also, waiting for a desired state of DEGRADED no longer fails when the client is fully connected to the cluster.

Investigate how WaitUntilReady could be improved.

Specifically:

  1. Rework the pacemaker. The current "wait some more" logic is odd and results in redundant node health checks. Instead of being driven by Flux.interval(), perhaps we could use a retry operator.

  2. Investigate whether pings are currently being retried, and look into why we're consulting the diagnostic results – should ping alone be sufficient?
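For discussion, here is a rough sketch of the retry-driven approach from item 1, written in plain Java rather than Reactor so it stays self-contained. Everything in it is an assumption for illustration: the class name RetryUntilReady, the method signature, and the doubling backoff are hypothetical and are not the SDK's actual implementation. The point is only that each failed ping schedules exactly one follow-up attempt, instead of a fixed interval tick triggering a health check regardless of what the previous one found.

```java
import java.time.Duration;
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

// Hypothetical sketch, not the SDK's code.
final class RetryUntilReady {

  /**
   * Calls the ping until it reports success or the timeout elapses.
   * Returns the number of ping attempts that were made.
   * Throws IllegalStateException if the deadline passes first.
   */
  static int waitUntilReady(BooleanSupplier ping, Duration timeout,
                            Duration initialDelay, Duration maxDelay) {
    final long deadline = System.nanoTime() + timeout.toNanos();
    long delayNanos = Math.max(1, initialDelay.toNanos());
    int attempts = 0;

    while (true) {
      attempts++;
      if (ping.getAsBoolean()) {
        return attempts; // the ping result alone decides readiness here
      }

      long remaining = deadline - System.nanoTime();
      if (remaining <= 0) {
        throw new IllegalStateException(
            "Not ready after " + attempts + " ping attempts");
      }

      try {
        // Never sleep past the deadline.
        TimeUnit.NANOSECONDS.sleep(Math.min(delayNanos, remaining));
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new IllegalStateException("Interrupted while waiting", e);
      }

      // Assumed policy: double the delay, capped at maxDelay.
      delayNanos = Math.min(delayNanos * 2, maxDelay.toNanos());
    }
  }
}
```

In a Reactor-based implementation the same shape would presumably come from a retry operator with a backoff policy plus an overall timeout, which is what item 1 suggests investigating.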

Environment

None

Gerrit Reviews

None

Release Notes Description

None

Activity

David Nault June 15, 2023 at 10:11 PM
Edited

Currently, specifying a desired state of "DEGRADED" causes a timeout if the cluster state is actually fully "ONLINE". I suppose this is useful if you're actually waiting for the cluster to be degraded.
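One possible fix, sketched below, is to treat the desired state as a minimum rather than an exact match, so that ONLINE also satisfies a request for DEGRADED. This matches the behavior described in the suggested release note, but the code itself is an assumption: DesiredStateCheck and its nested State enum are hypothetical stand-ins, not the SDK's types.

```java
// Hypothetical sketch, not the SDK's code.
final class DesiredStateCheck {

  // Declared from least to most healthy so the ordinal order is meaningful.
  enum State { OFFLINE, DEGRADED, ONLINE }

  /** True if the actual state is at least as healthy as the desired state. */
  static boolean satisfies(State actual, State desired) {
    return actual.compareTo(desired) >= 0;
  }
}
```

With this check, satisfies(ONLINE, DEGRADED) is true, while satisfies(DEGRADED, ONLINE) and satisfies(OFFLINE, DEGRADED) are false.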

Also, perhaps we should revisit what it means to be degraded. Currently a cluster qualifies as degraded if there is more than 1 endpoint, and at least one endpoint is ONLINE. The ONLINE endpoint could be for any service.

According to the RFC, "Degraded" means "at least one socket per service is reachable". https://github.com/couchbaselabs/sdk-rfcs/blob/master/rfc/0061-sdk3-diagnostics.md#summary
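To make the difference from the current rule concrete, here is a minimal sketch of a classifier that follows the RFC wording quoted above. It is only an illustration under stated assumptions: the Endpoint class, the Health enum, and the choice to report ONLINE when every endpoint is reachable and OFFLINE when some service has no reachable endpoint are my reading, not confirmed SDK behavior.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical sketch, not the SDK's code.
final class DegradedClassifier {

  enum Health { OFFLINE, DEGRADED, ONLINE }

  /** Hypothetical endpoint: the service it belongs to and whether its socket is reachable. */
  static final class Endpoint {
    final String service;
    final boolean reachable;

    Endpoint(String service, boolean reachable) {
      this.service = service;
      this.reachable = reachable;
    }
  }

  static Health classify(List<Endpoint> endpoints) {
    if (endpoints.isEmpty()) {
      return Health.OFFLINE;
    }
    if (endpoints.stream().allMatch(e -> e.reachable)) {
      return Health.ONLINE; // assumption: fully connected means ONLINE
    }

    // Group the reachability flags by service name.
    Map<String, List<Boolean>> byService = endpoints.stream()
        .collect(Collectors.groupingBy(
            e -> e.service,
            Collectors.mapping(e -> e.reachable, Collectors.toList())));

    // RFC wording: at least one socket per service is reachable.
    boolean everyServiceReachable = byService.values().stream()
        .allMatch(flags -> flags.contains(true));

    return everyServiceReachable ? Health.DEGRADED : Health.OFFLINE;
  }
}
```

Under this sketch, a cluster with one reachable KV endpoint and no reachable query endpoint is OFFLINE, whereas the current rule (more than one endpoint, at least one ONLINE for any service) would call it degraded.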

David Nault June 15, 2023 at 9:43 PM

Currently, a failed ping does not cause WaitUntilReady to fail, as long as the endpoint connection was established. Hmmm...

Fixed

Details

Created June 14, 2023 at 5:43 PM
Updated July 27, 2023 at 3:49 PM
Resolved July 27, 2023 at 3:49 PM