Loading...

XML

Word

Printable

Details

Type: New Feature
Resolution: Fixed
Priority: Major
Fix Version/s: 2.8.4
Affects Version/s: 2.7.5
Component/s: docs, library
Labels:
None

Epic Link:
Cluster Health Check
Sprint:
SDK 45: IPv6, HC, SDK 47: HC, Log Redact, SDK49: HC, Log Reda, CertAuth

Description

In some deployments, particularly cloud deployments that may have network setups that are beyond the user's control (ex.: Azure), a connection may be terminated after an amount of idle time. Since it may be terminated without actually sending a FIN, recovery can be troublesome.

Users have occasionally implemented a regular ping of a 'health check' by retrieving a single document. The problem with that is that the single document does not check the health of all connections for a cluster.

The request here is to add a health check function that would, for the current configuration and all existing open connections and all services, dispatch a NOOP request and verify the response.

If a connection is unhealthy (i.e., no response is received after a timeout), return that in the health check response. Optionally, schedule a reconnection and even drive that reconnection.

I think the signature here should probably be along the lines of:

request: health_check(void)

response =>

{"services": {"kv": [{"10.1.2.3": true}, {"10.1.2.4": false}],  "query": [{"10.1.2.3": true}]} "details": {"kv": [{"10.1.2.4": "NOOP (0x69) operation timed out after 2500µsec"}]}}

…but I'm flexible on this and it should be reviewed with others. My thought on the above is that it makes it easy to iterate for failures (is anything false?), and also easy to find details. Could have the details be inline with the failure though.