Loading...

XML

Word

Printable

Details

Type: New Feature
Resolution: Fixed
Priority: Major
Fix Version/s: 2.8.4
Affects Version/s: 2.7.5
Component/s: docs, library
Labels:
None

Epic Link:
Cluster Health Check
Sprint:
SDK 45: IPv6, HC, SDK 47: HC, Log Redact, SDK49: HC, Log Reda, CertAuth

Description

In some deployments, particularly cloud deployments that may have network setups that are beyond the user's control (ex.: Azure), a connection may be terminated after an amount of idle time. Since it may be terminated without actually sending a FIN, recovery can be troublesome.

Users have occasionally implemented a regular ping of a 'health check' by retrieving a single document. The problem with that is that the single document does not check the health of all connections for a cluster.

The request here is to add a health check function that would, for the current configuration and all existing open connections and all services, dispatch a NOOP request and verify the response.

If a connection is unhealthy (i.e., no response is received after a timeout), return that in the health check response. Optionally, schedule a reconnection and even drive that reconnection.

I think the signature here should probably be along the lines of:

request: health_check(void)

response =>

{"services": {"kv": [{"10.1.2.3": true}, {"10.1.2.4": false}],  "query": [{"10.1.2.3": true}]} "details": {"kv": [{"10.1.2.4": "NOOP (0x69) operation timed out after 2500µsec"}]}}

…but I'm flexible on this and it should be reviewed with others. My thought on the above is that it makes it easy to iterate for failures (is anything false?), and also easy to find details. Could have the details be inline with the failure though.

Attachments

Issue Links

blocks

JSCBC-390 add health check call to libuv io plugin

Resolved

NCBC-1573 add health check functionality

Resolved

PCBC-497 add health check function into lcb check

Resolved

PYCBC-412 add health check function into lcb check

Resolved

mentioned in: Page Loading...

Gerrit Reviews

- Issue Only
- Show All Reviews
- Show Open Reviews
- Show All Issues
- Show Open Issues

For Gerrit Dashboard: CCBC-801
#	Subject	Branch	Project	Status	CR	V
82065,4	CCBC-801: Implement NOOP command and 'cbc-pillowfight --noop'	master	libcouchbase	Status: MERGED	+2	+1
82313,4	CCBC-801: Fixes for NOOP handler	master	libcouchbase	Status: MERGED	+2	+1
82314,7	CCBC-801: lcb_ping3() for checking cluster health	master	libcouchbase	Status: MERGED	+2	+1
85780,12	CCBC-801: Expose information about network IO for monitoring	master	libcouchbase	Status: MERGED	+2	+1
85781,12	CCBC-801: allow to dump health report in cbc-proxy	master	libcouchbase	Status: MERGED	+2	+1
86904,2	CCBC-801: Rename lcb_health to lcb_diag	master	libcouchbase	Status: MERGED	+2	+1