Details
- Improvement
- Resolution: Unresolved
- Major
- None
- 5.1.1
Description
When a node fails over in Couchbase Server, detecting the event is neither intuitive nor informative. The most detail comes from the /pools/default endpoint, but interpreting it requires knowledge of Couchbase Server internals, which makes it hard to use for cluster monitoring.
For example, when a node is healthy we get the following status stanza:
"10.111.162.103:8091": {
    "status": "healthy",
    "clusterMembership": "active",
    "recoveryType": "none",
    "uptime": "8863"
}
The health status is best decoded from the status and clusterMembership fields.
When the node is first seen to be uncontactable, these fields change to healthy and inactiveFailed respectively:
"10.111.162.103:8091": {
    "status": "healthy",
    "clusterMembership": "inactiveFailed",
    "recoveryType": "none",
    "uptime": "9643"
}
The "status": "healthy" value here is somewhat misleading. Finally, after the autofailover of the node occurs, the fields show unhealthy and inactiveFailed:
"10.111.162.103:8091": {
    "status": "unhealthy",
    "clusterMembership": "inactiveFailed",
    "recoveryType": "none",
    "uptime": "9643"
}
From this information alone, however, it is not possible to tell whether the failover was automatic, or what triggered it (node timeout, or some other or future autofailover reason). Likewise, for a manual failover, knowing whether it was hard or graceful would be extremely useful.
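To illustrate how much interpretation a monitoring script currently has to do, here is a minimal sketch that decodes the three stanzas shown above. It assumes the stanzas have already been fetched and parsed into a dict keyed by "host:port"; the function names and the returned labels ("ok", "uncontactable", "failed_over", "unknown") are hypothetical and not part of Couchbase Server.

```python
from typing import Dict

Stanza = Dict[str, str]


def classify_node(stanza: Stanza) -> str:
    """Map the (status, clusterMembership) pair to an illustrative label.

    Only the three combinations reported in this ticket are recognised;
    anything else is returned as "unknown".
    """
    status = stanza.get("status")
    membership = stanza.get("clusterMembership")

    if status == "healthy" and membership == "active":
        return "ok"
    if status == "healthy" and membership == "inactiveFailed":
        # The misleading case: status is still "healthy".
        return "uncontactable"
    if status == "unhealthy" and membership == "inactiveFailed":
        return "failed_over"
    return "unknown"


def summarize(nodes: Dict[str, Stanza]) -> Dict[str, str]:
    """Classify every node in a mapping of "host:port" to stanza."""
    return {host: classify_node(stanza) for host, stanza in nodes.items()}


if __name__ == "__main__":
    sample = {
        "10.111.162.103:8091": {
            "status": "unhealthy",
            "clusterMembership": "inactiveFailed",
            "recoveryType": "none",
            "uptime": "9643",
        }
    }
    print(summarize(sample))
```

Note that nothing in this sketch can distinguish an automatic failover from a manual one, or report why it happened, because the stanza carries no such field.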
Perhaps the introduction of failoverType and failoverDetail fields could work together in this regard?
| Field | Proposed values |
|---|---|
| failoverType | manual / auto |
| failoverDetail | For a manual failover: hard / graceful. For an auto failover: the reason, such as nodeTimeout |
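As a rough sketch of how a monitoring client might use the proposed fields: failoverType and failoverDetail are only suggestions from this ticket and do not exist in the current stanza, and the function name and message wording below are hypothetical.

```python
from typing import Dict


def describe_failover(stanza: Dict[str, str]) -> str:
    """Render a human-readable failover cause from the PROPOSED fields.

    failoverType and failoverDetail are hypothetical: they are the fields
    suggested in this ticket, not fields Couchbase Server returns today.
    """
    failover_type = stanza.get("failoverType")
    detail = stanza.get("failoverDetail") or "unspecified"

    if failover_type is None:
        # Matches today's behaviour: the cause cannot be determined.
        return "failover cause unknown"
    if failover_type == "auto":
        return f"automatic failover, reason: {detail}"
    if failover_type == "manual":
        return f"manual failover, type: {detail}"
    return f"unrecognised failoverType: {failover_type}"


if __name__ == "__main__":
    print(describe_failover({"failoverType": "auto",
                             "failoverDetail": "nodeTimeout"}))
```

Keeping the two fields separate lets a client branch on the small, fixed set of failoverType values while treating failoverDetail as an open-ended string that can grow as new autofailover reasons are added.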