Details

Type: Improvement
Resolution: Unresolved
Priority: Major
Fix Version/s: None
Affects Version/s: 5.1.1
Component/s: ns_server, RESTful-APIs
Labels:
- monitoring

Description

When a node fails over in Couchbase Server, detecting the event is neither intuitive nor informative. The most detail comes from the /pools/default endpoint, but interpreting it requires knowledge of Couchbase Server internals, which makes it hard to use for monitoring a cluster.

For example, when a node is healthy we get the following status stanza:

    "10.111.162.103:8091": {
      "status": "healthy",
      "clusterMembership": "active",
      "recoveryType": "none",
      "uptime": "8863"

The health status is best decoded from the status and clusterMembership fields together.

When the node is first seen to be uncontactable, these fields change to healthy and inactiveFailed respectively:

    "10.111.162.103:8091": {
      "status": "healthy",
      "clusterMembership": "inactiveFailed",
      "recoveryType": "none",
      "uptime": "9643"

The "status": "healthy" value here is somewhat misleading. Finally, after the auto-failover of the node occurs, the fields show unhealthy and inactiveFailed:

    "10.111.162.103:8091": {
      "status": "unhealthy",
      "clusterMembership": "inactiveFailed",
      "recoveryType": "none",
      "uptime": "9643"

With this information alone, however, it cannot be determined whether the failover was automatic, or what triggered it (node timeout, or some other or future auto-failover reason). Likewise, for a manual failover, knowing the type of failover (hard or graceful) would be extremely useful.
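To illustrate what a monitoring client has to do today, here is a minimal Python sketch that decodes the three stanzas above. It assumes the node entry has already been parsed into a dict with the `status` and `clusterMembership` keys shown; the function name `classify_node` and the returned labels are hypothetical and not part of any Couchbase API.

```python
from typing import Mapping


def classify_node(node: Mapping[str, str]) -> str:
    """Map (status, clusterMembership) to a coarse label.

    The labels are illustrative only; the mapping follows the three
    stanzas quoted in this issue.
    """
    status = node.get("status")
    membership = node.get("clusterMembership")

    if status == "healthy" and membership == "active":
        return "ok"
    if status == "healthy" and membership == "inactiveFailed":
        # Node first seen as uncontactable; "healthy" is misleading here.
        return "uncontactable"
    if status == "unhealthy" and membership == "inactiveFailed":
        # State reported after the failover has occurred.
        return "failed_over"
    # Any other combination is not covered by the examples in this issue.
    return "unknown"


if __name__ == "__main__":
    example = {
        "status": "unhealthy",
        "clusterMembership": "inactiveFailed",
        "recoveryType": "none",
        "uptime": "9643",
    }
    print(classify_node(example))
```

Note that nothing in the return value can say whether the failover was automatic or manual, which is the gap this issue describes.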

Perhaps introducing two fields, failoverType and failoverDetail, that work together could address this?

| Field | Values |
| --- | --- |
| `failoverType` | `manual` / `auto` |
| `failoverDetail` | For a manual failover: `hard` / `graceful`. For an auto-failover: the reason, such as `nodeTimeout`. |
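As a rough sketch of how a client might consume the proposed fields, assuming they were added to the node stanza exactly as suggested above: `failoverType` and `failoverDetail` do not exist in the current API, and the helper name `describe_failover` and its message wording are hypothetical.

```python
from typing import Mapping, Optional


def describe_failover(node: Mapping[str, str]) -> Optional[str]:
    """Summarize the *proposed* failoverType / failoverDetail fields.

    Returns None when failoverType is absent, which is the case for
    every server version that does not implement this proposal.
    """
    failover_type = node.get("failoverType")
    if failover_type is None:
        return None

    detail = node.get("failoverDetail", "unspecified")
    if failover_type == "auto":
        # For auto-failover the detail carries the reason, e.g. nodeTimeout.
        return f"automatic failover (reason: {detail})"
    if failover_type == "manual":
        # For manual failover the detail carries hard or graceful.
        return f"manual failover ({detail})"
    return f"unrecognized failoverType {failover_type!r}"


if __name__ == "__main__":
    node = {
        "status": "unhealthy",
        "clusterMembership": "inactiveFailed",
        "failoverType": "auto",            # proposed field, hypothetical
        "failoverDetail": "nodeTimeout",   # proposed field, hypothetical
    }
    print(describe_failover(node))
```

Returning None when the field is missing lets the same monitoring code run against servers with and without the proposed fields.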

People

Assignee: Ajit Yagaty [X] (Inactive)

Reporter: Phil Stott (Inactive)

Votes: 0

Watchers: 2

Dates

Created: 18/Jun/18 11:19 AM

Updated: 18/Jun/18 11:19 AM

REST API: More detailed node health information to allow detecting of failover and reason
