  1. Couchbase Server
  2. MB-30158

REST API: More detailed node health information to allow detecting of failover and reason

Details

    • Improvement
    • Resolution: Unresolved
    • Major
    • None
    • 5.1.1
    • ns_server, RESTful-APIs

    Description

      When a node fails over in Couchbase Server, detecting the event is neither intuitive nor informative. The most detailed information comes from the /pools/default endpoint, but interpreting it requires an understanding of Couchbase Server internals, which makes it hard to use for monitoring a cluster.
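
      As a rough sketch of how a monitor might pull this detail, the snippet below fetches /pools/default and reduces it to per-node stanzas of the kind quoted in this ticket. It assumes the response carries a "nodes" array whose entries include "hostname", "status", "clusterMembership", "recoveryType" and "uptime", and that HTTP basic authentication on port 8091 is accepted; fetch_pools_default, node_stanzas and STANZA_FIELDS are illustrative names, not part of any Couchbase SDK.

```python
import base64
import json
import urllib.request


def fetch_pools_default(host, user, password, port=8091, timeout=5.0):
    """GET /pools/default from one node and return the parsed JSON body."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request = urllib.request.Request(
        f"http://{host}:{port}/pools/default",
        headers={"Authorization": f"Basic {token}"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Fields kept per node; assumed (not guaranteed) to exist in each "nodes" entry.
STANZA_FIELDS = ("status", "clusterMembership", "recoveryType", "uptime")


def node_stanzas(pools_default):
    """Reduce a /pools/default body to {hostname: {field: value}}."""
    return {
        node["hostname"]: {field: node.get(field) for field in STANZA_FIELDS}
        for node in pools_default.get("nodes", [])
    }
```

      Keeping the reduction separate from the HTTP call means the parsing can be exercised against a saved response without a live cluster.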

      For example, when a node is healthy we get the following status stanza:

          "10.111.162.103:8091": {
            "status": "healthy",
            "clusterMembership": "active",
            "recoveryType": "none",
            "uptime": "8863"
          }
      

      The health status is best decoded from the status and clusterMembership fields.

      When the node is first seen to be uncontactable, these change to healthy and inactiveFailed respectively:

          "10.111.162.103:8091": {
            "status": "healthy",
            "clusterMembership": "inactiveFailed",
            "recoveryType": "none",
            "uptime": "9643"
          }
      

      The "status": "healthy" value here is somewhat misleading. Finally, after the autofailover of the node occurs, the fields show unhealthy and inactiveFailed:

          "10.111.162.103:8091": {
            "status": "unhealthy",
            "clusterMembership": "inactiveFailed",
            "recoveryType": "none",
            "uptime": "9643"
          }
      

      With this information, however, it cannot be determined whether the failover was automatic, or what the reason for it was (node timeout, or some other or future autofailover reason). Likewise, for a manual failover, the failover type (hard or graceful) would be extremely useful.
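
      The decoding described above can be sketched as a small lookup over the two fields. The state labels ("ok", "uncontactable", "failed_over", "unknown") and the function name classify_node are invented for illustration; the mapping follows only the three stanzas quoted in this ticket and treats every other combination as unknown.

```python
def classify_node(stanza):
    """Map (status, clusterMembership) to a coarse, made-up state label."""
    status = stanza.get("status")
    membership = stanza.get("clusterMembership")

    if membership == "active" and status == "healthy":
        return "ok"
    if membership == "inactiveFailed" and status == "healthy":
        # Second stanza in the ticket: node first seen as uncontactable.
        return "uncontactable"
    if membership == "inactiveFailed" and status == "unhealthy":
        # Third stanza in the ticket: state after the failover.
        return "failed_over"
    # Any combination the ticket does not describe.
    return "unknown"
```

      Even with this mapping, "failed_over" says nothing about whether the failover was automatic or manual, which is the gap this ticket asks to close.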

      Perhaps the introduction of failoverType and failoverDetail fields could work together in this regard?

      failoverType    manual/auto
      failoverDetail  In the case of manual failover: hard/graceful
                      In the case of auto failover: a reason such as nodeTimeout
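
      To illustrate how a monitoring client might consume the proposed fields, here is a minimal sketch. failoverType and failoverDetail are only the suggestion made in this ticket and do not exist in the current API; describe_failover is a hypothetical helper, and the message wording is arbitrary.

```python
def describe_failover(stanza):
    """Summarize a failover using the fields PROPOSED in this ticket.

    failoverType and failoverDetail are hypothetical; returns None when
    the stanza carries no failoverType.
    """
    failover_type = stanza.get("failoverType")
    if failover_type is None:
        return None
    detail = stanza.get("failoverDetail", "unspecified")
    if failover_type == "auto":
        return f"automatic failover (reason: {detail})"
    if failover_type == "manual":
        return f"manual failover ({detail})"
    return f"failover of unrecognized type {failover_type!r}"
```

      With fields like these, an alert could state the cause directly instead of inferring it from status and clusterMembership.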

          People

            ajit.yagaty Ajit Yagaty [X] (Inactive)
            phil.stott Phil Stott (Inactive)