Couchbase Server / MB-61476

Investigate if we can determine if a node not yet marked warmed by the janitor is truly healthy


Details

    • Type: Task
    • Resolution: Unresolved
    • Priority: Major
    • Fix Version: backlog
    • Affects Version: master
    • Component: ns_server
    • Labels: None

    Description

      Issue

      In a recent issue, a failure to shut down a bucket on one node during a configuration change caused all nodes in the cluster to enter a degraded mode, because the bucket was not marked warmed (write traffic was not enabled) by the janitor. Auto-failover could take no meaningful action: it considered all nodes unhealthy and hit a safety check preventing failover of all nodes (sketched below), and the janitor could not progress until all nodes were healthy.
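      As a rough illustration of that safety property, a minimal Python sketch (names here are hypothetical; the real logic in ns_server is Erlang and more involved):

          def safe_to_failover(unhealthy_nodes, cluster_nodes):
              # Auto-failover refuses to act when every node appears
              # unhealthy, since failing over the whole cluster is never safe.
              return 0 < len(unhealthy_nodes) < len(cluster_nodes)

      With every node reporting unhealthy, this returns false and auto-failover stands down.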

      MB-4030 was filed some time ago to consider partial janitoring, which would have helped this issue, but in an ideal world auto-failover could perhaps have also rectified this scenario by failing over the one unhealthy node.

      Current State/Considerations

      First, an outline of the two "health" statuses that interest us:

      1) Auto-failover health monitoring (kv_monitor) - this monitor checks for DCP traffic to some bucket and, in the absence of DCP connections or any indication that one is unhealthy, checks whether memcached is "warmed". This first checks whether ns_memcached is in the "warmed" state and, if it is, checks whether memcached warmup has actually completed (ep_warmup_thread = complete).

      2) UI servers monitoring ("pools/default" nodes status field) - checks whether memcached is "warmed" via the same two steps: first whether ns_memcached is in the "warmed" state and, if it is, whether memcached warmup has actually completed (ep_warmup_thread = complete). A sketch of this shared check follows.
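      To make the shared check concrete, a minimal Python sketch (the actual ns_server logic is Erlang; the function and parameter names here are illustrative assumptions, not the real API):

          def node_considered_warmed(ns_memcached_state, ep_warmup_thread):
              # Step 1: ns_memcached must have reached its "warmed" state,
              # which only happens after the janitor marks the bucket warmed.
              if ns_memcached_state != "warmed":
                  return False
              # Step 2: memcached's own warmup must also have completed,
              # as reported by the ep_warmup_thread stat.
              return ep_warmup_thread == "complete"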

      In the issue hit recently, all nodes were considered unhealthy, as ns_memcached had not yet entered the "warmed" state. It does so only after the bucket has been marked warmed by the janitor. ns_memcached would have remained in the "connected" state.

      Naive "Solution"

      Naively, we could relax this "warmed" check in ns_memcached to pass on "warmed" OR "connected" and then rely on the warmup state as seen by memcached itself (a sketch of the relaxed check follows). Applied to the recent issue, all nodes bar the unhealthy one would have shown green in the UI (not ideal, as those nodes would not have been able to serve traffic), and auto-failover would have considered only the genuinely unhealthy node unhealthy and would have attempted to fail over just that one node (which is good, as that should be possible provided the cluster was previously in a balanced/healthy state).
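      A sketch of the relaxed check (same caveats as above: illustrative Python, not the actual implementation):

          def node_considered_warmed_relaxed(ns_memcached_state, ep_warmup_thread):
              # Accept "connected" as well as "warmed", then defer to
              # memcached's own view of whether warmup has completed.
              if ns_memcached_state not in ("warmed", "connected"):
                  return False
              return ep_warmup_thread == "complete"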

      Obviously the issue with the above is that the UI would display nodes that have not yet had traffic enabled as healthy, even though they cannot yet serve writes. The two checks could perhaps be decoupled, allowing the UI to continue to display all such nodes as yellow (unhealthy). However, there is still a potential issue with this change with regard to auto-failover. In a different scenario the janitor may attempt to mark the bucket as warmed on all nodes but fail on one of them. Should this happen, the memcached state would report that node as healthy even though traffic has not been enabled on it, and no auto-failover would occur (which is not desirable, as auto-failover may previously have corrected the issue). The snippet below illustrates this failure mode.
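      Reusing node_considered_warmed_relaxed from the previous sketch (node names and states here are hypothetical):

          # The janitor failed to mark n2 warmed, so write traffic is still
          # disabled there, yet memcached's warmup has completed on all nodes.
          nodes = {
              "n0": ("warmed", "complete"),
              "n1": ("warmed", "complete"),
              "n2": ("connected", "complete"),  # janitor failed to enable traffic
          }
          unhealthy = [name for name, (state, warmup) in nodes.items()
                       if not node_considered_warmed_relaxed(state, warmup)]
          assert unhealthy == []  # nothing to fail over, despite n2 being degraded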

      More thought is required on this; there may be some solution that could identify whether a node is truly unhealthy in this scenario and, as such, consider only the truly unhealthy nodes for auto-failover.

      Attachments


        Activity

          People

            Assignee: ben.huddleston Ben Huddleston
            Reporter: ben.huddleston Ben Huddleston
            Votes: 0
            Watchers: 5

            Dates

              Created:
              Updated:

              Gerrit Reviews

                There are no open Gerrit changes
