
Details

    Description

      When a health check reports an Alert status we surface it prominently in the dashboards, but we do nothing visible when the check itself fails to run and reports an error. We log the error and increment a counter (multimanager_checker_errored), which triggers an Alertmanager alert, but we don't highlight the failure anywhere else: the checker doesn't appear in the UI at all, and nothing shows up in Grafana. This gives a false sense of security when the cluster might be having issues bad enough that checkers are failing to run altogether.
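
      For reference, the error path today amounts to something like the sketch below. This assumes the checker code uses prometheus/client_golang; only the metric name multimanager_checker_errored comes from this issue, everything else is illustrative.

          package main

          import (
              "errors"
              "log"

              "github.com/prometheus/client_golang/prometheus"
          )

          // multimanager_checker_errored is the existing counter; the surrounding
          // code is a stand-in for the real checker plumbing.
          var checkerErrored = prometheus.NewCounter(prometheus.CounterOpts{
              Name: "multimanager_checker_errored",
              Help: "Number of times a health checker failed to run.",
          })

          func runChecker() error {
              // Stand-in for a real checker; imagine the memcached stats call failing here.
              return errors.New("stats reply truncated")
          }

          func main() {
              prometheus.MustRegister(checkerErrored)
              if err := runChecker(); err != nil {
                  // Today this is all that happens: a log line and a counter bump that
                  // only Alertmanager reacts to; nothing marks the checker as errored
                  // in the UI or in Grafana.
                  log.Printf("checker errored: %v", err)
                  checkerErrored.Inc()
              }
          }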

      I encountered this when testing CMOS-154 (long DCP names): the names I generated were so long that memcached failed to send the stats reply, which meant the checker failed to run, yet the failure wasn't surfaced anywhere user-facing.

      Ideas:

      • Add a stat on the cluster overview dashboard to show health check errors
      • Possibly complement the checker_errored counter with an "errors on last run" gauge, which may be a better fit for that stat (the counter is hard to interpret without the context of the heartbeat interval, since it can lump several runs' worth of failures together); see the sketch after this list
      • Expose errored as its own checker status, which will make it show up in the dashboards alongside the checker results
      • Once self-log ingestion is back (CMOS-159), we can have a logs panel showing the errors
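
      Building on the previous sketch, the "errors on last run" gauge from the second idea could sit next to the counter like this (the gauge name multimanager_checker_last_run_errored, the per-checker label, and the recordCheckerRun helper are all hypothetical):

          package main

          import (
              "errors"

              "github.com/prometheus/client_golang/prometheus"
          )

          var (
              // Existing counter, here with a per-checker label for illustration;
              // kept for alerting on any failure over time.
              checkerErrored = prometheus.NewCounterVec(prometheus.CounterOpts{
                  Name: "multimanager_checker_errored",
                  Help: "Total number of times a health checker failed to run.",
              }, []string{"checker"})

              // Hypothetical gauge: 1 if the checker errored on its most recent run, else 0.
              checkerLastRunErrored = prometheus.NewGaugeVec(prometheus.GaugeOpts{
                  Name: "multimanager_checker_last_run_errored",
                  Help: "Whether the checker errored on its most recent run.",
              }, []string{"checker"})
          )

          // recordCheckerRun would be called once per checker per heartbeat.
          func recordCheckerRun(name string, runErr error) {
              if runErr != nil {
                  checkerErrored.WithLabelValues(name).Inc()
                  checkerLastRunErrored.WithLabelValues(name).Set(1)
                  return
              }
              // Reset on success so a dashboard stat built on this gauge reflects only
              // the latest heartbeat, regardless of how many runs' worth of failures
              // the counter has accumulated.
              checkerLastRunErrored.WithLabelValues(name).Set(0)
          }

          func main() {
              prometheus.MustRegister(checkerErrored, checkerLastRunErrored)

              // Example heartbeat: one checker errors, another succeeds.
              recordCheckerRun("dcpNames", errors.New("stats reply truncated"))
              recordCheckerRun("memoryQuota", nil)
          }

      A cluster overview stat could then sum the gauge (e.g. sum(multimanager_checker_last_run_errored) in PromQL) to show how many checkers errored on the latest run, independent of how long the counter has been accumulating.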


          People

            Assignee: Unassigned
            Reporter: Marks Polakovs (Inactive)
            Votes: 0
            Watchers: 4
