Details
- Improvement
- Resolution: Unresolved
- Major
- None
Description
When a health check reports an Alert status, we surface it prominently in the dashboards. However, we do nothing visible when the check fails to run and reports an error: we log the error and increment a counter (multimanager_checker_errored), which triggers an Alertmanager alert, but we don't highlight it anywhere else - not in the UI (the checker doesn't appear at all) and not in Grafana. This is a problem because it gives you a false sense of security when your cluster might be having issues bad enough that checkers are failing to run altogether.
I encountered this while testing CMOS-154 (long DCP names): the names I got were so long that memcached failed to send the stats reply, which meant the checker failed to run, but this wasn't surfaced anywhere user-facing.
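For context, the only signal today is the Alertmanager alert on the counter. A rule of roughly this shape would produce it (the alert name, window, and threshold here are assumptions for illustration, not the actual config):

```yaml
# Hypothetical Prometheus alerting rule - the real rule name and window may differ.
- alert: HealthCheckerErrored
  expr: increase(multimanager_checker_errored[10m]) > 0
  annotations:
    summary: "One or more health checkers failed to run"
```

Nothing in the UI or Grafana is driven by this, which is the gap described above.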
Ideas:
- Add a stat on the cluster overview dashboard to show health check errors
- Possibly complement the checker_errored counter with an "errors on last run" gauge, which may be a better fit for the stat above: on its own, the counter is hard to interpret without the context of the heartbeat interval, since it may lump together several runs' worth of failures
- Expose errored as its own checker status, which will make it show up in the dashboards alongside the checker results
- Once self-log ingestion is back (CMOS-159), we can have a logs panel showing the errors
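The counter-vs-gauge distinction from the second idea can be sketched as follows (a minimal illustration, not the actual multimanager code; names and heartbeat values are made up):

```go
package main

import "fmt"

// checkerMetrics sketches the two proposed signals: a cumulative counter
// (like multimanager_checker_errored) and a per-run gauge.
type checkerMetrics struct {
	erroredTotal   int // counter: only ever goes up across heartbeats
	erroredLastRun int // gauge: overwritten on every run
}

// recordRun is called once per heartbeat with the number of checker errors
// seen during that run.
func (m *checkerMetrics) recordRun(errs int) {
	m.erroredTotal += errs
	m.erroredLastRun = errs
}

func main() {
	var m checkerMetrics
	// Three heartbeat runs: two failing, then a clean one.
	for _, errs := range []int{2, 3, 0} {
		m.recordRun(errs)
	}
	// The counter still shows 5 accumulated errors, while the gauge
	// shows the cluster is currently healthy (0 on the last run).
	fmt.Println(m.erroredTotal, m.erroredLastRun)
}
```

This is why the gauge may be more useful on an overview panel: it answers "is anything failing right now?" without needing the heartbeat interval to interpret it.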