Details
- Improvement
- Resolution: Unresolved
- Major
- None
Description
When a health check reports an Alert status, we surface it prominently in the dashboards. However, we do nothing visible when the check fails to run and reports an error: we log the error and increment a counter (multimanager_checker_errored), which triggers an Alertmanager alert, but we don't highlight it anywhere else - not in the UI (the checker doesn't appear at all) and not in Grafana. This is a problem because it gives you a false sense of security when your cluster might be having issues bad enough that checkers are failing to run altogether.
I encountered this while testing CMOS-154 (long DCP names): the names I got were so long that memcached failed to send the stats reply, which meant the checker failed to run, but this wasn't surfaced anywhere user-facing.
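For context, the only signal today is the Alertmanager alert on the counter. A rule of roughly this shape would produce it (the alert name, window, and threshold here are assumptions for illustration, not the actual config):

```yaml
# Hypothetical Prometheus alerting rule - the real rule name and window may differ.
- alert: HealthCheckerErrored
  expr: increase(multimanager_checker_errored[10m]) > 0
  annotations:
    summary: "One or more health checkers failed to run"
```

Nothing in the UI or Grafana is driven by this, which is the gap described above.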
Ideas:
- Add a stat on the cluster overview dashboard to show health check errors
- Possibly complement the checker_errored counter with an "errors on last run" gauge, which may be a better fit for the stat above: on its own, the counter is hard to interpret without the context of the heartbeat interval, since it may lump together several runs' worth of failures
- Expose errored as its own checker status, which will make it show up in the dashboards alongside the checker results
- Once self-log ingestion is back (CMOS-159), we can have a logs panel showing the errors
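The counter-vs-gauge distinction from the second idea can be sketched as follows (a minimal illustration, not the actual multimanager code; names and heartbeat values are made up):

```go
package main

import "fmt"

// checkerMetrics sketches the two proposed signals: a cumulative counter
// (like multimanager_checker_errored) and a per-run gauge.
type checkerMetrics struct {
	erroredTotal   int // counter: only ever goes up across heartbeats
	erroredLastRun int // gauge: overwritten on every run
}

// recordRun is called once per heartbeat with the number of checker errors
// seen during that run.
func (m *checkerMetrics) recordRun(errs int) {
	m.erroredTotal += errs
	m.erroredLastRun = errs
}

func main() {
	var m checkerMetrics
	// Three heartbeat runs: two failing, then a clean one.
	for _, errs := range []int{2, 3, 0} {
		m.recordRun(errs)
	}
	// The counter still shows 5 accumulated errors, while the gauge
	// shows the cluster is currently healthy (0 on the last run).
	fmt.Println(m.erroredTotal, m.erroredLastRun)
}
```

This is why the gauge may be more useful on an overview panel: it answers "is anything failing right now?" without needing the heartbeat interval to interpret it.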