Details
Type: Task
Resolution: Done
Priority: Major
Description
Currently, all cluster monitor alerting goes through six generic Prometheus alert rules, one for each combination of target level (cluster/node/bucket) and severity (warn/alert). This is great for deduplication, but it means we're limited to the metadata provided by the Prometheus metric labels - namely the name/ID of the failing health check and the cluster/node/bucket it targets.
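For illustration, one of those six generic rules might look roughly like the sketch below. The metric name (cluster_checker_status) and its label set are assumptions for the sake of the example, not the actual names exported by cbmultimanager:

```yaml
# Minimal sketch of one of the six generic rules (assumed metric/label names).
groups:
  - name: cluster-monitor-generic
    rules:
      - alert: ClusterCheckerWarning
        expr: cluster_checker_status{severity="warn"} == 1
        labels:
          severity: warning
        annotations:
          # Only label values can be interpolated here: the checker's
          # name/ID and the cluster it targets.
          summary: 'Checker {{ $labels.name }} is warning on cluster {{ $labels.cluster }}'
```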
Ideally we'd have bespoke rules for each checker, each with a summary and remediation steps in the alert itself, instead of having to link out to either the cluster monitor GUI or the documentation.
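A bespoke rule for a single checker could then carry its own text inline, something like the following sketch (same assumed metric/label names as above; the remediation wording is illustrative only):

```yaml
# Sketch of a bespoke rule for one checker, under the same groups/rules
# structure as the generic example above.
- alert: CBNoIndexRedundancy
  expr: cluster_checker_status{name="noIndexRedundancy"} == 1
  labels:
    severity: warning
  annotations:
    summary: 'Cluster {{ $labels.cluster }} has one or more indexes with no redundancy'
    remediation: 'Add index replicas (or equivalent indexes) on another node so that losing a single index node does not take the index offline.'
```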
This will, however, mean maintaining (currently) 33+ individual alert definitions. Personally I think the UX benefit makes it worthwhile, but it'd make sense to spend a little time looking at whether this can be automated, for example by code analysis or by extracting the alert definitions into a JSON file or similar (see the sketch below).
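If the definitions were pulled out into a data file, a generator could emit the per-checker rules from entries like this. The file shape here is entirely hypothetical, just to show the idea:

```yaml
# checkers.yaml - hypothetical source of truth from which per-checker
# alert rules could be generated, one entry per checker, instead of
# maintaining 33+ rule definitions by hand.
- id: noIndexRedundancy
  severity: warning
  summary: 'Cluster {{ $labels.cluster }} has one or more indexes with no redundancy'
  remediation: 'Add index replicas or equivalent indexes on another node.'
```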
The limitation of this approach is that we can only interpolate the metadata available through labels - so we could have the name of the cluster that failed a check, but not e.g. the names of the specific indexes that failed noIndexRedundancy. As far as I'm aware, the only way to get that without packing per-checker metadata into Prometheus labels (which is not only a horrible anti-pattern, but also introduces unbounded time series cardinality) would be to push our own alerts to Alertmanager directly - likely something to consider post-0.1.
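For reference, pushing our own alerts would mean POSTing alert objects to Alertmanager's v2 API (/api/v2/alerts). A rough sketch of one such alert, written in YAML like the other examples (the endpoint itself consumes the JSON equivalent); the cluster and index names are made up:

```yaml
# One alert object as it would be POSTed (as JSON) to /api/v2/alerts.
- labels:
    alertname: CBNoIndexRedundancy
    cluster: my-cluster            # hypothetical cluster name
    severity: warning
  annotations:
    # Free-form annotations can carry per-checker detail (e.g. the
    # failing index names) without adding any time series cardinality.
    summary: 'Indexes idx_foo and idx_bar have no redundancy'
  startsAt: '2022-01-01T00:00:00Z'
```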
For Gerrit Dashboard: CMOS-190

# | Subject | Branch | Project | Status | CR | V
---|---|---|---|---|---|---
166828,7 | CMOS-190 Write bespoke alert rules for each checker | master | cbmultimanager | MERGED | +2 | +1
166933,2 | CMOS-190 Fix use of incorrect Prometheus metrics | master | cbmultimanager | MERGED | +2 | +1