Description
After spinning up examples/kubernetes and adding the cluster to the Cluster Monitor, I immediately get 13 alerts. That's too many.
- One is the dead man's switch (always firing; used to verify that Prometheus can talk to Alertmanager), so 12 remain
- 12 NodeCheckerWarnings
- Node-level: oneServicePerNode, nodeSwapUsage, sharedFilesystems, and serviceStatus, each firing once per node
All of those are legitimate health check warnings (I suspect serviceStatus may be flaky, but that's a CMOS issue for another time), but I'm not convinced they all need to be fired into Slack/email right from the get-go. In my view, things that ping people (and possibly wake them up at night) should be reserved for "your cluster is on fire and you need to fix this" situations. The bulk of these 12 are "this isn't great, but your cluster will still be usable" situations - those can be prominently exposed in the Grafana dashboards, but shouldn't necessarily trigger a Slack ping.
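One way to keep the noisy checks out of Slack while still surfacing them in Grafana would be an Alertmanager routing tree that only forwards the most severe tier to a notifying receiver. A minimal sketch, assuming alerts carry a `severity` label; the receiver names, channel, and label values here are illustrative, not our actual config:

```yaml
# alertmanager.yml (sketch): only severity="alert" reaches Slack;
# everything else falls through to a receiver with no notification
# config, so it stays visible on dashboards but never pings anyone.
route:
  receiver: dashboard-only        # default: no notification sent
  routes:
    - matchers:
        - severity = alert        # "cluster on fire" tier only
      receiver: slack-oncall
receivers:
  - name: dashboard-only          # no *_configs: silently dropped
  - name: slack-oncall
    slack_configs:
      - channel: '#cluster-alerts'
        send_resolved: true
```

(On Alertmanager versions before 0.22, `matchers:` would need to be the older `match:` map form.)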
Perhaps at some point down the line we need to add a third tier between "warn" and "alert" - have the scale go from "this isn't great, but you can live with it" (e.g. oneServicePerNode), to "you should look at this at some point soon" (e.g. indexWithNoRedundancy or residentRatioTooLow at the "warn" threshold), all the way up to "you need to fix this now" (e.g. missingActiveVBuckets or critically low disk space). Perhaps this can be addressed as part of CMOS-82. For now, though, we need to work within the constraints of the existing scale (good/warn/alert/info/missing).
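Within the existing scale, one option is to encode the tier as a label on each Prometheus alerting rule, so that routing and dashboards can key off it. A hedged sketch only: the metric name `cmos_checker_status` and the expressions are assumptions for illustration, not the real checker exporter schema:

```yaml
# Hypothetical rule file showing two tiers via the severity label.
groups:
  - name: node-checkers
    rules:
      - alert: OneServicePerNode
        expr: cmos_checker_status{checker="oneServicePerNode"} > 0
        labels:
          severity: warn     # dashboard-only tier, no Slack ping
      - alert: MissingActiveVBuckets
        expr: cmos_checker_status{checker="missingActiveVBuckets"} > 0
        labels:
          severity: alert    # pageable "fix this now" tier
```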
Filing this Epic to begin discussing what we want alerts to look like for 0.1, with eventual sub-tasks for each actionable point.
Issue Links
- relates to CMOS-82 "Review checker status meanings and integer values" (Done)