Description
After spinning up examples/kubernetes and adding the cluster to the Cluster Monitor, I immediately get 13 alerts. That's too many.
- One is the dead man's switch (always firing; used to verify that Prometheus can talk to Alertmanager), so 12 remain
- 12 NodeCheckerWarnings
- Node-level: oneServicePerNode, nodeSwapUsage, sharedFilesystems, and serviceStatus, each firing once per node
All of those are legitimate health check warnings (I suspect serviceStatus may be flaky, but that's a CMOS issue for another time), but I'm not convinced they all need to be fired into Slack/email right from the get-go. In my view, things that ping people (and possibly wake them up at night) should be reserved for "your cluster is on fire and you need to fix this" situations. The bulk of these 12 are "this isn't great, but your cluster will still be usable" situations - those can be prominently exposed in the Grafana dashboards, but shouldn't necessarily trigger a Slack ping.
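One way to keep the noisy checks out of Slack while still surfacing them in Grafana would be an Alertmanager routing tree that only forwards the most severe tier to a notifying receiver. A minimal sketch, assuming alerts carry a `severity` label; the receiver names, channel, and label values here are illustrative, not our actual config:

```yaml
# alertmanager.yml (sketch): only severity="alert" reaches Slack;
# everything else falls through to a receiver with no notification
# config, so it stays visible on dashboards but never pings anyone.
route:
  receiver: dashboard-only        # default: no notification sent
  routes:
    - matchers:
        - severity = alert        # "cluster on fire" tier only
      receiver: slack-oncall
receivers:
  - name: dashboard-only          # no *_configs: silently dropped
  - name: slack-oncall
    slack_configs:
      - channel: '#cluster-alerts'
        send_resolved: true
```

(On Alertmanager versions before 0.22, `matchers:` would need to be the older `match:` map form.)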
Perhaps at some point down the line we need to add a third tier between "warn" and "alert" - have the scale go from "this isn't great, but you can live with it" (e.g. oneServicePerNode), to "you should look at this at some point soon" (e.g. indexWithNoRedundancy or residentRatioTooLow at the "warn" threshold), all the way up to "you need to fix this now" (e.g. missingActiveVBuckets or critically low disk space). Perhaps this can be addressed as part of CMOS-82. For now, though, we need to work within the constraints of the existing scale (good/warn/alert/info/missing).
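Within the existing scale, one option is to encode the tier as a label on each Prometheus alerting rule, so that routing and dashboards can key off it. A hedged sketch only: the metric name `cmos_checker_status` and the expressions are assumptions for illustration, not the real checker exporter schema:

```yaml
# Hypothetical rule file showing two tiers via the severity label.
groups:
  - name: node-checkers
    rules:
      - alert: OneServicePerNode
        expr: cmos_checker_status{checker="oneServicePerNode"} > 0
        labels:
          severity: warn     # dashboard-only tier, no Slack ping
      - alert: MissingActiveVBuckets
        expr: cmos_checker_status{checker="missingActiveVBuckets"} > 0
        labels:
          severity: alert    # pageable "fix this now" tier
```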
Filing this Epic to begin discussing what we want alerts to look like for 0.1, with eventual sub-tasks for each actionable point.
Issue Links
- relates to CMOS-82 "Review checker status meanings and integer values" (Done)