Details

    • Epic
    • Resolution: Done
    • Major
    • 0.1
    • None
    • cmos
    • None
    • 0.1 Alerts

    Description

      After spinning up examples/kubernetes and adding the cluster to the Cluster Monitor, I immediately get 13 alerts. That's too many.

      • One is the dead man's switch (always firing, used to check Prometheus can talk to Alertmanager), so 12 remaining
      • 12 NodeCheckerWarnings
        • Node-level: oneServicePerNode, nodeSwapUsage, sharedFilesystems, serviceStatus x number of nodes

      All of those are legitimate health check warnings (I have a suspicion serviceStatus may be flaky, but that's a CMOS for another time), but I'm not convinced they all need to be fired into Slack/email right from the get-go. In my view, things that ping people (and possibly wake them up at night) should be "your cluster is on fire and you need to fix this" situations, while the bulk of these warnings are "this isn't great, but your cluster will still be usable" situations. Those can be prominently exposed in the Grafana dashboards, but perhaps do not necessitate a Slack ping.

      Perhaps at some point down the line we need to add a third tier between "warn" and "alert", so that the scale goes from "this isn't great, but you can live with it" (e.g. oneServicePerNode), to "you should look at this at some point soon" (e.g. indexWithNoRedundancy or residentRatioTooLow at the "warn" threshold), all the way up to "you need to fix this now" (e.g. missingActiveVBuckets or critically low disk space). Perhaps this can be addressed as part of CMOS-82. For now though, we need to work within the constraints of the existing scale (good/warn/alert/info/missing).
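
      To make the proposed scale concrete, here is a minimal sketch of how checkers could be mapped to three tiers, with only the top tier pinging people. It is an illustration only and not how CMOS currently works: the `Tier` enum, the `CHECKER_TIERS` table, the `should_notify` helper, and the default tier for unlisted checkers are all hypothetical. Only the checker names and the tier descriptions come from the text above.

```python
from enum import IntEnum


class Tier(IntEnum):
    """Hypothetical three-tier scale, ordered from least to most urgent."""
    LIVE_WITH_IT = 1  # "this isn't great, but you can live with it"
    LOOK_SOON = 2     # "you should look at this at some point soon"
    FIX_NOW = 3       # "you need to fix this now"


# Assumed mapping, using the example checkers named in the description.
# residentRatioTooLow is placed here as it would be at its "warn" threshold.
CHECKER_TIERS = {
    "oneServicePerNode": Tier.LIVE_WITH_IT,
    "indexWithNoRedundancy": Tier.LOOK_SOON,
    "residentRatioTooLow": Tier.LOOK_SOON,
    "missingActiveVBuckets": Tier.FIX_NOW,
}


def should_notify(checker, threshold=Tier.FIX_NOW, default=Tier.LOOK_SOON):
    """Return True if a firing checker should ping Slack/email.

    Checkers below the threshold would still appear on dashboards;
    they just would not notify anyone. Unlisted checkers fall back to
    `default`, which is an arbitrary choice made for this sketch.
    """
    return CHECKER_TIERS.get(checker, default) >= threshold


if __name__ == "__main__":
    for name in CHECKER_TIERS:
        action = "notify" if should_notify(name) else "dashboard only"
        print(f"{name}: {action}")
```

      Using an ordered enum keeps the "how urgent is this" decision separate from the "where does it go" decision, so the notification threshold could be lowered later without touching the per-checker mapping.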

      Filing this Epic to begin discussing what we want alerts to look like for 0.1, with eventual sub-tasks for each actionable point.

            People

              marks.polakovs Marks Polakovs (Inactive)