Couchbase Monitoring and Observability Stack / CMOS-190

Create alert definitions for each Cluster Monitor checker

Details

    • Task
    • Resolution: Done
    • Major
    • 0.1
    • None
    • cluster-monitor
    • None

    Description

      Currently all cluster monitor alerts are raised through six Prometheus alerts, one for each combination of target (cluster/node/bucket) and severity (warn/alert). This is great for deduplication, but it means we're limited to the metadata provided on the Prometheus metric labels, namely the name/ID of the failing health check and the cluster/node/bucket it targets.
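
      To illustrate the current shape, here is a minimal sketch that enumerates the six generic rules. The metric name, alert names, and the `name` label are hypothetical placeholders, not the real cluster monitor identifiers.

```python
from itertools import product

TARGETS = ("cluster", "node", "bucket")
SEVERITIES = ("warn", "alert")


def generic_rules():
    """Build one generic rule per (target, severity) pair: six in total."""
    rules = []
    for target, severity in product(TARGETS, SEVERITIES):
        # Hypothetical metric name; the real cluster monitor metric may differ.
        metric = f"hypothetical_cmos_{target}_checker_status"
        rules.append({
            "alert": f"CMOS_{target}_{severity}",  # hypothetical alert name
            "expr": f'{metric}{{severity="{severity}"}} > 0',
            "labels": {"severity": severity},
            "annotations": {
                # Only label values can be interpolated here, so the text
                # has to stay generic across every checker.
                "summary": "Health check {{ $labels.name }} failed on "
                           + target + " {{ $labels." + target + " }}",
            },
        })
    return rules


if __name__ == "__main__":
    for rule in generic_rules():
        print(rule["alert"], "->", rule["expr"])
```

      Because every checker shares one of these six rules, none of them can carry checker-specific summary or remediation text.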

      Ideally we'd be able to have bespoke rules for each checker, with a summary and remediation in the alert itself, instead of having to link out to either the cluster monitor GUI or to the documentation.

      This will, however, mean having (currently) 33+ individual alert definitions. Personally I think the UX benefit makes it worthwhile, but it'd make sense to spend a little time looking at whether this can be automated in any way, for example by code analysis or by extracting the alert definitions into a JSON file or similar.
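
      One way the JSON-file option could work is sketched below: a small catalogue of checker definitions is turned into one Prometheus alerting rule per checker. The JSON schema, the metric name, the `id` label, and the summary/remediation wording are all assumptions made for illustration; only the checker name noIndexRedundancy comes from this ticket, and treating it as a cluster-level warning is a guess.

```python
import json

# Hypothetical catalogue; in practice this would live in its own JSON file.
# The summary and remediation text here is illustrative, not the real wording.
CHECKER_DEFINITIONS = json.loads("""
[
  {
    "id": "noIndexRedundancy",
    "target": "cluster",
    "severity": "warn",
    "summary": "One or more indexes have no replicas.",
    "remediation": "Add at least one replica to each affected index."
  }
]
""")


def build_rule(defn):
    """Turn one checker definition into a Prometheus alerting rule dict."""
    checker_id = defn["id"]
    target = defn["target"]
    # Hypothetical metric name; substitute the real cluster monitor metric.
    metric = f"hypothetical_cmos_{target}_checker_status"
    return {
        "alert": f"CMOS_{checker_id}",
        "expr": f'{metric}{{id="{checker_id}"}} > 0',
        "labels": {"severity": defn["severity"]},
        "annotations": {
            # Label interpolation still works for the target name.
            "summary": defn["summary"] + " (" + target
                       + ": {{ $labels." + target + " }})",
            "remediation": defn["remediation"],
        },
    }


def build_rule_file(definitions):
    """Wrap the generated rules in the groups/rules layout of a rule file."""
    return {
        "groups": [
            {
                "name": "cluster-monitor-checkers",  # hypothetical group name
                "rules": [build_rule(d) for d in definitions],
            }
        ]
    }


if __name__ == "__main__":
    # Serialising to YAML is left out to keep the sketch stdlib-only.
    print(json.dumps(build_rule_file(CHECKER_DEFINITIONS), indent=2))
```

      With something like this, adding a checker would mean adding one JSON entry rather than hand-writing another rule.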

      The limitation of this approach is that we can only interpolate the metadata available through labels: we could have the name of the cluster that failed a check, but not e.g. the names of the specific indexes that failed noIndexRedundancy. As far as I'm aware, the only way to do this without packing per-checker metadata into Prometheus labels (which is not only a horrible anti-pattern, but also introduces unbounded time series cardinality) would be to push our own alerts to Alertmanager, which is likely something to consider post-0.1.
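
      A rough sketch of what pushing our own alerts could look like, assuming Alertmanager's v2 HTTP API (`POST /api/v2/alerts`). The function names, label set, and the `affected` annotation are hypothetical. The point is that per-checker detail goes into annotations, which are not part of an alert's identity, rather than into labels.

```python
import json
import urllib.request
from datetime import datetime, timezone


def build_alert(checker_id, cluster, affected, starts_at=None):
    """Build one alert in the shape accepted by Alertmanager's v2 API."""
    starts_at = starts_at or datetime.now(timezone.utc)
    return {
        # Labels identify the alert, so keep them low-cardinality.
        "labels": {
            "alertname": checker_id,
            "cluster": cluster,
            "severity": "warn",  # assumed severity for this example
        },
        # Free-form detail (e.g. index names) lives in annotations instead.
        "annotations": {
            "summary": f"{checker_id} failed on cluster {cluster}",
            "affected": ", ".join(affected),  # hypothetical annotation key
        },
        "startsAt": starts_at.isoformat(),
    }


def push_alerts(base_url, alerts):
    """POST a list of alerts to Alertmanager and return the HTTP status."""
    request = urllib.request.Request(
        f"{base_url}/api/v2/alerts",
        data=json.dumps(alerts).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status


if __name__ == "__main__":
    alert = build_alert("noIndexRedundancy", "example-cluster", ["idx_a", "idx_b"])
    # Print the payload only; call push_alerts() against a real Alertmanager.
    print(json.dumps([alert], indent=2))
```

      The trade-off is that the cluster monitor would then own alert lifecycle (re-sending while the check is failing, and letting alerts resolve), which Prometheus currently handles for us.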

          People

            marks.polakovs Marks Polakovs (Inactive)