  1. Couchbase Monitoring and Observability Stack
  2. CMOS-341

Standardize label names (metrics, dashboard variables, alerts)


Details

    • Task
    • Resolution: Unresolved
    • Major
    • 1.0
    • None
    • cmos
    • None

    Description

      In discussion, Aaron Benton pointed out the need for a consistent set of label names across all our metrics. These in turn feed the variables our dashboards use and the labels set on the alerts we fire.

      Working backwards, here's the current schema for our Alertmanager alerts:

      job: couchbase_cluster_monitor | couchbase_prometheus | couchbase_fluent_bit
      kind: cluster | node | bucket
      severity: info | warning | critical
      health_check_id: CB99999
      health_check_name: itsInternalName
      cluster: the_cluster_name
      node: [the_node_hostname_if_relevant]
      bucket: [the_bucket_name_if_relevant]
      

      A few things jump out here:

      • Decide between cluster and cluster_name, and standardize on one
      • kind is not immediately clear; "level" or something similar may work better
      • It may be useful to explicitly set a category label (e.g. category: prometheus) on all our alerts, so that, for example, a customer can route Couchbase alerts to their DBAs and their Prometheus self-monitoring alerts to their observability team
        • NB: we can't use job for this, as customers set it to all sorts of things.

      Some of these feed further back. In particular, cluster is set by our Prometheus scrape configurations (indeed, we mandate that it is set to the exact name used by CBS, and the add-cluster form does this automatically), so if we decide that cluster_name is a better fit, we'd have to change it there too. It's also worth noting that, for Prometheus alerts, any labels returned by the executed PromQL are added to the alert (cf. CMOS-328), so where possible we should keep metric and alert label names the same. The complication is instance: scraped metrics carry an instance label, whereas so far we've used node on alerts, so we need to decide on one or the other.
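
To make the renaming question concrete, here is a minimal sketch of a normalization pass over a label set. It is illustrative only: it assumes the canonical names end up being cluster and node, which this ticket has not yet decided, and ALIASES and normalize_labels are hypothetical names, not part of CMOS.

```python
from __future__ import annotations

# Hypothetical alias table mapping names to retire onto a canonical name.
# The direction (cluster_name -> cluster, instance -> node) is an assumption;
# the ticket leaves both choices open.
ALIASES = {
    "cluster_name": "cluster",
    "instance": "node",
}


def normalize_labels(labels: dict[str, str]) -> dict[str, str]:
    """Return a copy of labels with aliased names rewritten to canonical ones.

    Raises ValueError if an alias and its canonical name are both present
    with different values, since silently picking one would hide a conflict.
    """
    result: dict[str, str] = {}
    for name, value in labels.items():
        canonical = ALIASES.get(name, name)
        if canonical in result and result[canonical] != value:
            raise ValueError(
                f"conflicting values for {canonical!r}: "
                f"{result[canonical]!r} vs {value!r}"
            )
        result[canonical] = value
    return result
```

If the decision goes the other way (cluster_name and instance), only the alias table would need to be inverted; the conflict check stays the same.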

      Once the relevant decisions have been made, they should be enforced by dashboard and rule linting.
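
As a rough idea of what the rule-linting half could look like, the sketch below checks one alert rule's labels against the schema listed earlier in this ticket. The names ALERT_LABEL_SCHEMA and lint_rule_labels are hypothetical, the schema reflects the current (not yet finalized) label names, and the rule is assumed to have already been parsed into a plain dict.

```python
from __future__ import annotations

# Allowed label names and, where the schema enumerates them, allowed values.
# Copied from the alert schema in this ticket; None means "any value".
ALERT_LABEL_SCHEMA: dict[str, set[str] | None] = {
    "job": None,  # customers set this freely, so values are not checked
    "kind": {"cluster", "node", "bucket"},
    "severity": {"info", "warning", "critical"},
    "health_check_id": None,
    "health_check_name": None,
    "cluster": None,
    "node": None,
    "bucket": None,
}


def lint_rule_labels(rule_name: str, labels: dict[str, str]) -> list[str]:
    """Return human-readable problems found in one rule's labels."""
    problems: list[str] = []
    for name, value in sorted(labels.items()):
        if name not in ALERT_LABEL_SCHEMA:
            problems.append(f"{rule_name}: unknown label {name!r}")
            continue
        allowed = ALERT_LABEL_SCHEMA[name]
        if allowed is not None and value not in allowed:
            problems.append(
                f"{rule_name}: {name}={value!r} not in {sorted(allowed)}"
            )
    return problems
```

A linter built this way would load each rule file with whatever YAML parser the project already uses, call lint_rule_labels once per rule, and fail the build if any problems are returned. Dashboard variables could be checked against the same table.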


          People

            Unassigned Unassigned
            marks.polakovs Marks Polakovs (Inactive)