Couchbase Monitoring and Observability Stack / CMOS-327

Alerts for the same value and different thresholds should share the same name

Details

    Description

      The two alerts below share the same PromQL expression; the only difference is the threshold value. Alert names do not have to be unique in Alertmanager, and for alerts of the same type they should not be: Alertmanager can inhibit a lower-severity alert when an alert of the same type is already firing at a higher severity.

      - alert: CB90055-metadataOverhead-Warning
        expr: |
          (kv_total_memory_overhead_bytes / kv_ep_max_size) > 0.5 < 0.9
        for: 0m
        labels:
          job: couchbase_prometheus
          kind: bucket
          health_check_id: CB90055
          health_check_name: metadataOverhead
          cluster: '{{ $labels.cluster }}'
          node: '{{ $labels.instance }}'
          bucket: '{{ $labels.bucket }}'
          severity: warning
        annotations:
          title: "Metadata Overhead Above 50% on Bucket: {{ $labels.bucket }}, Node: {{ $labels.instance }}"
          description: The percentage of memory that is taken up by metadata is over 50%
          remediation: Increase memory allocation for bucket or change the evictionPolicy of the bucket from `Value-only` (be aware this will have an adverse effect on performance).
      - alert: CB90055-metadataOverhead-Alert
        expr: |
          (kv_total_memory_overhead_bytes / kv_ep_max_size) >= 0.9
        for: 0m
        labels:
          job: couchbase_prometheus
          kind: bucket
          health_check_id: CB90055
          health_check_name: metadataOverhead
          cluster: '{{ $labels.cluster }}'
          node: '{{ $labels.instance }}'
          bucket: '{{ $labels.bucket }}'
          severity: critical
        annotations:
          title: "Metadata Overhead Above 90% on Bucket: {{ $labels.bucket }}, Node: {{ $labels.instance }}"
          description: The percentage of memory that is taken up by metadata is over 90%
          remediation: Increase memory allocation for bucket or change the evictionPolicy of the bucket from `Value-only` (be aware this will have an adverse effect on performance).

      For example, if both of these alerts were named "CB90055-metadataOverhead" and each used a single threshold rather than a range, we could set up the following generic rule in Alertmanager:

      inhibit_rules:
      - source_matchers:
        - severity="critical"
        target_matchers:
        # Matchers in one list must all match, so a single regex matcher covers both severities.
        - severity=~"warning|info"
        equal: [ alertname, cluster_name ] 

      This says that if a critical alert is firing and an alert with the same name against the same cluster comes in with a severity of warning or info, Alertmanager suppresses notifications for the lower-severity alert, because one of a higher priority is already firing.

      This also makes changing thresholds easier: each rule can be independent and does not have to define both an upper and a lower bound.
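
      As a rough sketch of the proposal, the two rules above might look like the following once they share one name and each uses a single threshold. The name `CB90055-metadataOverhead` comes from the description; everything else is copied from the existing rules. One assumption to check: the inhibit rule compares `cluster_name`, while these rules set a `cluster` label, so the `equal` list only distinguishes clusters if the alerts actually carry the label it names.

      ```yaml
      # Sketch only: same alert name, one threshold per rule, severity carried in a label.
      - alert: CB90055-metadataOverhead
        expr: |
          (kv_total_memory_overhead_bytes / kv_ep_max_size) > 0.5
        for: 0m
        labels:
          job: couchbase_prometheus
          kind: bucket
          health_check_id: CB90055
          health_check_name: metadataOverhead
          cluster: '{{ $labels.cluster }}'
          node: '{{ $labels.instance }}'
          bucket: '{{ $labels.bucket }}'
          severity: warning
        annotations:
          title: "Metadata Overhead Above 50% on Bucket: {{ $labels.bucket }}, Node: {{ $labels.instance }}"
          description: The percentage of memory that is taken up by metadata is over 50%
      - alert: CB90055-metadataOverhead
        expr: |
          (kv_total_memory_overhead_bytes / kv_ep_max_size) >= 0.9
        for: 0m
        labels:
          job: couchbase_prometheus
          kind: bucket
          health_check_id: CB90055
          health_check_name: metadataOverhead
          cluster: '{{ $labels.cluster }}'
          node: '{{ $labels.instance }}'
          bucket: '{{ $labels.bucket }}'
          severity: critical
        annotations:
          title: "Metadata Overhead Above 90% on Bucket: {{ $labels.bucket }}, Node: {{ $labels.instance }}"
          description: The percentage of memory that is taken up by metadata is over 90%
      ```

      With these rules, a bucket at 95% fires both the warning and the critical alert, and the inhibit rule is what keeps the warning from notifying.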

       

          People

            Assignee: Unassigned
            Reporter: Aaron Benton (Inactive)