Description
In discussion with Aaron Benton, he pointed out the need for a consistent set of label names across all our metrics; these names feed into the variables our dashboards use and into the labels set on the alerts we fire.
Working backwards, here's the current schema for our Alertmanager alerts:
```
job: couchbase_cluster_monitor | couchbase_prometheus | couchbase_fluent_bit
kind: cluster | node | bucket
severity: info | warning | critical
health_check_id: CB99999
health_check_name: itsInternalName
cluster: the_cluster_name
node: [the_node_hostname_if_relevant]
bucket: [the_bucket_name_if_relevant]
```
A few things jump out here:
- Decide whether to standardize on cluster or cluster_name
- kind is not immediately clear; perhaps "level" (or similar) would work better
- It may be useful to explicitly set category: prometheus on all our alerts so that, e.g., a customer can route Couchbase alerts to their DBAs but their Prometheus self-monitoring alerts to their observability team
- NB: we can't use job for this, as customers set it to all sorts of things
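If we did add category: prometheus, a customer's Alertmanager route could split on it along these lines (receiver names are illustrative; the `matchers` syntax requires Alertmanager 0.22+):

```yaml
route:
  receiver: dba-team                 # default: Couchbase alerts go to the DBAs
  routes:
    - matchers:
        - category = prometheus     # CMOS self-monitoring alerts
      receiver: observability-team
receivers:
  - name: dba-team
  - name: observability-team
```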
Some of these decisions feed further back. Notably, cluster is set by our Prometheus scrape configurations (indeed, we mandate that it be set to the exact name used by CBS, and the add-cluster form does this automatically), so if we decide that cluster_name is a better fit, we'd have to change it there too. It's also worth noting that, for Prometheus alerts, any labels returned by the executed PromQL are added to the alert (cf. CMOS-328), so where possible we should keep these the same. The complication there is instance: so far we've used node, and we need to decide on one or the other.
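For reference, the cluster label in question is attached in the scrape configuration, roughly as below (the target address is a placeholder):

```yaml
scrape_configs:
  - job_name: couchbase-server        # customers may set job_name to anything
    static_configs:
      - targets: ["10.0.0.1:9091"]
        labels:
          cluster: the_cluster_name   # mandated to match the exact CBS cluster name
```

Renaming to cluster_name would mean changing this key here as well as in every dashboard variable and alert that consumes it.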
Once the relevant decisions have been made, they should be enforced by dashboard and rule linting.
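As a sketch of what the rule-linting side could look like: a minimal check that every alerting rule carries the mandatory labels. The required-label set here is an assumption pending the decisions above, and the function operates on already-parsed rule groups rather than files.

```python
# Minimal lint sketch: verify every alerting rule carries the labels we
# standardize on. REQUIRED_LABELS is an assumption pending the decisions above.
REQUIRED_LABELS = {"kind", "severity", "health_check_id", "health_check_name"}

def missing_labels(rule: dict) -> set:
    """Return the required labels absent from one alerting rule."""
    return REQUIRED_LABELS - set(rule.get("labels", {}))

def lint_rules(groups: list) -> list:
    """Return (alert_name, sorted_missing_labels) pairs for non-compliant rules."""
    problems = []
    for group in groups:
        for rule in group.get("rules", []):
            if "alert" not in rule:  # skip recording rules
                continue
            missing = missing_labels(rule)
            if missing:
                problems.append((rule["alert"], sorted(missing)))
    return problems
```

A CI job could parse each rules file with a YAML loader, feed the `groups` list into `lint_rules`, and fail the build if any problems are reported; the same shape of check would work for dashboard variable names.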