Description
In discussion with Aaron Benton, he pointed out the need for a consistent set of label names across all our metrics; these names feed into the variables our dashboards use and into the labels set on the alerts we fire.
Working backwards, here's the current schema for our Alertmanager alerts:
```
job: couchbase_cluster_monitor | couchbase_prometheus | couchbase_fluent_bit
kind: cluster | node | bucket
severity: info | warning | critical
health_check_id: CB99999
health_check_name: itsInternalName
cluster: the_cluster_name
node: [the_node_hostname_if_relevant]
bucket: [the_bucket_name_if_relevant]
```
A few things jump out here:
- Decide whether to standardize on cluster or cluster_name
- kind is not immediately clear; perhaps "level" (or similar) would work better
- It may be useful to explicitly set category: prometheus on all our alerts so that, e.g., a customer can route Couchbase alerts to their DBAs but their Prometheus self-monitoring alerts to their observability team
- NB: we can't use job for this, as customers set it to all sorts of things
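If we did add category: prometheus, a customer's Alertmanager route could split on it along these lines (receiver names are illustrative; the `matchers` syntax requires Alertmanager 0.22+):

```yaml
route:
  receiver: dba-team                 # default: Couchbase alerts go to the DBAs
  routes:
    - matchers:
        - category = prometheus     # CMOS self-monitoring alerts
      receiver: observability-team
receivers:
  - name: dba-team
  - name: observability-team
```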
Some of these decisions feed further back. Notably, cluster is set by our Prometheus scrape configurations (indeed, we mandate that it be set to the exact name used by CBS, and the add-cluster form does this automatically), so if we decide that cluster_name is a better fit, we'd have to change it there too. It's also worth noting that, for Prometheus alerts, any labels returned by the executed PromQL are added to the alert (cf. CMOS-328), so where possible we should keep these the same. The complication there is instance: so far we've used node, and we need to decide on one or the other.
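For reference, the cluster label in question is attached in the scrape configuration, roughly as below (the target address is a placeholder):

```yaml
scrape_configs:
  - job_name: couchbase-server        # customers may set job_name to anything
    static_configs:
      - targets: ["10.0.0.1:9091"]
        labels:
          cluster: the_cluster_name   # mandated to match the exact CBS cluster name
```

Renaming to cluster_name would mean changing this key here as well as in every dashboard variable and alert that consumes it.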
Once the relevant decisions have been made, they should be enforced by dashboard and rule linting.
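As a sketch of what the rule-linting side could look like: a minimal check that every alerting rule carries the mandatory labels. The required-label set here is an assumption pending the decisions above, and the function operates on already-parsed rule groups rather than files.

```python
# Minimal lint sketch: verify every alerting rule carries the labels we
# standardize on. REQUIRED_LABELS is an assumption pending the decisions above.
REQUIRED_LABELS = {"kind", "severity", "health_check_id", "health_check_name"}

def missing_labels(rule: dict) -> set:
    """Return the required labels absent from one alerting rule."""
    return REQUIRED_LABELS - set(rule.get("labels", {}))

def lint_rules(groups: list) -> list:
    """Return (alert_name, sorted_missing_labels) pairs for non-compliant rules."""
    problems = []
    for group in groups:
        for rule in group.get("rules", []):
            if "alert" not in rule:  # skip recording rules
                continue
            missing = missing_labels(rule)
            if missing:
                problems.append((rule["alert"], sorted(missing)))
    return problems
```

A CI job could parse each rules file with a YAML loader, feed the `groups` list into `lint_rules`, and fail the build if any problems are reported; the same shape of check would work for dashboard variable names.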