Details
- New Feature
- Resolution: Done
- Major
- None
- None
Description
Migrated from https://github.com/couchbaselabs/observability/issues/7.
Demonstrate how someone could tune alerts and reporting, dashboards, etc. for a specific deployment.
An example for the microlith is in place to show how to use custom Prometheus alerts with defaults: https://github.com/couchbaselabs/observability/tree/06b40dd3d36e743521a1d9bd76b73a895d5fca78
This supports combining default and custom rules provided at runtime to Prometheus.
- /etc/prometheus/alerting
  - couchbase <-- default rules
  - custom <-- empty by default, add custom rules here
This lets us either override all the defaults or just extend them by mounting over these directories from the host, a config map, a volume, etc.
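As a rough sketch of extending the defaults, the script below writes one custom rule file into a local directory that could then be mounted over /etc/prometheus/alerting/custom. The alert name, threshold, labels, file name and the `.yaml` extension are all illustrative assumptions, as is the image name in the commented `docker run` line.

```shell
#!/bin/sh
set -eu

# Local stand-in for the directory to mount over /etc/prometheus/alerting/custom.
RULES_DIR="${RULES_DIR:-./alerting}"
mkdir -p "$RULES_DIR/custom"

# Hypothetical custom rule: the alert name, threshold and labels are illustrative only.
cat > "$RULES_DIR/custom/resident_ratio.yaml" <<'EOF'
groups:
  - name: custom-couchbase
    rules:
      - alert: CustomLowResidentRatio
        expr: cbbucketstat_vbuckets_active_resident_items_ratio < 50
        for: 5m
        labels:
          severity: warning
EOF

# Mount it in when starting the container, for example (image name is a placeholder):
#   docker run -v "$(pwd)/alerting/custom:/etc/prometheus/alerting/custom" <microlith-image>
echo "wrote $RULES_DIR/custom/resident_ratio.yaml"
```

Because only the custom directory is mounted, the default rules under couchbase stay in place and the new rule is loaded alongside them.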
We can even override only certain defaults at the file level if need be: this should encourage a granular breakdown of rules into separate files, making it easier to target specific ones.
Rather than overcomplicate things, for now we provide substitution via a template file for tuning, with an example of how to disable a rule by overriding the file that provides it (removing the rule from it).
This should then support the following:
- Default rules out of the box from Couchbase.
- Provide customer-specific rules on top of the defaults. Just mount them in via volumes, config maps, etc.
- Support tuning of any of these rules, default or custom, by environment variable. Just provide the variables.
- Disable rules by overriding them: remove the rule from the default file that provides it, then mount the modified file in.
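A minimal sketch of the "disable by overriding" case is shown below. It assumes a default rules file containing two rules and prepares a replacement with the same file name that omits one of them. The file name, the alert names and the expressions are hypothetical stand-ins, not the actual defaults.

```shell
#!/bin/sh
set -eu

WORK_DIR="$(mktemp -d)"
mkdir -p "$WORK_DIR/couchbase"

# Stand-in for a default rules file shipped in the image (contents are illustrative).
cat > "$WORK_DIR/default_rules.yaml" <<'EOF'
groups:
  - name: couchbase
    rules:
      - alert: ExampleRuleToKeep
        expr: up == 0
        for: 1m
      - alert: ExampleRuleToDisable
        expr: up == 1
        for: 1m
EOF

# Override: same file name, with the unwanted rule removed by hand.
cat > "$WORK_DIR/couchbase/default_rules.yaml" <<'EOF'
groups:
  - name: couchbase
    rules:
      - alert: ExampleRuleToKeep
        expr: up == 0
        for: 1m
EOF

# Sanity check before mounting: the disabled rule must not appear in the override.
if grep -q 'ExampleRuleToDisable' "$WORK_DIR/couchbase/default_rules.yaml"; then
  echo "rule still present in override" >&2
  exit 1
fi
echo "override ready in $WORK_DIR/couchbase"
```

Mounting the override file over the file of the same name under /etc/prometheus/alerting/couchbase would then replace only that file, leaving any other default rule files untouched.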
If we decide we want more than this (and we may want to pick it up anyway), we can adopt an approach like: https://github.com/lablabs/prometheus-alert-overrider
That approach pre-processes all files according to a known format: the tool uses a separate YAML override file to match against any rules found and then update them.
Tuning is available via environment variables with an example default rule for resident ratio:
expr: cbbucketstat_vbuckets_active_resident_items_ratio > $COUCHBASE_ACTIVE_RESIDENT_RATIO_ALERT_THRESHOLD
for: $COUCHBASE_ACTIVE_RESIDENT_RATIO_ALERT_DURATION
At startup we default them in the entrypoint for Prometheus:
export COUCHBASE_ACTIVE_RESIDENT_RATIO_ALERT_THRESHOLD=${COUCHBASE_ACTIVE_RESIDENT_RATIO_ALERT_THRESHOLD:-100}
export COUCHBASE_ACTIVE_RESIDENT_RATIO_ALERT_DURATION=${COUCHBASE_ACTIVE_RESIDENT_RATIO_ALERT_DURATION:-1m}
For the example we override the default:
- COUCHBASE_ACTIVE_RESIDENT_RATIO_ALERT_THRESHOLD=75 # default is 100
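To illustrate how the defaulting and the substitution fit together, here is a small sketch that applies the same `${VAR:-default}` pattern and then renders a template fragment. Using `sed` for the substitution, and the template file name, are assumptions made for illustration. The real entrypoint may use a different tool.

```shell
#!/bin/sh
set -eu

# Same defaulting pattern as the entrypoint: keep a value supplied by the
# environment, otherwise fall back to the default.
export COUCHBASE_ACTIVE_RESIDENT_RATIO_ALERT_THRESHOLD="${COUCHBASE_ACTIVE_RESIDENT_RATIO_ALERT_THRESHOLD:-100}"
export COUCHBASE_ACTIVE_RESIDENT_RATIO_ALERT_DURATION="${COUCHBASE_ACTIVE_RESIDENT_RATIO_ALERT_DURATION:-1m}"

WORK_DIR="$(mktemp -d)"

# Template fragment with the two placeholders (quoted heredoc, so nothing expands here).
cat > "$WORK_DIR/rule.yaml.template" <<'EOF'
expr: cbbucketstat_vbuckets_active_resident_items_ratio > $COUCHBASE_ACTIVE_RESIDENT_RATIO_ALERT_THRESHOLD
for: $COUCHBASE_ACTIVE_RESIDENT_RATIO_ALERT_DURATION
EOF

# Replace each literal $NAME placeholder with the value of that variable.
# Assumes the values contain no '|' or '&' characters.
sed \
  -e 's|\$COUCHBASE_ACTIVE_RESIDENT_RATIO_ALERT_THRESHOLD|'"$COUCHBASE_ACTIVE_RESIDENT_RATIO_ALERT_THRESHOLD"'|g' \
  -e 's|\$COUCHBASE_ACTIVE_RESIDENT_RATIO_ALERT_DURATION|'"$COUCHBASE_ACTIVE_RESIDENT_RATIO_ALERT_DURATION"'|g' \
  "$WORK_DIR/rule.yaml.template" > "$WORK_DIR/rule.yaml"

cat "$WORK_DIR/rule.yaml"
```

With neither variable set, the rendered rule compares against 100 for 1m. Running the script with COUCHBASE_ACTIVE_RESIDENT_RATIO_ALERT_THRESHOLD=75 in the environment renders the same rule with a threshold of 75.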
If we want anything more complex than this, we need to consider a pre-processing approach similar to the linked example.
Issue Links
- relates to: CMOS-46 Update CMOS to use the alert-overrider approach (Done)