Details
Type: Task
Resolution: Done
Priority: Major
Description
Currently, all cluster monitor alerting goes through six generic Prometheus alert rules, one for each combination of target level (cluster/node/bucket) and severity (warn/alert). This is great for deduplication, but it means we're limited to the metadata provided by the Prometheus metric labels - namely the name/ID of the failing health check and the cluster/node/bucket it targets.
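For illustration, one of those six generic rules might look roughly like the sketch below. The metric name (cluster_checker_status) and its label set are assumptions for the sake of the example, not the actual names exported by cbmultimanager:

```yaml
# Minimal sketch of one of the six generic rules (assumed metric/label names).
groups:
  - name: cluster-monitor-generic
    rules:
      - alert: ClusterCheckerWarning
        expr: cluster_checker_status{severity="warn"} == 1
        labels:
          severity: warning
        annotations:
          # Only label values can be interpolated here: the checker's
          # name/ID and the cluster it targets.
          summary: 'Checker {{ $labels.name }} is warning on cluster {{ $labels.cluster }}'
```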
Ideally we'd have bespoke rules for each checker, each with a summary and remediation steps in the alert itself, instead of having to link out to either the cluster monitor GUI or the documentation.
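A bespoke rule for a single checker could then carry its own text inline, something like the following sketch (same assumed metric/label names as above; the remediation wording is illustrative only):

```yaml
# Sketch of a bespoke rule for one checker, under the same groups/rules
# structure as the generic example above.
- alert: CBNoIndexRedundancy
  expr: cluster_checker_status{name="noIndexRedundancy"} == 1
  labels:
    severity: warning
  annotations:
    summary: 'Cluster {{ $labels.cluster }} has one or more indexes with no redundancy'
    remediation: 'Add index replicas (or equivalent indexes) on another node so that losing a single index node does not take the index offline.'
```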
This will, however, mean maintaining (currently) 33+ individual alert definitions. Personally I think the UX benefit makes it worthwhile, but it'd make sense to spend a little time looking at whether this can be automated, for example by code analysis or by extracting the alert definitions into a JSON file or similar (see the sketch below).
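If the definitions were pulled out into a data file, a generator could emit the per-checker rules from entries like this. The file shape here is entirely hypothetical, just to show the idea:

```yaml
# checkers.yaml - hypothetical source of truth from which per-checker
# alert rules could be generated, one entry per checker, instead of
# maintaining 33+ rule definitions by hand.
- id: noIndexRedundancy
  severity: warning
  summary: 'Cluster {{ $labels.cluster }} has one or more indexes with no redundancy'
  remediation: 'Add index replicas or equivalent indexes on another node.'
```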
The limitation of this approach is that we can only interpolate the metadata available through labels - so we could have the name of the cluster that failed a check, but not e.g. the names of the specific indexes that failed noIndexRedundancy. As far as I'm aware, the only way to get that without packing per-checker metadata into Prometheus labels (which is not only a horrible anti-pattern, but also introduces unbounded time series cardinality) would be to push our own alerts to Alertmanager directly - likely something to consider post-0.1.
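For reference, pushing our own alerts would mean POSTing alert objects to Alertmanager's v2 API (/api/v2/alerts). A rough sketch of one such alert, written in YAML like the other examples (the endpoint itself consumes the JSON equivalent); the cluster and index names are made up:

```yaml
# One alert object as it would be POSTed (as JSON) to /api/v2/alerts.
- labels:
    alertname: CBNoIndexRedundancy
    cluster: my-cluster            # hypothetical cluster name
    severity: warning
  annotations:
    # Free-form annotations can carry per-checker detail (e.g. the
    # failing index names) without adding any time series cardinality.
    summary: 'Indexes idx_foo and idx_bar have no redundancy'
  startsAt: '2022-01-01T00:00:00Z'
```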
For Gerrit Dashboard: CMOS-190

# | Subject | Branch | Project | Status | CR | V
---|---|---|---|---|---|---
166828,7 | CMOS-190 Write bespoke alert rules for each checker | master | cbmultimanager | MERGED | +2 | +1
166933,2 | CMOS-190 Fix use of incorrect Prometheus metrics | master | cbmultimanager | MERGED | +2 | +1