Details
Type: Improvement
Priority: Minor
Resolution: Unresolved
Description
Currently there's no way for outside users to know that a cluster heartbeat failed, except by querying the REST API.
We should expose heartbeat-related stats to Prometheus. I'm considering:
- Counter of failures (can alert on increase)
- Current status of each cluster as a gauge (can show in Grafana)
Both would, of course, be labelled by cluster.
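
As a rough illustration of the proposal, here is a minimal stdlib-only Python sketch that tracks both metrics per cluster and renders them in the Prometheus text exposition format. The metric names (`cluster_heartbeat_failures_total`, `cluster_heartbeat_up`), the `cluster` label name, and the `HeartbeatStats` class are all hypothetical; nothing here reflects an existing implementation, and a real one would more likely use the project's existing metrics library.

```python
from collections import defaultdict

# Hypothetical metric names; the issue does not specify any.
FAILURES = "cluster_heartbeat_failures_total"
STATUS = "cluster_heartbeat_up"


class HeartbeatStats:
    """Tracks per-cluster heartbeat results for export to Prometheus."""

    def __init__(self):
        self.failures = defaultdict(int)  # cluster -> failure count (counter)
        self.status = {}                  # cluster -> 1 (ok) or 0 (failed) (gauge)

    def record(self, cluster, ok):
        """Record the outcome of one heartbeat for the given cluster."""
        # Adding 0 on success still creates the series, so the counter is
        # exported as 0 for clusters that have never failed.
        self.failures[cluster] += 0 if ok else 1
        self.status[cluster] = 1 if ok else 0

    def render(self):
        """Render both metrics in the Prometheus text exposition format.

        Assumes cluster names contain no characters that need escaping
        in a label value (backslash, double quote, newline).
        """
        lines = [
            f"# HELP {FAILURES} Total number of failed cluster heartbeats.",
            f"# TYPE {FAILURES} counter",
        ]
        for cluster in sorted(self.failures):
            lines.append(f'{FAILURES}{{cluster="{cluster}"}} {self.failures[cluster]}')
        lines += [
            f"# HELP {STATUS} Whether the last heartbeat succeeded (1) or failed (0).",
            f"# TYPE {STATUS} gauge",
        ]
        for cluster in sorted(self.status):
            lines.append(f'{STATUS}{{cluster="{cluster}"}} {self.status[cluster]}')
        return "\n".join(lines) + "\n"
```

With metrics shaped like this, an alert could fire on an expression such as `increase(cluster_heartbeat_failures_total[5m]) > 0`, and the gauge could be plotted per cluster in Grafana.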