Currently the Status loop (which runs the health checks) runs every five minutes, so an issue can go unnoticed for up to five minutes. That could lead to inconsistent data in the dashboards and poor UX.
Ideas for how we could improve this:
- Just run the checkers more frequently - I'd rather not, since they could quickly overload clusters
- Split the checkers into "frequent" and "less frequent" groups that run at different intervals
- Re-run some checkers (those that only need "cluster summary" data and nothing else) as soon as the cluster summaries are updated (which is done by the Heart loop every minute)
- Related to that, possibly use streaming / long-polling so the cluster summary data is updated near-instantly
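
To make the second and third ideas more concrete, here is a rough sketch of how they might fit together. It is only an illustration, not the existing code: `Checker`, `interval_s`, `summary_only`, `on_status_tick` and `on_summary_updated` are hypothetical names, and I'm assuming each checker can be modelled as a function that takes the cluster summary and returns pass/fail.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Checker:
    """Hypothetical checker description; all field names are assumptions."""
    name: str
    interval_s: float            # how often the Status loop should run it
    summary_only: bool           # True if it needs only "cluster summary" data
    run: Callable[[dict], bool]  # returns True when the check passes
    last_run: float = float("-inf")  # never run yet


def due_checkers(checkers: List[Checker], now: float) -> List[Checker]:
    """Pick the checkers whose own interval has elapsed."""
    return [c for c in checkers if now - c.last_run >= c.interval_s]


def run_checkers(checkers: List[Checker], summary: dict, now: float) -> Dict[str, bool]:
    """Run the given checkers and record when each one last ran."""
    results = {}
    for c in checkers:
        results[c.name] = c.run(summary)
        c.last_run = now
    return results


def on_status_tick(checkers: List[Checker], summary: dict, now: float) -> Dict[str, bool]:
    """Called on every Status loop tick: run only the checkers that are due."""
    return run_checkers(due_checkers(checkers, now), summary, now)


def on_summary_updated(checkers: List[Checker], summary: dict, now: float) -> Dict[str, bool]:
    """Called when the Heart loop refreshes summaries: re-run summary-only checkers."""
    return run_checkers([c for c in checkers if c.summary_only], summary, now)
```

With something like this, the Status loop could tick more often than every five minutes without running every checker each time, since expensive checkers keep a long `interval_s`. Cheap summary-only checkers would additionally fire from `on_summary_updated`, which resets their `last_run` so they are not run again straight away on the next tick.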