Details
-
New Feature
-
Resolution: Unresolved
-
Major
-
None
-
None
Description
Filing this more so that we don't forget about it than as a specific implementation plan or spec; it will likely need refining.
It's not hard to imagine a situation where more than one checker goes off at the same time with the same root cause. Taking down a node is the classic case: you'd get alerts for the node being down, for the cluster not being fully active, and potentially for missing active/replica vBuckets. That is three alerts for one root cause.
This could happen in more subtle cases as well. For example, the issue described in CMOS-377 could show up both as an entry in the memcached log and as a condition detectable by querying Analytics. Ideally we'd suppress the former, as the latter is more specific.
Alertmanager has a system like this, called inhibit rules: https://prometheus.io/docs/alerting/latest/configuration/#inhibit_rule. We should look into whether it would be useful for us as-is, and take hints from it if it isn't.
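To make the idea concrete, here is a rough sketch of what inhibition could look like if we implemented it ourselves. It is loosely modelled on Alertmanager's source/target/equal concept, but heavily simplified (exact checker-name matching only), and it is not a proposal for the actual design. All names in it (`Alert`, `InhibitRule`, `apply_inhibitions`, the checker names and the `cluster` label) are made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Alert:
    checker: str                                 # hypothetical checker name
    labels: dict = field(default_factory=dict)   # e.g. {"cluster": "a"}


@dataclass
class InhibitRule:
    source: str          # checker whose firing alert suppresses others
    targets: frozenset   # checkers to suppress while the source is firing
    equal: tuple = ()    # labels that must have the same value on both alerts


def apply_inhibitions(alerts, rules):
    """Return the alerts that are not suppressed by any rule."""
    kept = []
    for alert in alerts:
        inhibited = any(
            alert.checker in rule.targets
            and any(
                src is not alert
                and src.checker == rule.source
                and all(src.labels.get(k) == alert.labels.get(k) for k in rule.equal)
                for src in alerts
            )
            for rule in rules
        )
        if not inhibited:
            kept.append(alert)
    return kept


# The node-down case from the description: one root cause, three alerts.
rules = [
    InhibitRule(
        source="nodeDown",
        targets=frozenset({"clusterNotActive", "missingVBuckets"}),
        equal=("cluster",),
    )
]
alerts = [
    Alert("nodeDown", {"cluster": "a"}),
    Alert("clusterNotActive", {"cluster": "a"}),
    Alert("missingVBuckets", {"cluster": "a"}),
    Alert("missingVBuckets", {"cluster": "b"}),  # unrelated cluster, must survive
]
kept = apply_inhibitions(alerts, rules)
```

With these inputs only the `nodeDown` alert for cluster `a` and the `missingVBuckets` alert for cluster `b` remain. The CMOS-377 case would fit the same shape, with the more specific Analytics-based checker as the source and the memcached log checker as the target.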