Details

Type: New Feature
Resolution: Unresolved
Priority: Major
Fix Version/s: 0.3
Affects Version/s: None
Component/s: cluster-monitor, cmos
Labels: None

Description

Filing this more so that we don't forget about it than as a specific implementation plan or spec; it will likely need refining.

It's not hard to imagine a situation where more than one checker goes off at the same time with the same root cause. Taking down a node is the classic case: you'd get alerts for the node being down, for the cluster not being fully active, and potentially for missing active/replica vBuckets - three alerts for the same root cause.

This could happen in more subtle cases as well - for example, the issue described in CMOS-377 could manifest both as an entry in the memcached log and as a condition detectable by querying Analytics. We'd ideally suppress the former, as the latter would be more specific.

Alertmanager has a system like this: https://prometheus.io/docs/alerting/latest/configuration/#inhibit_rule. We should look into whether it'll be useful for us, or take hints from it if it isn't.
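To make the idea concrete, here is a rough sketch of what an Alertmanager-style inhibition pass could look like if we did it ourselves. The rule fields (source_match, target_match, equal) mirror Alertmanager's inhibit_rule; everything else (the InhibitRule class, apply_inhibitions, and the checker names nodeDown, clusterNotActive and missingVBuckets) is hypothetical and only there for illustration, not a proposed design.

```python
from dataclasses import dataclass


@dataclass
class InhibitRule:
    """Hypothetical rule shape, modelled on Alertmanager's inhibit_rule."""
    source_match: dict   # labels the inhibiting (root cause) alert must carry
    target_match: dict   # labels the alert to be suppressed must carry
    equal: tuple = ()    # label names whose values must agree on both alerts


def _matches(alert, matchers):
    # An alert here is just a dict of labels.
    return all(alert.get(key) == value for key, value in matchers.items())


def _is_inhibited(target, alerts, rule):
    if not _matches(target, rule.target_match):
        return False
    return any(
        source is not target  # an alert never inhibits itself
        and _matches(source, rule.source_match)
        and all(source.get(name) == target.get(name) for name in rule.equal)
        for source in alerts
    )


def apply_inhibitions(alerts, rules):
    """Return the alerts that are not suppressed by any rule."""
    return [
        alert for alert in alerts
        if not any(_is_inhibited(alert, alerts, rule) for rule in rules)
    ]


# Hypothetical checker names for the node-down case described above.
alerts = [
    {"checker": "nodeDown", "cluster": "c1"},
    {"checker": "clusterNotActive", "cluster": "c1"},
    {"checker": "missingVBuckets", "cluster": "c1"},
    {"checker": "missingVBuckets", "cluster": "c2"},
]
rules = [
    InhibitRule({"checker": "nodeDown"}, {"checker": "clusterNotActive"}, ("cluster",)),
    InhibitRule({"checker": "nodeDown"}, {"checker": "missingVBuckets"}, ("cluster",)),
]
remaining = apply_inhibitions(alerts, rules)
```

With these inputs only the nodeDown alert for c1 and the missingVBuckets alert for c2 remain: the c2 alert survives because the equal constraint on cluster stops a node going down in c1 from hiding an unrelated problem elsewhere.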

People

Assignee: Unassigned

Reporter: Marks Polakovs (Inactive)

Dates

Created: 31/Mar/22 2:14 AM

Updated: 08/Nov/22 3:48 AM

Checker Inhibitions