Details

    • New Feature
    • Resolution: Unresolved
    • Major
    • 0.3
    • None
    • cluster-monitor, cmos
    • None

    Description

      Filing this more so that we don't forget about it than as a specific implementation plan or spec; it will likely need refining.

      It's not hard to imagine a situation where more than one checker would go off at the same time with the same root cause. Taking down a node is the classic case: you'd get alerts for the node being down, for the cluster not being fully active, and potentially for missing active/replica vBuckets - three alerts for the same root cause.

      This could happen in more subtle cases as well - for example, the issue described in CMOS-377 could manifest itself both as an entry in the memcached log and as a condition detectable by querying Analytics. Ideally we'd suppress the former, as the latter would be more specific.

      Alertmanager has a system like this: https://prometheus.io/docs/alerting/latest/configuration/#inhibit_rule. We should look into whether it'll be useful for us, or take hints from it if it isn't.
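
      As a starting point for discussion, here is a minimal sketch of what inhibition could look like on our side, loosely modelled on Alertmanager's inhibit rules (a source alert mutes a target alert when a chosen set of labels is equal on both). Everything in it is an assumption for illustration: the checker names, the "checker" and "cluster" labels, and the InhibitRule and suppress helpers are hypothetical and do not exist in cluster-monitor today. It only does exact label matching, not Alertmanager's regex matchers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InhibitRule:
    """Hypothetical rule: a firing source alert mutes matching target alerts."""
    source_match: dict  # labels the root-cause (inhibiting) alert must carry
    target_match: dict  # labels the alert to be suppressed must carry
    equal: tuple = ()   # label names whose values must agree on both alerts


def _matches(labels, wanted):
    # Exact-match only; a real implementation may also want regex matchers.
    return all(labels.get(key) == value for key, value in wanted.items())


def suppress(alerts, rules):
    """Return the alerts (dicts of labels) that survive inhibition."""
    kept = []
    for target in alerts:
        inhibited = any(
            _matches(target, rule.target_match)
            and any(
                source is not target  # an alert never inhibits itself
                and _matches(source, rule.source_match)
                and all(source.get(name) == target.get(name) for name in rule.equal)
                for source in alerts
            )
            for rule in rules
        )
        if not inhibited:
            kept.append(target)
    return kept


# Hypothetical checker names for the "node taken down" case described above.
EXAMPLE_RULES = [
    InhibitRule({"checker": "nodeDown"}, {"checker": "clusterNotActive"}, ("cluster",)),
    InhibitRule({"checker": "nodeDown"}, {"checker": "missingVBuckets"}, ("cluster",)),
]
EXAMPLE_ALERTS = [
    {"checker": "nodeDown", "cluster": "a"},
    {"checker": "clusterNotActive", "cluster": "a"},
    {"checker": "missingVBuckets", "cluster": "a"},
    {"checker": "clusterNotActive", "cluster": "b"},  # unrelated cluster, not muted
]

if __name__ == "__main__":
    for alert in suppress(EXAMPLE_ALERTS, EXAMPLE_RULES):
        print(alert)
```

      With these example rules, only the node-down alert for cluster "a" and the unrelated alert for cluster "b" would be reported. The CMOS-377 case could probably be expressed the same way, with the Analytics-based checker as the source and the memcached log checker as the target, but that would need checking against how those checkers label their results.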

          People

            Assignee: Unassigned
            Reporter: Marks Polakovs (marks.polakovs, inactive)