  1. Couchbase Monitoring and Observability Stack
  2. CMOS-228

Code-less alerts for specific CBS versions with known issues


Details

    Description

      There are cases where a serious bug affects, or could affect, every CBS install running a certain version. Examples include the Log4j CVE, MB-47502 (memory leak in the CB-embedded Prometheus), and MB-48783 (offline upgrade from certain versions on Ubuntu/Debian can corrupt the cluster config).

      While we could write bespoke checkers for each of them, that may be impossible or infeasible in certain cases. For example, the best we could do for the Prometheus issue would be to check its memory usage, but that would only fire an alert once usage has grown beyond a threshold, and the right threshold can vary widely with cluster size and other factors.

      It would be much easier to simply have an alert for "the version you're running has a serious bug, you should upgrade soon/ASAP". We could (and IMHO should) also extend it to check which services are running. For example, the Log4j issue only affects clusters running Analytics, so we don't need to fire an alert if the user isn't running it (arguably we should always be encouraging upgrades, but there are better ways to do that). Ideally we'd also allow checking which OS the user is running. However, in some cases CBS reports "x86_64-unknown-linux-gnu" (cf. MB-26154), so we'd need to find another way to detect it.

      If possible, we should define these issues in JSON/YAML instead of Go code, which makes adding a new known issue much easier. It also opens the door to updating the definitions separately from Cluster Monitor / CMOS in the future, which would mean users get alerted to new issues much faster.
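
As a rough sketch of what a code-less definition could look like, the Go below loads known issues from JSON and matches them against a cluster's version and running services. Everything in it is an assumption for illustration: the `KnownIssue` schema, the field names, the helper functions, and the `EXAMPLE-*` entries with their version ranges are placeholders, not real advisories and not an existing Cluster Monitor API. It also assumes the version arrives as `major.minor.patch` with an optional `-build-edition` suffix, and that Analytics is reported under the service name `cbas`.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
	"strings"
)

// KnownIssue is a hypothetical schema for one declaratively defined issue.
type KnownIssue struct {
	ID         string   `json:"id"`
	Summary    string   `json:"summary"`
	MinVersion string   `json:"minVersion"` // first affected version (inclusive)
	FixedIn    string   `json:"fixedIn"`    // first fixed version (exclusive)
	Services   []string `json:"services"`   // empty means "any service"
}

// exampleDefinitions holds placeholder IDs and version ranges, not real advisories.
const exampleDefinitions = `[
  {"id": "EXAMPLE-1", "summary": "placeholder: affects 7.0.x before 7.0.2",
   "minVersion": "7.0.0", "fixedIn": "7.0.2"},
  {"id": "EXAMPLE-2", "summary": "placeholder: only matters when Analytics is running",
   "minVersion": "6.5.0", "fixedIn": "7.0.3", "services": ["cbas"]}
]`

// parseVersion turns "7.0.1" or "7.0.1-6102-enterprise" into [7 0 1].
func parseVersion(v string) ([3]int, error) {
	var out [3]int
	core := strings.SplitN(v, "-", 2)[0] // drop any build/edition suffix
	parts := strings.Split(core, ".")
	if len(parts) != 3 {
		return out, fmt.Errorf("expected major.minor.patch, got %q", v)
	}
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil {
			return out, fmt.Errorf("bad version component %q in %q", p, v)
		}
		out[i] = n
	}
	return out, nil
}

// compare returns -1, 0 or 1 when a is lower than, equal to or higher than b.
func compare(a, b [3]int) int {
	for i := range a {
		if a[i] < b[i] {
			return -1
		}
		if a[i] > b[i] {
			return 1
		}
	}
	return 0
}

// loadIssues decodes the JSON definitions.
func loadIssues(data []byte) ([]KnownIssue, error) {
	var issues []KnownIssue
	if err := json.Unmarshal(data, &issues); err != nil {
		return nil, err
	}
	return issues, nil
}

// matchingIssues returns the issues whose version range contains clusterVersion
// and whose service list (if any) overlaps the running services.
func matchingIssues(issues []KnownIssue, clusterVersion string, running []string) ([]KnownIssue, error) {
	cv, err := parseVersion(clusterVersion)
	if err != nil {
		return nil, err
	}
	isRunning := make(map[string]bool, len(running))
	for _, s := range running {
		isRunning[s] = true
	}
	var hits []KnownIssue
	for _, issue := range issues {
		lo, err := parseVersion(issue.MinVersion)
		if err != nil {
			return nil, fmt.Errorf("%s: %w", issue.ID, err)
		}
		hi, err := parseVersion(issue.FixedIn)
		if err != nil {
			return nil, fmt.Errorf("%s: %w", issue.ID, err)
		}
		if compare(cv, lo) < 0 || compare(cv, hi) >= 0 {
			continue // cluster version is outside the affected range
		}
		relevant := len(issue.Services) == 0
		for _, s := range issue.Services {
			relevant = relevant || isRunning[s]
		}
		if relevant {
			hits = append(hits, issue)
		}
	}
	return hits, nil
}

func main() {
	issues, err := loadIssues([]byte(exampleDefinitions))
	if err != nil {
		panic(err)
	}
	hits, err := matchingIssues(issues, "7.0.1-6102-enterprise", []string{"kv", "n1ql"})
	if err != nil {
		panic(err)
	}
	for _, h := range hits {
		fmt.Printf("%s: %s\n", h.ID, h.Summary)
	}
}
```

With the placeholder data above, a 7.0.1 cluster running only `kv` and `n1ql` matches EXAMPLE-1 but not EXAMPLE-2, since EXAMPLE-2 is gated on a service that isn't running. An OS condition is deliberately left out of the sketch, because the reported OS string isn't reliable.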


          People

            Unassigned
            marks.polakovs Marks Polakovs (Inactive)