  1. Couchbase Server
  2. MB-53556

Record OS disk I/O metrics into prometheus

    Description

      We often see issues from users along the lines of "why was this disk-based operation slow?"

      At present we have very limited information to diagnose these kinds of problems - all we have is iostat invoked at cbcollect_info time for a handful of runs, which is often long after any particular issue has occurred.

      This can make debugging such issues very challenging - we have to resort to application-level "indications" of a disk problem (e.g. KV-Engine "Slow runtime" log messages for disk tasks, histograms of syscall durations) which are:

      • One step removed from the underlying problem (Customer: "But how do I know that the disk was slow - was it just your software?")
      • Not time-based (histograms of syscall durations)
      • Edge-triggered when something is sufficiently "bad" ("Slow runtime" messages tell you things exceeded some runtime threshold at time X, but don't tell you the behaviour when things are "good").

      Compare this to debugging other resourcing issues (CPU, memory): for those we have time-series numbers from sigar, so for disk we are in a much worse position.

      This is made doubly worse by the fact that disk I/O performance is often more variable than CPU - people virtualise disks sharing the same underlying physical resource, and/or use virtualised environments like AWS which impose IOPS limits that can be non-uniform (e.g. disks are allowed to burst to IOPS X for some number of minutes per day).

      While we have lived with the current (lack of) disk stats for a long time, I do think we should try to do something about this as:

      a) It's still a significant burden for support / engineering when analysing customer issues
      b) With Capella we are the single entity responsible for monitoring the hardware we are running on, so can no longer fall back to asking the customer "what did your disk monitoring system say?"

      In terms of minimal requirements I would suggest the following:

      1. Cumulative number of bytes written to CB Data volume over time.
      2. Cumulative number of bytes read from CB Data volume over time.
      3. These metrics tracked in Prometheus, similar to existing system metrics.
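
      As a rough illustration of the minimal requirements above, the following sketch derives cumulative read/written byte counters from Linux /proc/diskstats text and renders them in the Prometheus text exposition format. It is not how Couchbase Server collects stats: the metric names (sys_disk_read_bytes_total, sys_disk_written_bytes_total), the "disk" label and the sample line are all hypothetical, and mapping the CB Data volume to a block device is assumed to happen elsewhere.

```python
# Sketch only: assumes Linux /proc/diskstats layout (major, minor, name, then counters).
SECTOR_BYTES = 512  # diskstats reports sectors in 512-byte units


def parse_diskstats(text):
    """Return {device: {"read_bytes": int, "written_bytes": int}}."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 14:
            continue  # skip blank or unexpectedly short lines
        name = fields[2]
        stats[name] = {
            "read_bytes": int(fields[5]) * SECTOR_BYTES,     # sectors read
            "written_bytes": int(fields[9]) * SECTOR_BYTES,  # sectors written
        }
    return stats


def to_prometheus(stats, device):
    """Render one device's counters as Prometheus text exposition lines."""
    s = stats[device]
    lines = [
        # Metric names below are hypothetical, chosen for illustration.
        "# TYPE sys_disk_read_bytes_total counter",
        f'sys_disk_read_bytes_total{{disk="{device}"}} {s["read_bytes"]}',
        "# TYPE sys_disk_written_bytes_total counter",
        f'sys_disk_written_bytes_total{{disk="{device}"}} {s["written_bytes"]}',
    ]
    return "\n".join(lines) + "\n"


# Made-up sample line in /proc/diskstats format (not real measurements).
SAMPLE = "   8       0 sda 1000 10 2048 500 2000 20 4096 900 3 1200 1400\n"

if __name__ == "__main__":
    print(to_prometheus(parse_diskstats(SAMPLE), "sda"), end="")
```

      On a real Linux host the input would come from reading /proc/diskstats. Because both values are monotonically increasing counters, rates over any time window can be derived at query time rather than at collection time.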

      As "nice to have" requirements if they are not too hard to add:

      1. Disk queue size over time (instantaneous sample of current size)
      2. Disk read latency.
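
      The two "nice to have" items could be approximated from the same kind of source. The sketch below, again assuming the Linux /proc/diskstats field layout, takes the in-flight I/O count as an instantaneous queue-size sample and estimates average read latency between two samples as (change in milliseconds spent reading) / (change in reads completed). The function names and sample lines are hypothetical, and this is an average over the sampling interval, not a latency distribution.

```python
# Sketch only: assumes Linux /proc/diskstats field positions.
def parse_line(line):
    """Extract the fields needed for queue size and read latency."""
    f = line.split()
    return {
        "reads": int(f[3]),       # reads completed (cumulative)
        "read_ms": int(f[6]),     # milliseconds spent reading (cumulative)
        "in_flight": int(f[11]),  # I/Os currently in progress (instantaneous)
    }


def read_latency_ms(prev, curr):
    """Average ms per completed read between two samples, or None if no reads."""
    reads = curr["reads"] - prev["reads"]
    if reads <= 0:
        return None  # no reads completed (or counters reset) in this interval
    return (curr["read_ms"] - prev["read_ms"]) / reads


# Made-up samples for one device, taken some interval apart.
PREV = "   8       0 sda 1000 10 2048 500 2000 20 4096 900 3 1200 1400"
CURR = "   8       0 sda 1100 12 4096 750 2100 25 8192 990 5 1500 1900"

if __name__ == "__main__":
    prev, curr = parse_line(PREV), parse_line(CURR)
    print("queue size sample:", curr["in_flight"])
    print("avg read latency (ms):", read_latency_ms(prev, curr))
```

      The queue-size value is only a point-in-time sample, so short bursts between samples will be missed.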

            People

              ashwin.govindarajulu Ashwin Govindarajulu
              drigby Dave Rigby (Inactive)