Details

Type: Improvement
Resolution: Fixed
Priority: Major
Fix Version/s: 7.6.0
Affects Version/s: 6.6.5, 7.1.4, 7.0.5, 7.2.0
Component/s: ns_server
Labels:

Story Points:
1

Description

We often see issues from users along the lines of "why was this disk-based operation slow?"

At present we have very limited information to diagnose these kinds of problems - all we have is iostat invoked for a handful of runs at cbcollect_info time, which is often long after any particular issue has occurred.

This can make debugging such issues very challenging - we have to resort to application-level "indications" of a disk problem (e.g. KV-Engine "Slow runtime" log messages for disk tasks, histograms of syscall durations) which are:

One step removed from the underlying problem (Customer: "But how do I know that the disk was slow - was it just your software?")
Not time-based (histograms of syscall durations)
Edge-triggered when something is sufficiently "bad" ("Slow runtime" messages tell you that things exceeded some runtime threshold at time X, but don't tell you the behaviour when things are "good").

Compared with debugging other resource issues (CPU, memory), we are in a much worse position, because for those we have time-series numbers from sigar.

This is made worse still by the fact that disk I/O performance is often more variable than CPU: people virtualise disks that share the same underlying physical resource, and/or use virtualised environments like AWS that impose IOPS limits which can be non-uniform (for example, disks are allowed to burst to IOPS X for some number of minutes per day).

While we have lived with the current (lack of) disk stats for a long time, I do think we should try to do something about this as:

a) It's still a significant burden for support / engineering when analysing customer issues
b) With Capella we are the single entity responsible for monitoring the hardware we are running on, so we can no longer fall back to asking the customer "what did your disk monitoring system say?"

In terms of minimal requirements I would suggest the following:

Cumulative number of bytes written to CB Data volume over time.
Cumulative number of bytes read from CB Data volume over time.
Metrics tracked in Prometheus similar to existing system metrics.
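To make the minimal requirements concrete, here is a rough sketch of how cumulative read/write byte counters could be derived on Linux and rendered as Prometheus-style samples. It assumes the counters come from a line of `/proc/diskstats` (which counts 512-byte sectors); the metric names, the `disk` label and the sample line are hypothetical and are not taken from ns_server or sigar.

```python
from typing import NamedTuple

# /proc/diskstats reports sector counts in fixed 512-byte units.
SECTOR_BYTES = 512


class DiskCounters(NamedTuple):
    device: str
    read_bytes: int   # cumulative bytes read since boot
    write_bytes: int  # cumulative bytes written since boot


def parse_diskstats_line(line: str) -> DiskCounters:
    """Extract cumulative byte counters from one /proc/diskstats line."""
    fields = line.split()
    # Layout: major minor name reads_completed reads_merged sectors_read
    #         ms_reading writes_completed writes_merged sectors_written ...
    return DiskCounters(
        device=fields[2],
        read_bytes=int(fields[5]) * SECTOR_BYTES,
        write_bytes=int(fields[9]) * SECTOR_BYTES,
    )


def to_prometheus(c: DiskCounters) -> str:
    """Render the counters as text samples (metric names are made up)."""
    return (
        f'example_disk_read_bytes_total{{disk="{c.device}"}} {c.read_bytes}\n'
        f'example_disk_write_bytes_total{{disk="{c.device}"}} {c.write_bytes}\n'
    )


# Hypothetical sample line, not real data from any system.
SAMPLE = "   8       0 sda 1000 10 20000 500 2000 20 40000 1500 3 1800 2000"

if __name__ == "__main__":
    print(to_prometheus(parse_diskstats_line(SAMPLE)), end="")
```

Because the values are cumulative counters, throughput over any window can be derived at query time from the difference between two scrapes, which is why counters are sufficient for the "over time" requirement.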

As "nice to have" requirements if they are not too hard to add:

Disk queue size over time (instantaneous sample of current size)
Disk read latency.
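As an illustration of the "nice to have" items, the sketch below shows one way they could be approximated from two successive samples of `/proc/diskstats` on Linux. It assumes the in-flight I/O count is an acceptable stand-in for "disk queue size" and that average read latency is the change in time spent reading divided by the change in completed reads; the helper names and sample lines are hypothetical.

```python
from typing import NamedTuple, Optional


class DiskSample(NamedTuple):
    reads_completed: int  # cumulative reads completed
    ms_reading: int       # cumulative milliseconds spent reading
    ios_in_progress: int  # instantaneous count of in-flight I/Os


def parse_sample(line: str) -> DiskSample:
    """Pick the read-latency and queue fields out of a /proc/diskstats line."""
    f = line.split()
    # Indexes: 3 = reads completed, 6 = ms spent reading, 11 = I/Os in progress
    return DiskSample(int(f[3]), int(f[6]), int(f[11]))


def avg_read_latency_ms(prev: DiskSample, curr: DiskSample) -> Optional[float]:
    """Average latency per read between two samples, or None if no reads."""
    reads = curr.reads_completed - prev.reads_completed
    if reads <= 0:
        return None
    return (curr.ms_reading - prev.ms_reading) / reads


# Hypothetical samples taken some interval apart, not real data.
PREV = "8 0 sda 1000 10 20000 500 2000 20 40000 1500 3 1800 2000"
CURR = "8 0 sda 1100 10 22000 900 2100 20 42000 1700 5 1900 2300"

if __name__ == "__main__":
    prev, curr = parse_sample(PREV), parse_sample(CURR)
    print("queue size now:", curr.ios_in_progress)
    print("avg read latency (ms):", avg_read_latency_ms(prev, curr))
```

The latency here is an average over the sampling interval, so short spikes are smoothed out; a shorter interval gives finer resolution at the cost of more samples.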

Issue Links

backports to

MB-56351 [BP 7.2.x] Record OS disk I/O metrics into prometheus

Closed

MB-59237 [BP 7.2.x]: Record OS disk I/O metrics into prometheus

Closed

People

Assignee: Ashwin Govindarajulu

Reporter: Dave Rigby (Inactive)

Votes: 0

Watchers: 13

Dates

Created: 31/Aug/22 6:51 AM

Updated: 26/Jan/24 3:44 AM

Resolved: 06/Apr/23 2:00 AM