Details
- Improvement
- Resolution: Done
- Major
- 4.6.0
- Mad-Hatter Code Complete
Description
Background
Currently the various timing histograms used by KV-Engine suffer when values are extremely small, or (generally more of a problem) when values are extremely large.
For example, the histograms of ep-engine GlobalTask scheduler wait and run times max out at a "~17m -> Infinity" bucket - and while 17m is a long time, there's a big difference between that and forever.
Similar issues exist with the mctimings output, where we show the timings of specific binary protocol commands. In addition to the "very large" results, we also have discontinuities due to relatively naive bucketing - e.g. 10-microsecond-wide buckets up to 1ms, then 1-millisecond-wide buckets:
[ 980 - 989]us ( 89.83%)  145 | #
[ 990 - 999]us ( 90.05%)  128 | #
[   1 -   1]ms ( 96.62%) 3905 | ############################################
[   2 -   2]ms ( 97.60%)  584 | ######
In addition to potentially being misleading (do many more operations take 1 millisecond than take 990-999us?), this makes it harder to calculate percentiles - e.g. what is the 95th percentile above?
We should look to improve our timings:
- Can we unify on a single histogram / timing implementation? (we currently have at least two, one in memcached & one in ep-engine)
- Support a larger range of timings for GlobalTask scheduler and runtime histograms.
- Support a more continuous range of timings for commands, so we can easily calculate 95th, 99.7th, ... percentiles.
- Better export (e.g. rendering to a graph, import into other tools...)
Library to evaluate (others may be available):
- HDR Histogram. Edit: as of Vulcan we are using this for HiFi_MRU, so it is already available in the build.
- Folly's Histogram and TimeseriesHistogram