Details
-
Bug
-
Resolution: Fixed
-
Critical
-
6.6.5, 6.6.6, 7.0.4, 7.1.2, 7.2.0
-
Untriaged
-
0
-
Yes
-
KV 2023-2
Description
The KV-Engine scheduler & runtimes histograms record the duration of those phases of task execution. We use Hdr1sfMicroSecHistogram for each of them, and this can record values between 0us to 60s with 1 s.f. precision.
However, if a schedule or runtime exceeds 60s, it is not recorded - while hdr_histogram itself will indicate if a sample is out of range when added via hdr_record_value, we don't check the return value:
/**
|
* Records a value in the histogram, will round this value of to a precision at or better
|
* than the significant_figure specified at construction time.
|
*
|
* @param h "This" pointer
|
* @param value Value to add to the histogram
|
* @return false if the value is larger than the highest_trackable_value and can't be recorded,
|
* true otherwise.
|
*/
|
bool hdr_record_value(struct hdr_histogram* h, int64_t value); |
...
|
bool Hdr1sfMicroSecHistogram::add(std::chrono::microseconds v, size_t count = 1) { |
return addValueAndCount(static_cast<uint64_t>(v.count()), |
static_cast<uint64_t>(count)); |
}
|
};
|
...
|
// ! Histograms of various task run times, one per Task. |
std::vector<Hdr1sfMicroSecHistogram> taskRuntimeHisto;
|
...
|
void KVBucket::logRunTime(TaskId taskType, |
const std::chrono::steady_clock::duration runTime) { |
auto ms = std::chrono::duration_cast<std::chrono::microseconds>(runTime);
|
stats.taskRuntimeHisto[static_cast<int>(taskType)].add(ms); |
}
|
Note the return value from Hdr1sfMicroSecHistogram::add() is ignored, so we don't know a value was omitted.
I haven't checked other HdrHistograms, but it wouldn't surprise me if we have this problem elsewhere.
—
While there's always a trade-off between what range / precision a histogram can track and the space it takes, I think we should at least add an "overflow" bucket (xxx -> inf) which allows us to see these samples have been omitted.
Doing this "correctly" - i.e. the overflow bucket is part of the percentile calculation - is potentially tricky from outside hdr_histogram as it is the thing which calculates percentiles etc. We might have to look at modifying hdr_histogram to track the overflow bucket itself.