Details
Improvement
Resolution: Done
Critical
7.0.0
1
KV-Engine Sprint 2021 June
Description
Prometheus encodes sample timestamps as a delta-of-deltas (DoD) in milliseconds, for example:
TS    Delta  DoD
1000
2000  1000
3000  1000   0
4010  1010   10
5010  1000   -10
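As a minimal sketch of the arithmetic behind the table, the deltas and delta-of-deltas can be recomputed from the raw timestamps. The function name `delta_of_deltas` is purely illustrative and is not taken from Prometheus or KV.

```python
def delta_of_deltas(timestamps_ms):
    """Return (deltas, dods) for a sequence of millisecond timestamps."""
    # Delta: difference between consecutive timestamps.
    deltas = [b - a for a, b in zip(timestamps_ms, timestamps_ms[1:])]
    # Delta-of-delta: difference between consecutive deltas.
    dods = [b - a for a, b in zip(deltas, deltas[1:])]
    return deltas, dods


deltas, dods = delta_of_deltas([1000, 2000, 3000, 4010, 5010])
print(deltas)  # [1000, 1000, 1010, 1000]
print(dods)    # [0, 10, -10]
```

Note that a single late scrape (4010) produces two nonzero DoDs: one when the interval stretches and one when it returns to normal.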
These delta-of-deltas are encoded in a variable number of bits in the chunk files.
If a DoD is exactly 0, it is encoded in a single bit. If it is nonzero, the next bit width Prometheus will use is 14 bits, plus a 2-bit prefix. That is, if the DoD is off by even a single millisecond, the size on disk increases from 1 bit to 16 bits (14 bits + 2 prefix bits).
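The cost per DoD can be sketched as a small lookup. Only the 1-bit and 14+2-bit cases come from this ticket; the wider buckets (17+3, 20+4, 64+4 bits) and the signed range of each bucket reflect my reading of the Prometheus XOR chunk encoder and should be treated as assumptions. `timestamp_dod_bits` is a hypothetical helper name.

```python
def timestamp_dod_bits(dod_ms):
    """Approximate bits used to store one timestamp DoD, prefix included."""
    if dod_ms == 0:
        return 1  # a single '0' bit
    # (payload bits, prefix bits). Assumption: a signed n-bit bucket
    # holds values in the range -(2**(n-1) - 1) .. 2**(n-1).
    for payload, prefix in ((14, 2), (17, 3), (20, 4)):
        if -(2 ** (payload - 1) - 1) <= dod_ms <= 2 ** (payload - 1):
            return payload + prefix
    return 64 + 4  # assumed fallback: full 64-bit value with a 4-bit prefix


for dod in (0, 1, -10, 31, 100):
    print(dod, timestamp_dod_bits(dod))
```

Under these assumptions every nonzero DoD in the list above costs 16 bits, whether it is 1 ms or 100 ms.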
Pulling data from the breakdown of a set of logs in MB-45843 (bearing in mind that this is only a single data point), it appears that over the time covered by the logs roughly 12% of sample DoDs were exactly 0 and 86% were encoded as 14+2 bits. The remainder were large enough to require more than 14 bits to encode (i.e., a relatively small number of samples had a DoD of more than roughly 8s in magnitude, as the 14-bit value is signed).
However, the vast majority (97%) of all DoDs could have been encoded in 5 bits, suggesting they were 31ms or lower. If a value could have been encoded in 5 bits, but needed to be padded out to the next predetermined bitwidth, 14 bits, 9 or more bits are essentially wasted.
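A rough back-of-the-envelope estimate of the padding, using only the percentages quoted above. It assumes the 97% figure includes the 12% of DoDs that are exactly 0, which the ticket does not state explicitly; the variable names are illustrative.

```python
share_zero = 0.12        # DoDs that are exactly 0 (1 bit each)
share_fits_5bit = 0.97   # DoDs that would fit in 5 bits (assumed to include zeros)

# Nonzero DoDs that would fit in 5 bits but are stored in the 14-bit bucket.
share_small_nonzero = share_fits_5bit - share_zero

# Each such DoD carries at least 14 - 5 = 9 bits of padding.
padding_bits_each = 14 - 5

# Average padding per stored timestamp, across all samples.
avg_padding_bits = share_small_nonzero * padding_bits_each
print(f"{share_small_nonzero:.0%} of samples, "
      f"about {avg_padding_bits:.2f} padding bits per sample on average")
```

Under that assumption, roughly 85% of samples carry 9 or more padding bits each.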
To a degree, DoDs represent "jitter" in the scrape interval. If the interval was consistent to the millisecond, all DoDs would be exactly 0. Even the small jitter seen in most cases (<=31ms) increases the disk usage of a given sample significantly.
The Prometheus exposition format does have provision for an exporter to include a timestamp with each sample. This means KV can take control over the reported time for each sample.
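For illustration, a sample line in the text exposition format carries the optional timestamp (milliseconds since the Unix epoch) after the value. The sketch below builds such a line; the helper `exposition_line` and the metric name `kv_ops` are hypothetical, and label-value escaping is deliberately omitted.

```python
def exposition_line(name, labels, value, timestamp_ms):
    """Format one sample as 'name{labels} value timestamp_ms'."""
    # Note: real label values need escaping of backslash, quote and newline;
    # that is left out of this sketch.
    label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    selector = f"{name}{{{label_str}}}" if label_str else name
    return f"{selector} {value} {timestamp_ms}"


print(exposition_line("kv_ops", {"bucket": "default"}, 42, 1623400000000))
# kv_ops{bucket="default"} 42 1623400000000
```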
The simplest method to increase how many DoDs are exactly zero would be for KV to round the sample time to the nearest 100ms. This means as long as Prometheus scrapes are received with an interval consistent to 1/10th of a second (which appears to be the case, based on the <=31ms DoDs for most samples), the computed DoD will be 0.
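A minimal sketch of the rounding idea, assuming non-negative millisecond timestamps. The function names and the jittered scrape times are made up for illustration and are not taken from the KV code or from real logs.

```python
def round_to_step(timestamp_ms, step_ms=100):
    """Round a non-negative millisecond timestamp to the nearest step."""
    return ((timestamp_ms + step_ms // 2) // step_ms) * step_ms


def dods(timestamps_ms):
    """Delta-of-deltas for a sequence of millisecond timestamps."""
    deltas = [b - a for a, b in zip(timestamps_ms, timestamps_ms[1:])]
    return [b - a for a, b in zip(deltas, deltas[1:])]


# Hypothetical scrape times: a 10 s interval with up to ~31 ms of jitter.
raw = [1000, 11012, 21003, 31031, 41020]
rounded = [round_to_step(t) for t in raw]

print(dods(raw))      # [-21, 37, -39]
print(rounded)        # [1000, 11000, 21000, 31000, 41000]
print(dods(rounded))  # [0, 0, 0]
```

In this example every DoD drops from a nonzero value (16 bits on disk) to 0 (1 bit on disk).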
This means that the sample time stored in Prometheus could be up to 50ms away from the true sample time. Given that the scrape interval is typically 10 or more seconds, the error this deviation may introduce in e.g., rate calculations is likely an acceptable tradeoff.
This does come with the risk of scrapes arriving very near a rounding boundary, so that the reported interval appears to flip-flop up and down depending on whether the time was rounded up or down. However, a +/- 100ms DoD is representable in 7 or 8 bits, which will be expanded out to 14+2 bits, no worse than is seen without rounding.
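The boundary case can be sketched with made-up scrape times that straddle a rounding boundary (x049 ms versus x051 ms). The helper names are illustrative, and the signed 14-bit limit used in the check is my assumption about the Prometheus encoder rather than something stated in this ticket.

```python
def round_to_step(timestamp_ms, step_ms=100):
    """Round a non-negative millisecond timestamp to the nearest step."""
    return ((timestamp_ms + step_ms // 2) // step_ms) * step_ms


def dods(timestamps_ms):
    """Delta-of-deltas for a sequence of millisecond timestamps."""
    deltas = [b - a for a, b in zip(timestamps_ms, timestamps_ms[1:])]
    return [b - a for a, b in zip(deltas, deltas[1:])]


# Hypothetical scrapes with only 2 ms of jitter, but sitting on a boundary.
raw = [1049, 11049, 21051, 31051]
rounded = [round_to_step(t) for t in raw]

print(dods(raw))      # [2, -2]
print(rounded)        # [1000, 11000, 21100, 31100]
print(dods(rounded))  # [100, -100]

# Both sets of DoDs are nonzero and well inside the (assumed signed) 14-bit
# bucket, so each costs 14+2 bits with or without rounding.
assert all(d != 0 and abs(d) < 2 ** 13 for d in dods(raw) + dods(rounded))
```

Rounding amplifies the 2 ms jitter to 100 ms here, but the on-disk cost per sample is unchanged because both values land in the same bucket.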