If checkpoints are able to grow large - e.g. a slow DCP client combined with plentiful RAM - calling getNumItemsForCursor becomes noticeably slow: the function is O(n), where n is the number of items.
This function is called from many stat paths, and in 7.x it is also called periodically from the Prometheus gathering (civetweb threads).
The function itself takes the CheckpointManager::queueLock, which is required by many KV paths: queueing new items, querying the high-seqno, and so on.
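To illustrate the contention, here is a minimal sketch (hypothetical types and names, not the actual kv_engine code): the same lock serialises front-end queueing and the O(n) stat walk, so a slow stat read stalls writers for its full duration.

```cpp
#include <cstddef>
#include <list>
#include <mutex>

// Hypothetical simplification of a checkpoint: one queueLock guards both
// front-end writes and the stat computation.
struct CheckpointLike {
    std::mutex queueLock;
    std::list<int> items;

    // Front-end path: queue a new item (brief critical section).
    void queueItem(int v) {
        std::lock_guard<std::mutex> lh(queueLock);
        items.push_back(v);
    }

    // Stat path: O(n) walk of every item, all while holding queueLock.
    // With 300k items this is milliseconds of lock hold time per call.
    size_t numItemsForCursor() {
        std::lock_guard<std::mutex> lh(queueLock);
        size_t count = 0;
        for (auto it = items.begin(); it != items.end(); ++it) {
            ++count;
        }
        return count;
    }
};
```

Any thread calling numItemsForCursor (e.g. a Prometheus scrape) blocks every queueItem caller until the walk completes, which is how a stats-only path degrades front-end latency.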
On a MacBook (2.6 GHz 6-Core Intel Core i7), rough benchmarking shows that with 300k items stored the function takes 20ms. Combined with many cursors and many vbuckets in a similar state, the Prometheus scrape takes a significant amount of time, impacting many parts of KV as noted in
MB-57296 and MB-56891 (see all linked issues).
This ticket tracks any solution we may wish to apply in neo.
Note that this issue is not expected to be a problem on the master branch, where the function is O(1) as of this change - https://review.couchbase.org/c/kv_engine/+/179571
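The shape of the O(1) approach can be sketched as follows (assumed names, not the actual change from the review above): rather than counting items between the cursor and the end of the queue on demand, maintain a per-cursor remaining-items counter, incremented when items are queued and decremented as the cursor advances.

```cpp
#include <algorithm>
#include <cstddef>
#include <list>

// Hypothetical sketch of an O(1) items-remaining stat. The counter is
// updated at mutation time, so the stat read does no iteration.
class CursorCountSketch {
public:
    void queueItem(int v) {
        items.push_back(v);
        ++itemsRemaining; // O(1) bookkeeping at enqueue time
    }

    void cursorAdvance(size_t n) {
        // The cursor consumed n items; decrement instead of recounting.
        itemsRemaining -= std::min(n, itemsRemaining);
    }

    // Previously an O(n) walk; now a constant-time read, so holding the
    // checkpoint lock for a stats call is no longer expensive.
    size_t getNumItemsForCursor() const {
        return itemsRemaining;
    }

private:
    std::list<int> items;
    size_t itemsRemaining = 0;
};
```

The trade-off is the usual one: a little extra bookkeeping on every queue/advance in exchange for removing the O(n) cost from the frequently-polled stats path.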
||Issue||Resolution||
|The computation of the items-remaining DCP/Checkpoint stats exposed to Prometheus was an O(N) function, where N is the number of items in a checkpoint. This caused various performance issues, including Prometheus stats timeouts when checkpoints accumulated a high number of items.|The computation has been optimized and is now O(1).|
How to check for this issue
If both of the following symptoms are seen in the same time period (on an affected version) then you have likely hit this issue:
- Persistent timeouts of Prometheus metrics requests on the KV endpoint (note this will typically result in gaps in the KV-Engine Prometheus metrics for the affected time period):
- KV-Engine reports clock jumps (seen due to internal contention delaying the scheduling of the clock-check thread):