Details
- Type: Bug
- Resolution: Fixed
- Priority: Critical
- Affects Versions: 7.0.0, 7.0.1, 7.0.2, 7.0.3, 7.0.4, 7.0.5, 7.1.0, 7.1.1, 7.1.2, 7.1.3, 7.1.4, 7.1.5, 7.2.0
- Triage: Untriaged
- Votes: 0
- Is this a Regression?: Unknown
- Sprint: KV 2023-4
Description
If checkpoints are able to grow (e.g. a slow DCP client and plentiful RAM), calling getNumItemsForCursor becomes noticeably slow; the function is O(n), where n is the number of items.
This function is called from many stat paths, and in 7.x it is also called periodically from Prometheus stat gathering (civetweb threads).
The function call itself takes the CheckpointManager::queueLock, which is required by many KV paths: queueing new items, querying the high-seqno, and so on.
On a MacBook (2.6 GHz 6-Core Intel Core i7), rough benchmarking shows that with 300k items stored the function takes around 20ms. Combined with many cursors and many vbuckets all in a similar state, the Prometheus scrape takes a significant amount of time, impacting many parts of KV as noted by MB-57296 and MB-56891 (see all linked issues).
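For a rough sense of scale (an illustration, not a measurement from this ticket): assuming 1024 active vBuckets, a typical production bucket, a single pass at ~20ms per call is already about 1024 × 20ms ≈ 20s, which alone exceeds the 10s Prometheus scrape timeout visible in the log excerpt further below, and multiple cursors per vBucket make it worse.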
This ticket tracks any solution we may wish to apply in neo.
Note that the issue here is not expected to be a problem on the master branch, where the function is already O(1) as per this change: https://review.couchbase.org/c/kv_engine/+/179571
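For illustration, the sketch below is a minimal, self-contained C++ example (not the actual kv_engine code; names such as SimpleCheckpointQueue, itemsAheadOfCursor and advanceCursor are invented) contrasting an items-for-cursor count that walks the list under the queue lock (O(n)) with one that maintains the count incrementally so that reading the stat is O(1), which is the shape of the fix linked above.

// Hypothetical sketch only; not kv_engine's CheckpointManager.
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <iterator>
#include <list>
#include <mutex>

struct Item {
    uint64_t seqno;
};

class SimpleCheckpointQueue {
public:
    void queueItem(uint64_t seqno) {
        std::lock_guard<std::mutex> lh(queueLock);
        items.push_back(Item{seqno});
        ++itemsAheadOfCursor; // O(1) bookkeeping used by the fast stat
    }

    // O(n) stat: step past the items the cursor has already consumed and
    // count the remainder, all while holding queueLock. With ~300k items
    // this walk is what makes each stat call take milliseconds.
    size_t getNumItemsForCursorLinear() const {
        std::lock_guard<std::mutex> lh(queueLock);
        auto it = items.begin();
        std::advance(it, itemsConsumed);
        return static_cast<size_t>(std::distance(it, items.end()));
    }

    // O(1) stat: return the maintained counter; the lock is held only long
    // enough to read one integer.
    size_t getNumItemsForCursorConstant() const {
        std::lock_guard<std::mutex> lh(queueLock);
        return itemsAheadOfCursor;
    }

    // The cursor consumes one item (e.g. a DCP stream sent it).
    void advanceCursor() {
        std::lock_guard<std::mutex> lh(queueLock);
        if (itemsAheadOfCursor > 0) {
            ++itemsConsumed;
            --itemsAheadOfCursor;
        }
    }

private:
    mutable std::mutex queueLock;
    std::list<Item> items;
    size_t itemsConsumed = 0;      // items the cursor has moved past
    size_t itemsAheadOfCursor = 0; // maintained count for the O(1) stat
};

int main() {
    SimpleCheckpointQueue q;
    for (uint64_t s = 1; s <= 300000; ++s) {
        q.queueItem(s);
    }
    // Both report 300000 remaining items, but the linear variant walks the
    // whole list under queueLock while the constant variant does not.
    std::cout << q.getNumItemsForCursorLinear() << " "
              << q.getNumItemsForCursorConstant() << "\n";
    return 0;
}

The point of the O(1) variant is that the lock is held for a constant, tiny amount of time regardless of checkpoint size, so stat gathering cannot stall the front-end and DCP paths that need the same lock.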
Issue | Resolution |
Computing the count for the items-remaining DCP/Checkpoint stats exposed to Prometheus was an O(N) operation, where N is the number of items in a checkpoint. This caused various performance issues, including Prometheus stats timeouts when checkpoints accumulated a large number of items. | The count computation has been optimized and is now O(1). |
How to check for this issue
If both of the following symptoms are seen in the same time period (on an affected version), then you have likely hit this issue:
- Persistent timeouts of Prometheus metric requests on the KV endpoint (note that this typically results in gaps in the KV-Engine Prometheus metrics for the affected time period):
$ grep Scrape ns_server.prometheus.log | grep 'scrape_pool=kv_high.*context deadline exceeded'
level=debug ts=2023-11-21T06:22:02.161Z caller=scrape.go:1129 component="scrape manager" scrape_pool=kv_high_cardinality target="http://127.0.0.1:11280/_prometheusMetricsHigh?timeout=10000" msg="Scrape failed" err="Get \"http://127.0.0.1:11280/_prometheusMetricsHigh?timeout=10000\": context deadline exceeded"
- KV-Engine reports clock jumps (caused by internal contention delaying the scheduling of the clock-check thread):
$ grep "WARNING system clock changed" memcached.log
2023-11-21T06:22:21.342070+00:00 WARNING system clock changed? Expected delta of 60s ±1 since last check, actual difference = 46s, memcached_epoch = 1700348601, memcached_uptime = 199140, new memcached_epoch = 1700348601, next check 199200
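To see why contention can produce these warnings, here is a minimal hypothetical sketch (not memcached's actual implementation; the message only mimics the log format above) of a fixed-cadence clock check that flags system-clock deltas outside an expected 60s ±1 window. A check that runs late because its thread was starved skews the measured delta of that run and of the following one, so the warning fires without any real clock change.

// Hypothetical sketch only; not memcached's clock-check code.
#include <chrono>
#include <iostream>
#include <thread>

int main() {
    using namespace std::chrono;
    constexpr auto interval = seconds(60);
    constexpr auto tolerance = seconds(1);

    auto nextWake = steady_clock::now() + interval; // fixed cadence
    auto lastSample = system_clock::now();

    for (;;) {
        std::this_thread::sleep_until(nextWake);
        nextWake += interval;

        // Measure how much wall-clock time passed since the previous check.
        auto now = system_clock::now();
        auto delta = duration_cast<seconds>(now - lastSample);
        lastSample = now;

        // If this run (or the previous one) was scheduled late, delta drifts
        // outside the window and the warning fires, even though the system
        // clock never actually changed.
        if (delta < interval - tolerance || delta > interval + tolerance) {
            std::cerr << "WARNING system clock changed? Expected delta of "
                      << interval.count() << "s +/-" << tolerance.count()
                      << " since last check, actual difference = "
                      << delta.count() << "s\n";
        }
    }
}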
Issue Links
- causes
  - MB-57990 ep_dcp_other_items_remaining can't reach 0 after hours on build 7.2.1-5878 (Closed)
  - MB-56737 bucket doc stats not correct (Closed)
- is duplicated by
  - MB-56982 N1QL queries are failing as KV lookup couldn't finish in ~75s. (Resolved)
- relates to
  - MB-57296 BucketManager::forEach should not require exclusive access to the buckets (Closed)
  - MB-56891 Don't block updates to DcpConnmap while "dcp" and "dcpagg" stats are executing (Closed)