Details
- Type: Bug
- Resolution: Fixed
- Priority: Critical
- Affects Versions: 7.1.0, 7.1.1, 7.1.2, 7.1.3, 7.1.4, 7.1.5, 7.1.6, 7.1.7, 7.2.0, 7.2.1, 7.2.2, 7.2.3, 7.2.4, 7.2.5, 7.2.6, 7.6.0, 7.6.1
- Triage: Untriaged
- 0
- Unknown
Description
Spinning out the magma-specific issue from MB-61376 (KV will still move stat "all" processing off the worker thread).
After adding some trace spans and forcing long checkpoints (large flush times), I can see that the "all" stats group spends most of its time in MagmaKVStore::getAggrDbFileInfo, which uses Magma::getStats.
For example, here's a slow op. I've formatted the trace output, and I also (ab)used some existing trace identifiers purely for debugging purposes - so in the following output disassociate_bucket is wrapped around the call to kvBucket->getFileStats, which just fetches the file stats via getAggrDbFileInfo:
2024-04-03T11:07:15.136579+01:00 WARNING 282: Slow operation: {"bucket":"default","cid":"[::1]:59885/5c1496f8","command":"STAT","duration":"1476 ms","packet":{"bodylen":0,"cas":0,"datatype":"raw","extlen":0,"keylen":0,"magic":"ClientRequest","opaque":1544853240,"opcode":"STAT","vbucket":0},"peer":{"ip":"::1","port":59885},
"trace":
    "throttled=1108939726321923:38
    sync_write.prepare=1108939726360892:747
    associate_bucket=1108939727109441:8
    disassociate_bucket=1108939727118119:1473652 <- timing of kvBucket->getFileStats(collector);
    storage_engine_stats=1108941201052251:929
    update_privilege_context=1108939726321847:1475662 <- timing of EventuallyPersistentEngine::doEngineStatsLowCardinality
    set.with.meta=1108941201985030:542
    get.stats=1108939726317598:1476210
    execute=1108939726308747:1476390
    request=1108939726308747:1476390",
"worker_tid":4867}
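For illustration, the span wrapping above follows the usual RAII scoped-timer pattern; below is a minimal, self-contained sketch of that pattern. The names (ScopedSpan, fakeGetAggrDbFileInfo) are illustrative only, not the actual KV-engine tracing API, and the 1500 ms sleep simply stands in for the slow aggregate file-stats call seen in the trace.

    #include <chrono>
    #include <cstdio>
    #include <string>
    #include <thread>

    // RAII timer: records a start time on construction and prints
    // "<name>:<duration in microseconds>" on destruction, mirroring the
    // name=start:duration format of the slow-op trace dump above.
    class ScopedSpan {
    public:
        explicit ScopedSpan(std::string name)
            : name(std::move(name)), start(std::chrono::steady_clock::now()) {}
        ~ScopedSpan() {
            const auto us = std::chrono::duration_cast<std::chrono::microseconds>(
                                    std::chrono::steady_clock::now() - start)
                                    .count();
            std::printf("%s:%lld\n", name.c_str(), static_cast<long long>(us));
        }

    private:
        std::string name;
        std::chrono::steady_clock::time_point start;
    };

    // Stand-in for the expensive kvBucket->getFileStats / getAggrDbFileInfo call.
    void fakeGetAggrDbFileInfo() {
        std::this_thread::sleep_for(std::chrono::milliseconds(1500));
    }

    int main() {
        // Equivalent of (ab)using the disassociate_bucket span to time the call.
        ScopedSpan span{"disassociate_bucket"};
        fakeGetAggrDbFileInfo();
        return 0;
    }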
In order to reproduce this issue I took the following approach:
- Single node (cluster_run) with few vbuckets, e.g. 16, and a 6GB quota (lots of queueing space)
- Add a sleep to the ep-engine flusher, e.g. 20s, so that lots of data accumulates in memory before flushing (a stand-in sketch of this hack follows below)
- Run pillowfight with 100% writes
- In a loop, run cbstats all
After a few minutes plenty of slow STAT calls had been logged, and with the extra tracing all of them showed the time being spent in getAggrDbFileInfo.
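To make the flusher-sleep step concrete, the shape of that temporary hack is sketched below. This is an illustrative stand-in, not the real ep-engine Flusher code; the point is only that delaying each flush lets queued mutations pile up in memory, giving getAggrDbFileInfo a large backlog to report on.

    #include <chrono>
    #include <thread>

    // Illustrative stand-in for a flusher loop with a debug-only delay.
    struct FakeFlusher {
        void run() {
            while (keepRunning) {
                // Debug-only delay (e.g. 20 seconds) before each flush batch so
                // that checkpoints grow large before being persisted.
                std::this_thread::sleep_for(std::chrono::seconds(20));
                flushPendingVBuckets();
            }
        }
        void flushPendingVBuckets() {
            // persist queued items (omitted in this sketch)
        }
        bool keepRunning = true;
    };

    int main() {
        FakeFlusher flusher;
        flusher.keepRunning = false; // don't actually loop in this sketch
        flusher.run();
        return 0;
    }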