Details
- Type: Bug
- Resolution: Fixed
- Priority: Critical
- Affects Versions: 7.1.0, 7.1.1, 7.1.2, 7.1.3, 7.1.4, 7.1.5, 7.1.6, 7.1.7, 7.2.0, 7.2.1, 7.2.2, 7.2.3, 7.2.4, 7.2.5, 7.2.6, 7.6.0, 7.6.1
- Triage: Untriaged
- 0
- Unknown
Description
Spinning out the magma-specific issue from MB-61376 (KV will still move stat "all" processing off the worker thread).
After adding some trace spans and forcing long checkpoints (large flush times), I can see that the "all" stats group spends most of its time in MagmaKVStore::getAggrDbFileInfo, which uses Magma::getStats.
For example, here's a slow op. I've formatted the trace output, and I also (ab)used some existing trace identifiers purely for debugging purposes - so in the following output disassociate_bucket is wrapped around the call to kvBucket->getFileStats, which just fetches the file stats via getAggrDbFileInfo:
2024-04-03T11:07:15.136579+01:00 WARNING 282: Slow operation: {"bucket":"default","cid":"[::1]:59885/5c1496f8","command":"STAT","duration":"1476 ms","packet":{"bodylen":0,"cas":0,"datatype":"raw","extlen":0,"keylen":0,"magic":"ClientRequest","opaque":1544853240,"opcode":"STAT","vbucket":0},"peer":{"ip":"::1","port":59885},
"trace":
    "throttled=1108939726321923:38
    sync_write.prepare=1108939726360892:747
    associate_bucket=1108939727109441:8
    disassociate_bucket=1108939727118119:1473652 <- timing of kvBucket->getFileStats(collector);
    storage_engine_stats=1108941201052251:929
    update_privilege_context=1108939726321847:1475662 <- timing of EventuallyPersistentEngine::doEngineStatsLowCardinality
    set.with.meta=1108941201985030:542
    get.stats=1108939726317598:1476210
    execute=1108939726308747:1476390
    request=1108939726308747:1476390",
"worker_tid":4867}
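For illustration, the span wrapping above follows the usual RAII scoped-timer pattern; below is a minimal, self-contained sketch of that pattern. The names (ScopedSpan, fakeGetAggrDbFileInfo) are illustrative only, not the actual KV-engine tracing API, and the 1500 ms sleep simply stands in for the slow aggregate file-stats call seen in the trace.

    #include <chrono>
    #include <cstdio>
    #include <string>
    #include <thread>

    // RAII timer: records a start time on construction and prints
    // "<name>:<duration in microseconds>" on destruction, mirroring the
    // name=start:duration format of the slow-op trace dump above.
    class ScopedSpan {
    public:
        explicit ScopedSpan(std::string name)
            : name(std::move(name)), start(std::chrono::steady_clock::now()) {}
        ~ScopedSpan() {
            const auto us = std::chrono::duration_cast<std::chrono::microseconds>(
                                    std::chrono::steady_clock::now() - start)
                                    .count();
            std::printf("%s:%lld\n", name.c_str(), static_cast<long long>(us));
        }

    private:
        std::string name;
        std::chrono::steady_clock::time_point start;
    };

    // Stand-in for the expensive kvBucket->getFileStats / getAggrDbFileInfo call.
    void fakeGetAggrDbFileInfo() {
        std::this_thread::sleep_for(std::chrono::milliseconds(1500));
    }

    int main() {
        // Equivalent of (ab)using the disassociate_bucket span to time the call.
        ScopedSpan span{"disassociate_bucket"};
        fakeGetAggrDbFileInfo();
        return 0;
    }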
In order to reproduce this issue I took the following approach:
- Single node (cluster_run) with few vbuckets, e.g. 16, and a 6GB quota (lots of queueing space)
- Add a sleep to the ep-engine flusher, e.g. 20s, so that lots of data accumulates in memory before flushing (a stand-in sketch of this hack follows below)
- Run pillowfight with 100% writes
- In a loop, run cbstats all
After a few minutes plenty of slow STAT calls had been logged, and with the extra tracing all of them showed the time being spent in getAggrDbFileInfo.
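To make the flusher-sleep step concrete, the shape of that temporary hack is sketched below. This is an illustrative stand-in, not the real ep-engine Flusher code; the point is only that delaying each flush lets queued mutations pile up in memory, giving getAggrDbFileInfo a large backlog to report on.

    #include <chrono>
    #include <thread>

    // Illustrative stand-in for a flusher loop with a debug-only delay.
    struct FakeFlusher {
        void run() {
            while (keepRunning) {
                // Debug-only delay (e.g. 20 seconds) before each flush batch so
                // that checkpoints grow large before being persisted.
                std::this_thread::sleep_for(std::chrono::seconds(20));
                flushPendingVBuckets();
            }
        }
        void flushPendingVBuckets() {
            // persist queued items (omitted in this sketch)
        }
        bool keepRunning = true;
    };

    int main() {
        FakeFlusher flusher;
        flusher.keepRunning = false; // don't actually loop in this sketch
        flusher.run();
        return 0;
    }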