Couchbase Server / MB-61394

Magma::getStats observed taking a long time



    Description

      Spinning out a Magma-specific issue from MB-61376 (KV will still move stat "all" processing off the worker thread).

      After adding some trace spans and forcing long checkpoints (big flush times), I can see that the "all" stats group spends most of its time in MagmaKVStore::getAggrDbFileInfo, which uses Magma::getStats.
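      To make the shape of that call concrete, here is a minimal sketch of an aggregate-then-return pattern like getAggrDbFileInfo. It is not the actual MagmaKVStore code: the Shard and FileInfo types, the field names, and the per-shard getStats() call are all assumptions for illustration.

        #include <cstdint>
        #include <vector>

        // Hypothetical per-shard file stats, standing in for whatever
        // Magma::getStats reports; field names are illustrative.
        struct FileInfo {
            std::uint64_t spaceUsed = 0;
            std::uint64_t fileSize = 0;
            std::uint64_t numFiles = 0;
        };

        struct Shard {
            // Placeholder for the real Magma::getStats call, which may be
            // expensive when a lot of unflushed data has accumulated.
            FileInfo getStats() const {
                return {1024, 4096, 1};
            }
        };

        // getAggrDbFileInfo-style aggregation: every shard is queried
        // synchronously, so a single slow getStats() stalls the whole
        // STAT "all" request on the calling thread.
        FileInfo aggregateFileInfo(const std::vector<Shard>& shards) {
            FileInfo aggr;
            for (const auto& shard : shards) {
                const auto info = shard.getStats();
                aggr.spaceUsed += info.spaceUsed;
                aggr.fileSize += info.fileSize;
                aggr.numFiles += info.numFiles;
            }
            return aggr;
        }

      The point is that the aggregation runs synchronously on the calling thread, which is why a slow Magma::getStats shows up directly in the STAT op duration.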

      E.g. here's a slow op. I've formatted the trace output, and I also (ab)used some existing trace identifiers just for debug purposes; thus in the following output disassociate_bucket is wrapped around the call to kvBucket->getFileStats, which just gets the file stats via getAggrDbFileInfo. The trailing figure on each span is its duration in microseconds.

      2024-04-03T11:07:15.136579+01:00 WARNING 282: Slow operation: {"bucket":"default","cid":"[::1]:59885/5c1496f8","command":"STAT","duration":"1476 ms","packet":{"bodylen":0,"cas":0,"datatype":"raw","extlen":0,"keylen":0,"magic":"ClientRequest","opaque":1544853240,"opcode":"STAT","vbucket":0},"peer":{"ip":"::1","port":59885},
       
      "trace":
        "throttled=1108939726321923:38
         sync_write.prepare=1108939726360892:747
         associate_bucket=1108939727109441:8
         disassociate_bucket=1108939727118119:1473652 <- timing of kvBucket->getFileStats(collector);
         storage_engine_stats=1108941201052251:929
         update_privilege_context=1108939726321847:1475662 <- timing of EventuallyPersistentEngine::doEngineStatsLowCardinality
         set.with.meta=1108941201985030:542
         get.stats=1108939726317598:1476210
         execute=1108939726308747:1476390
         request=1108939726308747:1476390",
       
      "worker_tid":4867}
      

      To reproduce this issue I used the following approach:

      • Single node (cluster_run) with a few vbuckets, e.g. 16, and a 6GB quota (lots of queueing space)
      • Add a sleep to the ep-engine flusher, so that lots of data gathers in memory before each flush (see the sketch after this list)
        • A 20s sleep, for example
      • Run pillowfight with 100% writes
      • In a loop, run cbstats all
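      The flusher sleep is a local debug hack rather than any shipped option. A minimal sketch of the idea, assuming a per-iteration flusher step; flusherStep and flushPendingVBuckets are illustrative names, not the real ep-engine functions:

        #include <chrono>
        #include <thread>

        // Stand-in for the real per-vbucket flush work.
        static void flushPendingVBuckets() {
        }

        // Illustrative flusher step: the injected sleep means dirty items
        // accumulate in checkpoints for ~20s, so each flush batch is large
        // and plenty of unflushed state builds up between flushes.
        void flusherStep() {
            // Debug-only delay; forces big, infrequent flush batches.
            std::this_thread::sleep_for(std::chrono::seconds(20));
            flushPendingVBuckets();
        }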

      After a few minutes plenty of slow STAT calls had been logged, and with the extra tracing all of them showed the time was spent in getAggrDbFileInfo.
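      If you want to reproduce the measurement without wiring up new trace identifiers, a scoped timer around the suspect call gives the same signal. This is a plain std::chrono sketch, not KV-Engine's actual tracing framework; the kvBucket->getFileStats(collector) call site is taken from the trace above:

        #include <chrono>
        #include <cstdio>

        // Minimal scoped timer: logs elapsed microseconds at end of scope.
        // Stands in for the (ab)used trace spans described earlier.
        class ScopedTimer {
        public:
            explicit ScopedTimer(const char* label)
                : label(label), start(std::chrono::steady_clock::now()) {
            }
            ~ScopedTimer() {
                const auto elapsed =
                        std::chrono::duration_cast<std::chrono::microseconds>(
                                std::chrono::steady_clock::now() - start);
                std::fprintf(stderr, "%s took %lld us\n", label,
                             static_cast<long long>(elapsed.count()));
            }
        private:
            const char* label;
            std::chrono::steady_clock::time_point start;
        };

        // Usage at the suspect call site (kvBucket and collector are
        // assumed to exist in the surrounding function):
        //
        //     {
        //         ScopedTimer timer("getFileStats");
        //         kvBucket->getFileStats(collector);
        //     }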


            People

              ankush.sharma Ankush Sharma
              jwalker Jim Walker
              Votes:
              0 Vote for this issue
              Watchers:
              9 Start watching this issue

