Couchbase Server / MB-61394

Magma::getStats observed taking a long time



    Description

      Spinning out magma specific issue from MB-61376 (KV will still move stat "all" processing off worker thread)

      After adding some trace spans and forcing long checkpoints (big flush times), I can see that the "all" stats group spends most of its time in MagmaKVStore::getAggrDbFileInfo, which uses Magma::getStats.

      E.g. here's a slow op. I've formatted the trace output and also (ab)used some existing trace identifiers just for debug purposes; thus in the following output disassociate_bucket is wrapped around the call to kvBucket->getFileStats, which just gets the file stats via getAggrDbFileInfo.

      2024-04-03T11:07:15.136579+01:00 WARNING 282: Slow operation: {"bucket":"default","cid":"[::1]:59885/5c1496f8","command":"STAT","duration":"1476 ms","packet":{"bodylen":0,"cas":0,"datatype":"raw","extlen":0,"keylen":0,"magic":"ClientRequest","opaque":1544853240,"opcode":"STAT","vbucket":0},"peer":{"ip":"::1","port":59885},
       
      "trace":
        "throttled=1108939726321923:38
         sync_write.prepare=1108939726360892:747
         associate_bucket=1108939727109441:8
         disassociate_bucket=1108939727118119:1473652 <- timing of kvBucket->getFileStats(collector);
         storage_engine_stats=1108941201052251:929
         update_privilege_context=1108939726321847:1475662 <- timing of EventuallyPersistentEngine::doEngineStatsLowCardinality
         set.with.meta=1108941201985030:542
         get.stats=1108939726317598:1476210
         execute=1108939726308747:1476390
         request=1108939726308747:1476390",
       
      "worker_tid":4867}
      

      To reproduce this issue I used the following approach:

      • Single node (cluster_run) with a small number of vbuckets (e.g. 16) and a 6 GB quota (lots of queueing space)
      • Add a sleep to the ep-engine flusher (so that we gather lots of data in memory before flushing)
        • 20s sleep for example
      • Run pillowfight with a 100% write workload
      • In a loop, run cbstats all

      After a few minutes, plenty of slow STAT calls were logged, and with the extra tracing all of them showed the time being spent in getAggrDbFileInfo.
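The "run cbstats all in a loop" step can be scripted so that slow calls are flagged automatically. A rough driver sketch, under assumptions not stated in the report: cbstats is on PATH, the first cluster_run node's memcached port is 12000, and 1 s is the "slow" threshold.

```python
import subprocess
import time

def time_command(cmd: list[str]) -> float:
    """Run a command, discarding its output, and return wall-clock seconds."""
    start = time.monotonic()
    subprocess.run(cmd, stdout=subprocess.DEVNULL,
                   stderr=subprocess.DEVNULL, check=False)
    return time.monotonic() - start

def watch_stats(iterations: int, threshold_s: float = 1.0) -> list[float]:
    """Repeatedly run 'cbstats ... all' and collect timings over the threshold."""
    # Assumed defaults: cbstats on PATH, cluster_run node 0 on localhost:12000.
    cmd = ["cbstats", "localhost:12000", "all", "-b", "default"]
    slow = []
    for _ in range(iterations):
        elapsed = time_command(cmd)
        if elapsed > threshold_s:
            print(f"slow 'stats all': {elapsed * 1000:.0f} ms")
            slow.append(elapsed)
    return slow
```

Run alongside pillowfight; any call that exceeds the threshold should correspond to a "Slow operation" warning like the one above in the memcached log.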

      People

        ankush.sharma Ankush Sharma
        jwalker Jim Walker
