Couchbase Server / MB-57400

ActiveStream::getItemsRemaining can cause KV gridlock when many items are held in checkpoints.



    Description

      If checkpoints are able to grow - e.g. a slow DCP client and plentiful RAM - calling getNumItemsForCursor becomes noticeably slow: the function is O(n), where n is the number of items.
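      The sketch below is illustrative only (a simplified checkpoint modelled as a std::list, with the cursor held as an iterator; this is not the kv_engine implementation): counting the items remaining for a cursor means walking the item list from the cursor's position to the end, which is linear in the number of items ahead of the cursor.

        // Minimal sketch (not the kv_engine code): the "items remaining" count is
        // produced by walking from the cursor's position to the end of the list.
        #include <cstddef>
        #include <list>
        #include <string>

        struct Item {
            std::string key;
        };

        struct Checkpoint {
            std::list<Item> items;
        };

        struct Cursor {
            // Position within the (single, simplified) checkpoint.
            std::list<Item>::const_iterator pos;
        };

        // O(n) in the number of items still ahead of the cursor; with ~300k items
        // this walk is what costs ~20ms in the rough benchmark noted below.
        std::size_t getNumItemsForCursor(const Checkpoint& ckpt, const Cursor& cursor) {
            std::size_t remaining = 0;
            for (auto it = cursor.pos; it != ckpt.items.end(); ++it) {
                ++remaining;
            }
            return remaining;
        }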

      This function is called from many stat paths and, in 7.x, is periodically called from the Prometheus gathering (civetweb threads).

      The function call itself takes the CheckpointManager::queueLock, which is required by many KV paths: queueing new items, querying the high-seqno, and so on.
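      As a minimal, hedged sketch of the contention pattern (SimpleCheckpointManager and its members are invented for illustration and are not the real CheckpointManager), both the front-end write path and the stats path serialise on the same mutex, so an O(n) count holds up new mutations for its entire duration:

        #include <cstddef>
        #include <list>
        #include <mutex>
        #include <string>

        class SimpleCheckpointManager {
        public:
            // Front-end write path: queue a new mutation (takes queueLock).
            void queueDirty(std::string key) {
                std::lock_guard<std::mutex> lh(queueLock);
                items.push_back(std::move(key));
            }

            // Stats path (e.g. a Prometheus scrape): while this O(n) walk holds
            // queueLock, every queueDirty() / high-seqno query is blocked behind it.
            std::size_t getNumItemsForCursor(std::size_t cursorOffset) const {
                std::lock_guard<std::mutex> lh(queueLock);
                std::size_t remaining = 0;
                std::size_t idx = 0;
                for (const auto& item : items) {
                    (void)item;
                    if (idx++ >= cursorOffset) {
                        ++remaining;
                    }
                }
                return remaining;
            }

        private:
            mutable std::mutex queueLock;
            std::list<std::string> items;
        };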

      On a MacBook (2.6 GHz 6-Core Intel Core i7) some rough benchmarking shows that when 300k items are stored the function takes 20ms; combined with many cursors and many vbuckets all in a similar state, the Prometheus scrape takes a significant amount of time, impacting many parts of KV as noted by MB-57296 and MB-56891 (see all linked issues).
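      As a rough, illustrative calculation (the vbucket and cursor counts below are assumptions, not measurements from this ticket): with 1024 active vbuckets, two cursors per vbucket and ~20ms per call, a single high-cardinality scrape would spend on the order of

        1024 vbuckets × 2 cursors × 20ms per call ≈ 41s

      inside getNumItemsForCursor, comfortably exceeding the 10s scrape timeout visible in the logs below.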

      This ticket tracks any solution we may wish to apply in neo.

      Note that the issue here is not expected to be a problem on the master branch, because the function is O(1) as of this change: https://review.couchbase.org/c/kv_engine/+/179571
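      One way to make the count O(1) is to maintain running counters so that "items remaining" becomes a subtraction rather than a walk. The class below is invented purely for illustration and is not the code from that change; see the Gerrit review above for the actual implementation.

        #include <cstdint>

        class CursorDistanceTracker {
        public:
            // Called from the write path each time an item is queued.
            void onItemQueued() {
                ++itemsQueued;
            }

            // Called each time the cursor is advanced past an item.
            void onCursorAdvanced() {
                ++itemsProcessed;
            }

            // O(1): no traversal of the checkpoint is needed, so any lock taken
            // by the stats path is held for a trivially short time.
            std::uint64_t getItemsRemaining() const {
                return itemsQueued - itemsProcessed;
            }

        private:
            std::uint64_t itemsQueued = 0;
            std::uint64_t itemsProcessed = 0;
        };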

       

      Issue Resolution
      The computation of the "items remaining" DCP/Checkpoint stats exposed to Prometheus was an O(N) function, where N is the number of items in a checkpoint. This caused various performance issues, including Prometheus stats timeouts, when checkpoints accumulated a high number of items. The computation has been optimized and is now O(1).

      How to check for this issue

      If both of the following symptoms are seen in the same time period (on an affected version) then you have likely hit this issue:

      1. Persistent timeouts of Prometheus metric requests on the KV endpoint (note that this will typically result in gaps in the KV-Engine Prometheus metrics for the affected time period):

        $ grep Scrape ns_server.prometheus.log | grep 'scrape_pool=kv_high.*context deadline exceeded'
        level=debug ts=2023-11-21T06:22:02.161Z caller=scrape.go:1129 component="scrape manager" scrape_pool=kv_high_cardinality target="http://127.0.0.1:11280/_prometheusMetricsHigh?timeout=10000" msg="Scrape failed" err="Get \"http://127.0.0.1:11280/_prometheusMetricsHigh?timeout=10000\": context deadline exceeded"
        

      2. KV-Engine reports clock jumps (seen because internal contention delays scheduling of the clock-check thread):

        $ grep "WARNING system clock changed" memcached.log
        2023-11-21T06:22:21.342070+00:00 WARNING system clock changed? Expected delta of 60s ±1 since last check, actual difference = 46s, memcached_epoch = 1700348601, memcached_uptime = 199140, new memcached_epoch = 1700348601, next check 199200
        
