Couchbase Server - MB-39422

Reduce front-end thread costs when we hit the high watermark (memoryCondition)


Details

    Description

      As observed when testing data load / rebalance workloads under high DGM, when KV-Engine hits the high watermark, a large amount of work is done on the front-end thread.

      Specifically, the attached profile shows that 60% of each active front-end thread's time is spent inside memoryCondition(), which just decides whether memory recovery should be attempted or not (it doesn't actually recover any memory).

      Note that every front-end thread is wasting 60% of its time this way; the aggregate cost across all front-end threads is correspondingly larger.

      Analysis of the profile highlights a number of issues:

      1. Excessive time spent in VBucketCountVisitor::visitBucket. 98% of all time in memoryCondition is spent in VBucketCountVisitor::visitBucket. This is called to calculate the number of resident items; if that count is non-zero, the ItemPager is woken up. However, VBucketCountVisitor::visitBucket actually accumulates ~50 or so stats, many of which are more expensive than the item counts, so ~48 stats are calculated and then simply ignored (see the first sketch after this list).
      2. Excessive calls to memoryCondition(). Every time a client operation fails because not enough memory is available (i.e. at/above the high watermark), memoryCondition is called and performs the above expensive checks (see the second sketch after this list). This is very wasteful because:
        1. The ItemPager could already be running, and it cannot be re-scheduled until it has finished anyway.
        2. Another client thread could already be running memoryCondition.
      3. memoryCondition is arguably over-complex in what it is trying to do.
        1. Firstly, it is essentially doing two things at once: determining whether we should return ETMPFAIL or NOMEM to the user, and attempting to recover memory if possible.
        2. Secondly, the memory recovery logic is complex / brittle - we attempt to predict ahead of time whether memory could be recovered, with two possible approaches: paging out items, or closing unreferenced Checkpoints. However, I suspect the prediction isn't always correct in determining whether any more memory can be freed, as it relies on indirect metrics (such as the number of resident items), which could already be zero while memory is in use elsewhere.
          See also MB-22523, which has some relevant commentary on why the current design is the way it is - at least partly a result of making minimal fixes for an issue late in the 5.0.0 development cycle.
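
      To illustrate point 1, a cut-down visitor could accumulate only the resident-item count that the wake-up decision actually needs. The following is a minimal C++ sketch; the VBucket accessors and the visitor shape are simplified stand-ins for the real ep-engine classes, not their actual API.

      // Sketch only: simplified stand-in for the real ep-engine VBucket.
      #include <cstddef>
      #include <vector>

      struct VBucket {
          // Hypothetical accessors; the real VBucket exposes similar counters.
          size_t getNumItems() const { return numItems; }
          size_t getNumNonResidentItems() const { return numNonResident; }
          size_t numItems = 0;
          size_t numNonResident = 0;
      };

      // Accumulates *only* the resident item count, instead of the ~50 stats
      // which VBucketCountVisitor gathers and memoryCondition then discards.
      class ResidentItemCountVisitor {
      public:
          void visitBucket(const VBucket& vb) {
              residentItems += vb.getNumItems() - vb.getNumNonResidentItems();
          }
          size_t getResidentItems() const { return residentItems; }

      private:
          size_t residentItems = 0;
      };

      // memoryCondition() would then only need this cheap check to decide
      // whether waking the ItemPager could plausibly free any memory.
      bool pagingCouldFreeMemory(const std::vector<VBucket>& vbuckets) {
          ResidentItemCountVisitor visitor;
          for (const auto& vb : vbuckets) {
              visitor.visitBucket(vb);
          }
          return visitor.getResidentItems() > 0;
      }

      The point is not the exact shape of the visitor, but that the wake-up decision only needs one counter, so roughly 48 of the stats currently computed per vBucket could be skipped on this path.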
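
      To illustrate point 2, the redundant invocations could be short-circuited so that at most one front-end thread performs the check at a time, and no check is performed while the ItemPager is already scheduled. Again this is only a sketch under assumed names (MemoryRecoveryTrigger, pagingCouldHelp, wakeItemPager); it is not the actual KV-Engine implementation.

      #include <atomic>

      class MemoryRecoveryTrigger {
      public:
          // Called from a front-end thread each time an operation fails with
          // a temporary-OOM condition (at/above the high watermark).
          void onTempOom() {
              // If the pager is already scheduled/running there is nothing
              // useful to do; it cannot be re-scheduled until it finishes.
              if (pagerActive.load(std::memory_order_acquire)) {
                  return;
              }
              // Let exactly one thread through to perform the (still
              // non-trivial) "could paging help?" check and wake-up.
              bool expected = false;
              if (!checkInProgress.compare_exchange_strong(
                          expected, true, std::memory_order_acq_rel)) {
                  return; // another front-end thread is already checking
              }
              if (pagingCouldHelp()) {
                  pagerActive.store(true, std::memory_order_release);
                  wakeItemPager();
              }
              checkInProgress.store(false, std::memory_order_release);
          }

          // Called by the ItemPager task when it completes.
          void onPagerComplete() {
              pagerActive.store(false, std::memory_order_release);
          }

      private:
          // Placeholders for the real "would paging free memory?" check and
          // the ItemPager task wake-up.
          bool pagingCouldHelp() { return true; }
          void wakeItemPager() {}

          std::atomic<bool> pagerActive{false};
          std::atomic<bool> checkInProgress{false};
      };

      With a guard like this, the expensive check runs at most once per pager cycle, regardless of how many front-end threads hit the high watermark concurrently.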
