Couchbase Server / MB-26705

DGM Rebalance (Add Node) Causes Client TMP_OOM errors



    • Untriaged
    • Unknown


      Steps to Reproduce

      • Cluster of 4 nodes with reasonable spec (>=8 cores, SSD, fast network)
      • Create one large (>10GB) bucket with 10% resident ratio
      • Client workload (e.g. pillowfight) doing ~20K writes/sec
      • Rebalance in a 5th node to the cluster

      With the steps above, the client can receive a significant spike in TMP_OOM errors from the incoming node during the rebalance, which degrades the application's performance. The suspected cause is that DCP replication streams from the existing nodes quickly saturate memory on the incoming node. The item pager is then either not invoked successfully or cannot eject items quickly enough; the conjecture is that it may need several passes before an item reaches an LRU value high enough to be ejected.
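
      The multi-pass conjecture can be illustrated with a toy model. This is only a sketch and not ep-engine code: the `Item` class, `EJECT_THRESHOLD`, and the rule "age an item by one per pass, eject only at the threshold" are assumptions made for illustration. It shows how the initial LRU value would determine the number of pager passes needed before anything is freed.

```python
from dataclasses import dataclass

# Hypothetical: an item becomes ejectable only once its LRU value reaches this.
EJECT_THRESHOLD = 3


@dataclass
class Item:
    key: str
    lru: int
    resident: bool = True


def pager_pass(items):
    """One pager pass: eject resident items at the threshold, age the rest."""
    ejected = 0
    for item in items:
        if not item.resident:
            continue
        if item.lru >= EJECT_THRESHOLD:
            item.resident = False
            ejected += 1
        else:
            item.lru += 1
    return ejected


def passes_until_first_eject(initial_lru, n_items=1000, max_passes=10):
    """Return how many passes run before any item is ejected (None if never)."""
    items = [Item(f"key{i}", initial_lru) for i in range(n_items)]
    for n in range(1, max_passes + 1):
        if pager_pass(items) > 0:
            return n
    return None


if __name__ == "__main__":
    for initial in range(EJECT_THRESHOLD + 1):
        print(initial, passes_until_first_eject(initial))
```

      Under these assumptions, every pass that frees nothing is time during which the DCP streams keep filling memory, which would be consistent with the TMP_OOM spike described above.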

      The desired behaviour is that the client application is basically unaffected by the rebalance. This could possibly be achieved in a number of ways. The following are merely suggestions to get the ball rolling:

      • Change the relative priority of the ItemPager and DCP Processor tasks (currently the processor is higher priority).
      • Run the item pager more aggressively - note it is not currently triggered by SET_WITH_META (which DCP consumer uses).
      • Initialise the items on the incoming node with a different LRU value that allows them to be ejected on first pass of the item pager.
      • Incorporate a more sophisticated throttle / backoff on the DCP stream when the HWM is reached so that frontend client ops have greater priority.
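
      As a rough illustration of the last suggestion, the sketch below compares an unthrottled DCP stream with one that only consumes headroom below the HWM. It is a toy tick-based model under stated assumptions, not the server's scheduler: the function name, the memory units, the rates, and the ordering within a tick (DCP first, then the frontend write, then the pager) are all hypothetical.

```python
def simulate(ticks, quota, hwm, dcp_rate, frontend_rate, eject_rate, throttle_dcp):
    """Toy model of the incoming node. Returns (frontend_tmp_ooms, dcp_ingested)."""
    mem = 0
    tmp_ooms = 0
    dcp_ingested = 0
    for _ in range(ticks):
        # DCP runs first, mirroring its higher task priority in the report.
        limit = hwm if throttle_dcp else quota
        dcp = min(dcp_rate, max(0, limit - mem))
        mem += dcp
        dcp_ingested += dcp

        # A frontend write fails with TMP_OOM if it would exceed the quota.
        if mem + frontend_rate > quota:
            tmp_ooms += 1
        else:
            mem += frontend_rate

        # The pager frees a fixed amount per tick once memory is above the HWM.
        if mem > hwm:
            mem -= eject_rate
    return tmp_ooms, dcp_ingested


if __name__ == "__main__":
    params = dict(ticks=20, quota=100, hwm=85, dcp_rate=30,
                  frontend_rate=5, eject_rate=20)
    print("unthrottled:", simulate(throttle_dcp=False, **params))
    print("throttled:  ", simulate(throttle_dcp=True, **params))
```

      In this model the throttled stream ingests less data per tick, so the rebalance would take longer, but the gap between the HWM and the quota stays available for frontend writes. Whether that trade-off is acceptable in practice would need measuring on a real cluster.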


        1. after.png
          436 kB
        2. before.png
          81 kB




              bharath.gp Bharath G P
              dhaikney David Haikney (Inactive)