Couchbase Server / MB-11782

Adding Nodes To A Cluster Can Result In Reduced Active Residency Percentages

Details

    Description

      A customer added 6 nodes to a large cluster in an attempt to increase the overall percentage of active bucket data in cache, and observed that active bucket residency decreased after rebalancing. This decrease in active data residency after adding nodes and rebalancing turns out to be reproducible.

      To reproduce this anomaly, create an 8-node cluster with a RAM quota of 100 MB per node and populate the default bucket until the active resident percentage is about 40%. (I used cbworkloadgen to insert 300K items into the default bucket, specifying an item size of 2 KB and enabling the -j (JSON) option.) Add 3 nodes to this cluster and rebalance. The default bucket's active residency percentage will drop significantly and its replica residency percentage will increase. Note that if 3 random nodes are then removed and the cluster is rebalanced, and the nodes are then added back and the cluster is rebalanced again, active residency will increase beyond the initial level.
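
To make the sizing in the reproduction steps concrete, here is a rough back-of-the-envelope sketch. It counts item values only and ignores keys, metadata, and per-item overhead, so it will not match the residency figures shown in the UI. The replica count of 1 is an assumption (the report does not state it), and the function name is purely illustrative.

```python
def bucket_pressure(nodes, quota_mb_per_node, items, item_bytes, replicas=1):
    """Estimate how far the bucket's data outgrows its total RAM quota.

    Returns (total_quota_mb, active_data_mb, total_data_mb, data_to_quota_ratio).
    Sizes count item values only; replicas=1 is an assumed setting.
    """
    mb = 1024 * 1024
    total_quota_mb = nodes * quota_mb_per_node
    active_data_mb = items * item_bytes / mb
    total_data_mb = active_data_mb * (1 + replicas)
    return total_quota_mb, active_data_mb, total_data_mb, total_data_mb / total_quota_mb


# Figures from the reproduction steps: 8 nodes, 100 MB each, 300K items of 2 KB.
quota, active, total, ratio = bucket_pressure(8, 100, 300_000, 2048)
print(f"quota={quota} MB, active={active:.1f} MB, total={total:.1f} MB, ratio={ratio:.2f}")

# The same data set after adding 3 nodes.
quota_after, _, _, ratio_after = bucket_pressure(11, 100, 300_000, 2048)
print(f"after adding 3 nodes: quota={quota_after} MB, ratio={ratio_after:.2f}")
```

Under these assumptions the data set is larger than the combined quota both before and after the nodes are added, which is the precondition described in the next paragraph.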

      The critical factor in reproducing this anomaly is that the bucket data size must exceed its RAM quota, so that the majority of bucket data resides on disk at any given time. When nodes are added to the cluster, the subsequent rebalance causes entire vbuckets to be read from disk on one node and loaded into cache on the receiving node via the TAP protocol. Eventually the node's high-water mark is exceeded and ejections occur. What is consistently observable is that active ejections occur at a greater rate than replica ejections, which results in a decreased active residency percentage and an increased replica residency percentage.
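
The effect described above can be illustrated with a deliberately simplified toy model. This is not how the server's ejection logic is implemented. It only shows that when a larger share of ejected data comes from active items, the active portion of what stays resident shrinks. The function name, the `active_share` knob, and all the numbers are hypothetical.

```python
def eject_to_low_water(active_mb, replica_mb, low_water_mb, active_share):
    """Toy model: eject resident data until active + replica <= low_water_mb.

    active_share is the fraction of ejected data taken from active items
    (a hypothetical knob for illustration, not a real server setting).
    Returns the (active_mb, replica_mb) left resident after ejection.
    """
    excess = max(0.0, active_mb + replica_mb - low_water_mb)
    from_active = min(active_mb, excess * active_share)
    from_replica = min(replica_mb, excess - from_active)
    # If replica data ran out first, the remainder comes from active data.
    from_active = min(active_mb, excess - from_replica)
    return active_mb - from_active, replica_mb - from_replica


# Hypothetical node: 60 MB active + 60 MB replica resident after rebalance,
# ejecting down to a 75 MB low-water mark.
for share in (0.4, 0.6):
    active, replica = eject_to_low_water(60, 60, 75, share)
    resident_active_pct = 100 * active / (active + replica)
    print(f"active_share={share}: active={active:.1f} MB, "
          f"replica={replica:.1f} MB, active share of cache={resident_active_pct:.1f}%")
```

In this model, any `active_share` above 0.5 leaves less active than replica data in cache, which matches the direction of the change reported here.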

      Possible workarounds include adding and rebalancing nodes in stages; e.g., instead of adding 6 nodes to a cluster at once, add 3 nodes and rebalance, then add 3 more nodes and rebalance again. A second potential workaround would be to alter the default ejection probabilities for replica and active data, reducing the probability of ejecting active data and increasing the probability of ejecting replica data. I have not had time to test these possible workarounds.
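
The staged approach can be sketched as a small planning helper. This is only an illustration of the batching idea. The hostnames are made up, and the add-node and rebalance steps are printed rather than executed, because the untested workaround does not prescribe specific commands.

```python
def plan_stages(new_nodes, batch_size):
    """Split the nodes to be added into batches, one rebalance per batch."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [new_nodes[i:i + batch_size] for i in range(0, len(new_nodes), batch_size)]


# Hypothetical hostnames for the 6 nodes being added.
nodes = [f"node{n}" for n in range(9, 15)]
for stage, batch in enumerate(plan_stages(nodes, 3), start=1):
    print(f"stage {stage}: add {', '.join(batch)}, then rebalance")
```

Smaller batches mean more rebalances in total, so the batch size trades rebalance time against how much data lands on the receiving nodes at once.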

      After discussion in the Support group, our thinking is that any configuration change made with the intention of improving performance should not result in worse performance, but that is what can happen in this case. Accordingly, we believe that this is a bug and that the rebalancing algorithm should be examined to figure out why, under certain circumstances, rebalancing makes active data more likely to be ejected.

            People

              dhaikney David Haikney (Inactive)
              morrie Morrie Schreibman (Inactive)