Couchbase Server / MB-25207

Explicitly control disk write amplification


Details

    • Type: Improvement
    • Resolution: Unresolved
    • Priority: Major
    • Fix Version/s: Morpheus
    • Affects Version/s: 5.0.0
    • Component/s: couchbase-bucket

    Description

      Summary

      As users move to faster and faster disk subsystems (SSD, NVMe) and wider machines (more threads for writing), we are finding that the existing mechanisms in KV-Engine to control write batch size no longer work efficiently. We should look at ways to give users more direct, explicit control over the write amplification they see, by allowing them to choose between write amplification and persistTo times.

      The proposal is to add a new bucket setting - Minimum Persist Latency. This specifies the minimum duration a user wants to wait before an item is persisted to disk.

      This value is per-bucket and dynamically tuneable, allowing users to experiment to find the best tradeoff between persistTo times (how many outstanding changes they are willing to lose in the event of a node crash without failover) and write amplification.

      Background

      As a motivating example, see some of the real-world data from MB-24692 - the write batch sizes seen for an AWS instance with NVMe storage:

      rw_1:bulkSize (797062755 total)
          1 - 2         : ( 99.00%) 789081709 #####################################
          2 - 4         : ( 99.80%)   6356146 
          4 - 8         : ( 99.94%)   1122551 
          8 - 16        : ( 99.97%)    255458 
          16 - 32       : ( 99.98%)     94501 
          32 - 64       : ( 99.99%)     45240 
          64 - 128      : ( 99.99%)     42296 
          128 - 256     : ( 99.99%)     23145 
          Avg           : (      1)
      

      The average and p95 batch sizes are only one item; p99 is only 2 and p99.99 is just 64. As a consequence, the overhead of performing a disk commit (writing a new B-tree root + intermediate nodes) is very high compared to the user data written. This leads to high disk write amplification, the consequence of which is disks wearing out faster than desired, which has a real monetary cost for users. For example, one customer in AWS is forced to perform a complete cluster rebalance periodically to "rebalance out" nodes whose disks have become too worn.
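
      To make the effect concrete with purely illustrative numbers (hypothetical, not measurements from MB-24692): if each commit rewrites on the order of 20 KB of B-tree root and intermediate nodes, and a typical batch contains a single ~200 byte item, the amplification is roughly (20 KB + 200 B) / 200 B, i.e. around 100x. The same 20 KB of commit overhead amortised over a 100-item batch would bring it down to roughly 2x.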

      TODO: Add more specific write amplification figures.

      Current Implementation

      The reason for the current behaviour is that the Flusher tasks operate in a "greedy" fashion: they check all vBuckets in a round-robin fashion, and will perform a flush as soon as a non-zero number of updates is outstanding for any vBucket.

      Batching is achieved implicitly, by virtue of a given Flusher being responsible for multiple vBuckets and working through them serially in sequence - the user has no way to directly select the batching they desire.

      The current high-level algorithm for the Flusher is:

      while (true) {
          for (vbucket : vBucketsInThisShard) {
              // Greedy: flush as soon as any updates are outstanding,
              // however few.
              if (vbucket.numOutstandingUpdates() > 0) {
                  flushToDiskAndFSync(vbucket);
              }
          }
          if (shard.noOutstandingUpdates()) {
              snoozeFlusherTaskUntilNewUpdate();
          }
      }
      

      Assuming all vBuckets start out with 1 outstanding update each, batching is expected to occur because the Flusher processes vBuckets sequentially. While the 1st vBucket is checked and flushed, incoming front-end updates accumulate on the other vBuckets. By the time the 1st vBucket completes, there are, say, 2 updates waiting for the 2nd vBucket, and hence it will take longer (but probably not 2x) to flush those 2 updates.

      This will continue for all the vBuckets, until we are back at the start and the 1st vBucket has accumulated >1 updates, and hence updates are batched.

      The issue with this implementation is that the batching is essentially a second-order effect - it is a function of (a) how quickly updates occur on the front-end vs (b) how quickly we can flush those updates to disk.

      A change in either of those factors will change the batch sizes (sometimes significantly) - increasing front-end workload (a busy period) will increase batching, but lower workload (a quiet period, or rebalancing in new nodes to support a larger dataset) will decrease batching.
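
      In steady state the relationship can be approximated (this relation is an illustration, not from the ticket) as:

          batch size per vBucket ≈ front-end update rate per vBucket × time for one Flusher pass over its vBuckets

      so halving the flush pass time (faster disks, more writer threads) roughly halves the batch size, while doubling the front-end update rate roughly doubles it.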

      Similarly, adding faster, newer disks will increase the speed of the Flusher (and decrease batching), whereas reducing the IO capacity (fewer writer threads, slower disks) will increase batching.

      Not only are some of these effects counter-intuitive, but they are not easily controllable by the user - if front-end operations decrease on a cluster (say after adding more nodes to handle an increasing dataset size), it is not easy for the administrator to deal with the decreased batching. Even if they change the number of writer threads, that is a pretty blunt instrument, and it is not very helpful for a workload that varies over a daily cycle.

      Proposal

      We add a new bucket setting - Minimum Persist Latency[1]. This specifies the minimum duration a user wants to wait before an item is persisted to disk. A user can select a non-zero value (e.g. 1ms) to inform us that they are willing to wait for updates to accumulate before they are written to disk in a larger batch. This results in reduced write amplification.

      This value is per-bucket and dynamically tuneable, allowing users to experiment to find the best tradeoff between persistTo times (how many outstanding changes they are willing to lose in the event of a node crash without failover) and write amplification.

      This could be implemented by adjusting the high-level algorithm above to timestamp when the first update is added to a vBucket (i.e. when the number of outstanding updates moves from 0 -> 1). When checking if a vBucket is to be flushed, we additionally check whether the specified min_persist_latency has elapsed since that timestamp - see the sketch below.
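
      A minimal sketch of how the Flusher loop above might change, in the same pseudocode style. Names such as firstUpdateTime(), minPersistLatency() and snoozeUntilEarliestDeadline() are illustrative only, not existing KV-Engine APIs:

      while (true) {
          for (vbucket : vBucketsInThisShard) {
              if (vbucket.numOutstandingUpdates() == 0) {
                  continue;
              }
              // firstUpdateTime() is recorded when the vBucket's outstanding
              // update count moves from 0 -> 1.
              if (now() - vbucket.firstUpdateTime() >= bucket.minPersistLatency()) {
                  // Deadline reached - flush whatever has accumulated.
                  flushToDiskAndFSync(vbucket);
              }
              // Otherwise skip this vBucket for now, allowing further updates
              // to accumulate into a larger (lower write-amplification) batch.
          }
          if (shard.noOutstandingUpdates()) {
              snoozeFlusherTaskUntilNewUpdate();
          } else {
              // Some vBuckets still have unflushed updates whose deadline has
              // not yet expired - sleep until the earliest deadline.
              snoozeUntilEarliestDeadline();
          }
      }

      With a minimum persist latency of 0 this degenerates to the current greedy behaviour, so a zero default would leave existing buckets unaffected.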

      ([1]: Q: Why not specify the minimum batch size directly? A: The problem with specifying a min batch size >1 is that if a vBucket receives an update and then doesn't receive a second update for an extended period (e.g. 1s), the first update will not be persisted until the batch "fills up". This is essentially the same problem that TCP_NODELAY exists to solve - for latency-sensitive applications a deadline is needed, after which the work is done irrespective of whether the batch is full.)


People

    • Assignee: Daniel Owen (owend)
    • Reporter: Dave Rigby (drigby) (Inactive)
    • Votes: 0
    • Watchers: 7

