Couchbase Server - MB-50468

Utilise sync_file_range() for couchstore periodic fsync to improve throughput


Details

    • Type: Improvement
    • Resolution: Unresolved
    • Priority: Major
    • Fix Version/s: Morpheus
    • Affects Version/s: 7.1.0
    • Component/s: couchbase-bucket
    • Labels: None

    Description

      Background

      As seen in MB-50389, reducing the interval between periodic disk syncs during compaction yields a significant improvement in tail latencies of disk write (and read) operations - in that MB, reducing the interval from 16MB to 1MB resulted in over a 7x reduction in p99.9 write latencies.

      That MB also showed that even more frequent fsyncs (256KB and 64KB intervals were tested) continued to improve tail latency, albeit by a smaller magnitude than the 16MB -> 1MB change:

      http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=ares_710-1885_access_17ce&snapshot=ares_710-2021_access_c53e&snapshot=ares_710-2021_access_015d&snapshot=ares_710-2021_access_29ee&snapshot=ares_710-2021_access_5e4e&snapshot=ares_710-2021_access_53a5&label=7.1.0-1884%20(compactors:1)&label=7.1.0-2021%20(compactors:4)&label=fsync:4MB&label=fsync:1MB&label=fsync:256KB&label=fsync:64KB

      However, reducing the sync interval to 256KB starts to have a non-negligible impact on compaction throughput - again from the same MB, we saw the following runtimes to compact a given bucket at different sync intervals:

      Sync interval    Compaction runtime (s)    Runtime vs 16MB
      16MB (default)   88.2                      1.00x
      1MB              91.6                      1.03x
      256KB            102.3                     1.16x

      As such, the current plan for MB-50389 is to reduce the fsync interval down to 1MB, but no lower, to maintain compaction throughput.

      fdatasync

      Recall that the main purpose of issuing periodic fdatasync is to avoid large amounts of modified buffer cache pages accumulating and then eventually getting flushed to the medium all at once, resulting in a large queue of outstanding requests and hence interfering with the (typically) smaller but latency-sensitive reads/writes needed by front-end operations (BGFetches / SyncWrite flushes).

      fdatasync achieves that, but not without a certain amount of overhead - fdatasync also performs the following operations, which we don't actually need and which can be costly (a sketch of the current pattern is shown after the list):

      • Blocks waiting for the outstanding writes to be flushed from the filesystem down to the disk
      • Blocks waiting for the disk's write cache to be flushed to the medium
      • Blocks waiting for any relevant file metadata to be flushed to the medium
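
      For illustration, a minimal sketch of the current approach - hypothetical names, not the actual Couchstore code:

      #include <unistd.h>

      #define SYNC_INTERVAL (1024 * 1024) /* 1MB, per the plan in MB-50389 */

      /* Called after each append to the compacted file; issues a blocking
       * fdatasync() once another SYNC_INTERVAL bytes have been written. */
      static int maybe_sync(int fd, size_t just_written, size_t* bytes_since_sync) {
          *bytes_since_sync += just_written;
          if (*bytes_since_sync >= SYNC_INTERVAL) {
              /* Blocks until dirty pages, relevant file metadata and the
               * disk's write cache are all flushed - more than we need. */
              if (fdatasync(fd) != 0) {
                  return -1;
              }
              *bytes_since_sync = 0;
          }
          return 0;
      }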

      All we really want to do is avoid the accumulation of modified buffer cache pages past some "reasonable" value, so that we see a "slow and steady" stream of writes to disk during compaction - enter sync_file_range...

      sync_file_range

      The Linux kernel supports an additional API to sync a file's state to disk - see the sync_file_range man page (the signature and flags are shown after the list). This allows more fine-grained control over how a file is flushed to disk, compared to the currently-used fdatasync:

      1. Only a subset of the file (offset + size) can be flushed
      2. Does not require the file's metadata to be written to disk
      3. Does not require the medium's write cache to be flushed
      4. Allows specifying whether the call is synchronous (blocking) or asynchronous (non-blocking)

      (1) is of little interest to us given Couchstore always appends to files; however (2), (3) and (4) are of interest:

      • We don't need metadata to be consistent until the entire file is written - syncing it every 1MB (or less) is pointless
      • We don't need the disk's write cache to be flushed (assuming the disk can handle having cached writes outstanding and still service other reads/writes)
      • We don't need to actually block waiting for modified pages to be written; we just want to initiate their write

      In theory we could recover some of the throughput lost by reducing the sync interval by using sync_file_range instead - the expectation being that if we can avoid the blocking wait on the sync, then we can pipeline userspace building of the compacted file with kernel-space writing of that data out, and hopefully maintain similar compaction throughput to before.

      We might even see better throughput (at the same sync interval), given we are not forcing writes down to the medium and are letting the disk manage that itself; additionally, we are not updating the filesystem metadata on every sync.
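
      A minimal sketch of how the periodic sync could look with sync_file_range - hypothetical names, assuming we track the start offset of each 1MB chunk as it is completed:

      #define _GNU_SOURCE
      #include <fcntl.h>
      #include <sys/types.h>

      #define SYNC_INTERVAL (1024 * 1024) /* 1MB */

      /* Called once each time another SYNC_INTERVAL bytes have been
       * appended to the compacted file, starting at offset chunk_start. */
      static int periodic_sync(int fd, off64_t chunk_start) {
          /* Initiate (but don't wait for) writeback of the just-completed
           * chunk; compaction continues building the next chunk in
           * userspace while the kernel writes this one out. Neither file
           * metadata nor the disk's write cache is touched. */
          return sync_file_range(fd, chunk_start, SYNC_INTERVAL,
                                 SYNC_FILE_RANGE_WRITE);
      }

      Note that sync_file_range provides no durability guarantees, so a final fdatasync() (as today) would still be required once the compacted file is complete; and if dirty pages were still found to accumulate, additionally passing SYNC_FILE_RANGE_WAIT_BEFORE would bound the amount of outstanding writeback, at the cost of re-introducing some blocking.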

      For reference, a number of other DBs use sync_file_range in this manner, defaulting to syncing every 1MB written:

People

  Assignee: owend Daniel Owen
  Reporter: drigby Dave Rigby (Inactive)