  1. Couchbase Server
  2. MB-27327

[RocksDB] Space amplification caused by many stale WAL files

Details

    • Bug
    • Resolution: Fixed
    • Major
    • master
    • master
    • couchbase-bucket
    • None
    • Untriaged
    • Unknown

    Description

      Recent tests of EP-Engine on RocksDB (carried out on the Mancouch server) showed that setting the same Memtable (MT) size for all three ColumnFamilies (Default, Seqno and Local) causes significant Space Amplification.

      The DefaultCF stores the Key-Value pairs, the SeqnoCF stores the Seqno-Key pairs and the LocalCF stores only the VBState. Each CF therefore holds a very different amount of data: the SeqnoCF is tiny compared to the DefaultCF, and the LocalCF is tiny compared to the SeqnoCF (DefaultCF > SeqnoCF > LocalCF). Currently we size the Default MT to a "good" value and then set the same value for the other CFs.

      As described at https://github.com/facebook/rocksdb/wiki/Column-Families:

      The main idea behind Column Families is that they share the write-ahead log and don't share memtables and table files. By sharing write-ahead logs we get awesome benefit of atomic writes. By separating memtables and table files, we are able to configure column families independently and delete them quickly.
      Every time a single Column Family is flushed, we create a new WAL (write-ahead log). All new writes to all Column Families go to the new WAL. However, we still can't delete the old WAL since it contains live data from other Column Families. We can delete the old WAL only when all Column Families have been flushed and all data contained in that WAL persisted in table files. This created some interesting implementation details and will create interesting tuning requirements. Make sure to tune your RocksDB such that all column families are regularly flushed.

      We currently allow the minimum number of MTs (2 MTs: 1 live and 1 closed) for each CF. When the live MT fills up it is closed, and the closed MT is then flushed. So, setting the Seqno and Local MTs to the same size as the Default one causes the following under a typical load:
      1) The Default live MT is quickly filled up and closed, and a new live MT is created.
      2) The live WAL file is closed and a new live WAL is created.
      3) The closed WAL file cannot be deleted because the Seqno and Local live MTs have not yet filled up, so they have been neither closed nor flushed.

      The same repeats until both the Seqno and Local live MTs are filled, which may take a long time for the Local one in particular.
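
The scenario above can be sketched with a small simulation. This is a simplified model and not RocksDB code: flushes are treated as instantaneous, each flush of a single CF rolls the shared WAL, and every WAL from the oldest one still needed by any CF up to the live one is retained. The per-batch byte rates, the memtable limits and all names (`ColumnFamily`, `peak_retained_wals`, `make_cfs`) are hypothetical and chosen only to reflect DefaultCF > SeqnoCF > LocalCF. The model counts WAL files rather than bytes.

```python
from dataclasses import dataclass


@dataclass
class ColumnFamily:
    name: str
    memtable_limit: int   # bytes at which the live memtable is closed and flushed
    bytes_per_batch: int  # bytes written to this CF per batch (hypothetical rate)
    memtable_bytes: int = 0
    oldest_wal: int = 0   # oldest WAL that still holds unflushed data for this CF


def peak_retained_wals(cfs, batches):
    """Return the largest number of WAL files that had to be kept at once."""
    live_wal = 0
    peak = 1
    for _ in range(batches):
        for cf in cfs:
            cf.memtable_bytes += cf.bytes_per_batch
        for cf in cfs:
            if cf.memtable_bytes >= cf.memtable_limit:
                # Flushing one CF rolls the shared WAL (flush modelled as instant).
                live_wal += 1
                cf.memtable_bytes = 0
                cf.oldest_wal = live_wal
        # Every WAL from the oldest one still needed up to the live one is kept.
        retained = live_wal - min(cf.oldest_wal for cf in cfs) + 1
        peak = max(peak, retained)
    return peak


def make_cfs(default_limit, seqno_limit, local_limit):
    # Per-batch byte rates are illustrative only: Default > Seqno > Local.
    return [
        ColumnFamily("default", default_limit, 100),
        ColumnFamily("seqno", seqno_limit, 10),
        ColumnFamily("local", local_limit, 1),
    ]


equal = peak_retained_wals(make_cfs(1000, 1000, 1000), batches=1000)
proportional = peak_retained_wals(make_cfs(1000, 100, 10), batches=1000)
print(f"equal MT sizes: peak {equal} WAL files")
print(f"proportional MT sizes: peak {proportional} WAL files")
```

With these made-up rates, the equally sized configuration keeps accumulating WAL files until the Local MT finally fills, while the proportionally sized one flushes all CFs at the same cadence and keeps only a handful.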

      Test results are summarized in the attached 'WAL-size-analysis.png'. They show that:
      #1 - The Space Amplification is significant (the final WAL size is 2x the SST size, and the WAL size spikes to more than 3x the SST size).
      #2 - Setting only the Local MT size to the minimum (64KB) helps reduce the WAL-size spikes.
      #3 - Setting only the Local MT size to the minimum and flushing it at every BatchWrite helps reduce the final WAL size.
      #4 - Sizing the Default, Seqno and Local MTs in proportion to what each CF is expected to store, and flushing the Local MT, gives the best result.
      #5 - Sizing all the MTs proportionally but not flushing the Local MT causes the WAL size to spike again. This happens because the Local MT is still too large (64KB) compared to what the Local CF actually stores (only a few bytes for the VBState).
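
As a rough illustration of the proportional sizing in #4 and of why #5 still spikes, the sketch below splits a memtable budget across the CFs according to how much data each is expected to hold, clamped to the 64KB minimum mentioned above. The function name, the budget and the expected-data figures are assumptions made for the example; they are not values from the tests.

```python
MIN_MEMTABLE_BYTES = 64 * 1024  # minimum MT size quoted in this issue


def proportional_memtable_sizes(budget_bytes, expected_bytes):
    """Split budget_bytes across CFs in proportion to expected_bytes.

    expected_bytes maps a CF name to the relative amount of data it is
    expected to store. Sizes below the minimum are raised to the minimum.
    """
    total = sum(expected_bytes.values())
    if total <= 0:
        raise ValueError("expected_bytes must contain a positive total")
    return {
        name: max(MIN_MEMTABLE_BYTES, budget_bytes * share // total)
        for name, share in expected_bytes.items()
    }


# Hypothetical workload: values are relative weights, not measurements.
sizes = proportional_memtable_sizes(
    budget_bytes=64 * 1024 * 1024,
    expected_bytes={"default": 1000, "seqno": 50, "local": 1},
)
for name, size in sizes.items():
    print(f"{name}: {size} bytes")
```

In this example the Local CF's proportional share falls below 64KB, so it is clamped to the minimum. That clamp is the situation described in #5: the Local MT cannot be made small enough to fill at the same pace as the others, so without an explicit flush it still pins old WAL files.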

      To address this issue, we set the size of the Local MT to the minimum (MB-27105) and made the sizes of the Default and Seqno MTs configurable (MB-27175).
      The third and last step would be flushing the Local MT. To do that, we could implement a periodic flusher or simply flush at every BatchWrite.
      However, that would trigger compaction more often and increase write amplification. Also, some Perf Tests under Universal Compaction showed that compaction is not always triggered when 'level0_file_num_compaction_trigger' is reached (MB-27308), causing many tiny L0 files to accumulate and, consequently, Write Stalling.
      Thus, we decided to merge the Local CF into the Seqno CF (MB-27326). This way, we do not need to implement any flush of the Local MT and we prevent Write Stalling. Also, having only two ColumnFamilies simplifies the RocksDB configuration and tuning.


            People

              paolo.cocchi Paolo Cocchi