  Couchbase Server / MB-27420

CouchRocks: Investigate using block sizes larger than 4K


Details

    • Type: Task
    • Resolution: Done
    • Priority: Major
    • Affects Version/s: master
    • Fix Version/s: master
    • Component/s: couchbase-bucket
    • Labels: None

    Description

      There is evidence that the default 4K data block size for RocksDB SST files is sub-optimal, and we should investigate increasing it to improve both our space efficiency and our buffer cache usage.

      Background

      RocksDB's SST files are block-based: RocksDB assembles one block's worth of data, compresses it, and then writes it to the SST file - see https://github.com/facebook/rocksdb/wiki/Rocksdb-BlockBasedTable-Format. There is a trade-off in selecting the data block size - larger data block sizes:

      1. Give the compression algorithm a larger corpus to work with (and hence potentially allow a greater compression ratio).
      2. Reduce the size of the index blocks (the index block contains one element per data block), which reduces the memory cost of searching SST files.
      3. Increase the minimum amount of data which must be read from disk - RocksDB always reads whole data blocks from disk.
      4. Increase the buffer cache footprint - since a complete block is read from disk, the buffer cache must hold the whole block even if only a subset of it was needed.

      Smaller block sizes essentially have the inverse trade-offs.
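
      To make trade-offs 2 and 3 concrete, here is a small sketch comparing 4K and 8K blocks. The 64 MiB SST data size is an illustrative assumption, not a Couchbase setting:

```python
# Sketch: effect of data block size on index size and minimum read unit.
# The 64 MiB SST data size is an illustrative assumption.
SST_DATA_SIZE = 64 * 1024 * 1024

for block_size in (4 * 1024, 8 * 1024):
    num_blocks = SST_DATA_SIZE // block_size
    # The index block holds one entry per data block (trade-off 2),
    # and every read must fetch at least one whole block (trade-off 3).
    print(f"block_size={block_size}: {num_blocks} data blocks "
          f"=> {num_blocks} index entries, min read {block_size}B")
```

      Doubling the block size halves the index entries (and hence index memory) but doubles the smallest possible read.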

      Given that the OS / filesystem has its own page size, there is essentially a lower bound on how small we want the data blocks to be when written to disk - i.e. post-compression. If the post-compression data block size is smaller than the OS page size, then we are essentially wasting that space in the buffer cache.

      As such, we (in theory) want to ensure that our post-compression data blocks are a minimum of 4K. The above is all based on comments by Mark Callaghan on tuning RocksDB when comparing it with ForestDB:

      set block_size to 8kb. The default is 4kb before compression and I don't want to do ~2kb disk reads. Nor do I want a block with only 4 1kb docs as that means the block index is too large.
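
      Callaghan's reasoning can be turned into simple arithmetic: for post-compression blocks to fill at least one 4K page, the pre-compression block_size needs to be at least page_size × compression_ratio. A sketch using the approximate ratios observed for this bucket (~1.26 for the seqno CFs, ~1.72 for the default CF):

```python
# Sketch: minimum pre-compression block size needed so that the
# post-compression block still fills a 4 KiB OS page.
PAGE_SIZE = 4096

# Approximate compression ratios from the measurements in this ticket.
for cf_name, ratio in (("local+seqno", 1.26), ("default", 1.72)):
    min_pre_compression = PAGE_SIZE * ratio
    print(f"{cf_name}: block_size >= {min_pre_compression:.0f}B")
```

      Both results exceed the 4K default, and the default-CF figure (~7KB) lands close to the 8kb value Callaghan suggests.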

      Note that we can calculate the compression ratio (and average block size) for SST file writes from the RocksDB LOG file:

      grep table_file_creation kv_rocks.log|cut -d ' ' -f 4- | jq -c '[.cf_name, (.table_properties.raw_key_size + .table_properties.raw_value_size) / .table_properties.data_size, .table_properties.data_size / .table_properties.num_data_blocks ]'
      

      This prints the CF name, compression ratio and average block size for every SST written. The first 10 files from http://perf.jenkins.couchbase.com/job/hera-pl/38/:

      ["local+seqno_18",1.2647498223259919,3024.95818815331]
      ["local+seqno_30",1.2644876339261668,3029.8229166666665]
      ["local+seqno_2",1.2645480273053071,3032.4285714285716]
      ["local+seqno_6",1.2654021167896579,3023.446366782007]
      ["local+seqno_22",1.2642028458614438,3028.543554006969]
      ["local+seqno_10",1.2648256637731112,3022.9545454545455]
      ["local+seqno_26",1.264179451150661,3030.2229965156794]
      ["local+seqno_14",1.264774626291398,3023.1423611111113]
      ["default_18",1.7220551898182483,2495.6929824561403]
      ["default_30",1.7235152429535299,2496.4398595259]
      

      If we look at the minimum and maximum average block size, we see that they range from 2467B to 3033B - well below 4K:

      $ grep table_file_creation kv_rocks.log|cut -d ' ' -f 4- | jq -c '.table_properties.data_size / .table_properties.num_data_blocks ' | sort -n | head -n1
      2467.423140556638
      $ grep table_file_creation kv_rocks.log|cut -d ' ' -f 4- | jq -c '.table_properties.data_size / .table_properties.num_data_blocks ' | sort -rn | head -n1
      3033.6881720430106
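
      Under the simplifying assumption above - that each post-compression block effectively occupies its own 4K page in the buffer cache - the observed averages imply substantial waste; a quick sketch:

```python
# Sketch: buffer cache space wasted per block if each post-compression
# block occupies a whole 4 KiB page (simplifying assumption).
PAGE_SIZE = 4096

# Minimum and maximum average block sizes observed above.
for avg_block in (2467.42, 3033.69):
    wasted = (PAGE_SIZE - avg_block) / PAGE_SIZE
    print(f"avg block {avg_block:.0f}B -> {wasted:.0%} of the page wasted")
```

      i.e. roughly 26-40% of each page would be wasted, supporting the case for a larger pre-compression block_size.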
      


          People

            Assignee: Paolo Cocchi
            Reporter: Dave Rigby (Inactive)

