Couchbase Server / MB-20052

ForestDB file size is larger than couchstore file size during backup with compression enabled


Details

    • Bug
    • Resolution: Unresolved
    • Major
    • bug-backlog
    • 4.5.0
    • forestdb
    • None
    • Untriaged
    • Unknown

    Description

      The backup client stores data from backups very similarly to how I think we will be storing data with ep-engine. I have one or more files with the vbuckets partitioned between them, and when there are multiple files each file is accessed by a single writer. When I store data, I store the key and value and encode the metadata into binary to keep it small. My metadata, for example, is always 32 bytes.
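      To illustrate what a fixed 32-byte binary metadata record can look like, here is a minimal sketch. The field layout (CAS, sequence number, revision sequence number, flags, expiry) is an assumption chosen only because it adds up to 32 bytes; the issue does not say which fields the backup client actually encodes, and the names encode_meta and decode_meta are hypothetical.

```python
import struct

# HYPOTHETICAL layout, not taken from the backup client:
#   cas (8 bytes) + seqno (8) + rev_seqno (8) + flags (4) + expiry (4) = 32 bytes
# ">" selects big-endian with standard sizes and no padding.
META_FORMAT = ">QQQII"
META_SIZE = struct.calcsize(META_FORMAT)


def encode_meta(cas, seqno, rev_seqno, flags, expiry):
    """Pack the metadata fields into a fixed-size binary record."""
    return struct.pack(META_FORMAT, cas, seqno, rev_seqno, flags, expiry)


def decode_meta(blob):
    """Unpack a record produced by encode_meta back into a tuple."""
    return struct.unpack(META_FORMAT, blob)


if __name__ == "__main__":
    record = encode_meta(cas=1467156201, seqno=3912, rev_seqno=1, flags=0, expiry=0)
    print(len(record), decode_meta(record))
```

      A fixed-width record like this costs the same number of bytes for every document, so metadata overhead alone scales linearly with the item count and does not depend on key or value size.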

      I loaded 1M items into the default bucket and also installed the travel-sample bucket. With compression enabled, the travel-sample backup is smaller than the data on the server: 24MB backed up vs. 109MB on the server. The default bucket, on the other hand, is 213MB backed up vs. 139MB on the server.

      After some investigation with Sundar, it appears that ForestDB is copying some data into the index (I presume the key), whereas Couchstore refers to the key from the index. I suspect this data duplication is the cause of the difference in file sizes. See below for why I think this is the case.

      Mikes-MacBook-Pro:default mikewied$ ~/couchbase/spock/install/bin/couch_dbinfo 0.couch.3
      DB Info (0.couch.3) - header at 147456
      file format version: 12
      update_seq: 3912
      purge_seq: 0
      crc: CRC-32C
      doc count: 978
      deleted doc count: 0
      data size: 140.1 kB
      B-tree size: 60.6 kB
      total disk size: 144.1 kB

      forestdb_dump /tmp/backup/comp/2016-06-28T16_23_21.108829854-07_00/default-2c3f46599d3d4906ca20baff5a9e7adc/data/shard_0.fdb --header-only

      1. live index nodes: 16419 (67252224 bytes)
        Total document size: 156032078 bytes, (index: 156032078 bytes, WAL: 0 bytes)

      KV store name: partition0

      1. documents in the main index: 978, 0deleted / in WAL: 0 (insert), 0 (remove)
      2. live index nodes: 17 (69632 bytes)
        Total document size: 152468 bytes
        Last sequence number: 3912

      The total size of the ForestDB file is around 213MB, and forestdb_dump shows that the data and index account for essentially all of it (156032078 bytes data + 67252224 bytes index = 223284302 bytes, or about 213MB at 2^20 bytes per MB). In the couchstore file for one vbucket, the reported data and index sizes add up to more than the file itself: 140.1 kB data + 60.6 kB index = 200.7 kB, yet the total size of the file is only 144.1 kB. There is clearly some overlap in the counting, and I suspect it is because the couchstore index holds references to the keys instead of copies.
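      The arithmetic in this comparison can be checked with a short script. This is only a sketch that reuses the numbers pasted from couch_dbinfo and forestdb_dump; the function names are made up for illustration, and treating 1MB as 2^20 bytes is an assumption that happens to match the 213MB figure.

```python
# Figures copied from the forestdb_dump output (bytes).
FDB_DOC_BYTES = 156_032_078
FDB_INDEX_BYTES = 67_252_224

# Figures copied from the couch_dbinfo output for 0.couch.3 (kB).
COUCH_DATA_KB = 140.1
COUCH_BTREE_KB = 60.6
COUCH_DISK_KB = 144.1


def fdb_total_mib(doc_bytes, index_bytes):
    """Data plus index size of the ForestDB file, in units of 2**20 bytes."""
    return (doc_bytes + index_bytes) / (1024 * 1024)


def couch_overlap_kb(data_kb, btree_kb, disk_kb):
    """Amount by which data + B-tree size exceeds the couchstore file size."""
    return round(data_kb + btree_kb - disk_kb, 1)


if __name__ == "__main__":
    print("ForestDB data + index (MiB):",
          round(fdb_total_mib(FDB_DOC_BYTES, FDB_INDEX_BYTES), 1))
    print("couchstore double-counted (kB):",
          couch_overlap_kb(COUCH_DATA_KB, COUCH_BTREE_KB, COUCH_DISK_KB))
```

      A positive overlap for couchstore means the data size and B-tree size cannot be disjoint regions of the file, which is consistent with (but does not prove) the key-reference explanation.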

      I have a data loader that can reproduce this setup; let me know if it is needed and I can push it somewhere. I think the same thing will happen no matter what the dataset is, as long as it contains 1M keys.

          People

            tai.tran Tai Tran (Inactive)
            mikew Mike Wiederhold [X] (Inactive)