Couchbase Server
MB-42306

Insert-only SyncWrite workload does not correctly trigger auto-compaction


    Details

    • Triage:
      Triaged
    • Flagged:
      Release Note
    • Story Points:
      1
    • Is this a Regression?:
      No

      Description

      Summary

      The couchstore fragmentation calculation does not take completed Prepares into account. As a result, auto-compaction is not run when expected, and hence the completed (no longer needed) prepares are not purged.
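
      For context, the trigger being discussed amounts to a threshold check against the fragmentation percentage which ns_server polls for. A minimal sketch (names are illustrative, not the actual ns_server/kv_engine code):

        // Illustrative sketch of the auto-compaction trigger check (not the
        // real ns_server implementation). If completed Prepares are counted
        // as "valid" data then dataSize is inflated, fragmentation is
        // under-reported, and this check never fires.
        #include <cstdint>

        bool shouldAutoCompact(uint64_t totalDiskSize,
                               uint64_t dataSize,
                               double thresholdPercent) { // e.g. 10.0
            if (totalDiskSize == 0) {
                return false;
            }
            const double fragmentation =
                    100.0 * static_cast<double>(totalDiskSize - dataSize) /
                    static_cast<double>(totalDiskSize);
            return fragmentation >= thresholdPercent;
        }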

      Details

      Using cbc-pillowfight to load 4M documents into each of two buckets; one using level=none, one with level=majority (same eviction policy, same compaction threshold of 10%):

      level=none

      cbc-pillowfight -U 127.0.0.1:9000/default -u Administrator -P asdasd --num-items=4000000 -m 4096 -M 4096 --random-body --populate-only --num-threads=10
      

      level=majority

      cbc-pillowfight -U 127.0.0.1:9000/majority -u Administrator -P asdasd --num-items=4000000 -m 4096 -M 4096 --random-body --populate-only --num-threads=20 --durability=majority
      

      (Note I also ran with a reduced vBucket count of 4, to make it easier to load a large number of documents (1M) per vBucket.)

      This results in the level=majority load using 2x the disk space:

      • level=none

        $ couch_dbinfo --local default/0.couch.1 
        DB Info (default/0.couch.1) - total disk size: 4.007 GB
           crc: CRC-32C
         
        Header at file offset 4302372864
           file format version: 13
           update_seq: 1000000
           purge_seq: 0
           timestamp: 1970-01-19T14:30:05.060290+01:00
           doc count: 1000000
           deleted doc count: 0
           data size: 3.885 GB
           B-tree size:       58.38 MB
           └── by-id tree:    27.38 MB
           └── by-seqno tree: 31.00 MB
           └── local size:    429 bytes
        

      • level=majority

        $ couch_dbinfo --local majority/0.couch.2 
        DB Info (majority/0.couch.2) - total disk size: 8.017 GB
           crc: CRC-32C
         
        Header at file offset 8608538624
           file format version: 13
           update_seq: 2000000
           purge_seq: 0
           timestamp: 1970-01-19T14:30:05.854591+01:00
           doc count: 2000000
           deleted doc count: 0
           data size: 7.784 GB
           B-tree size:       131.16 MB
           └── by-id tree:    62.12 MB
           └── by-seqno tree: 69.04 MB
           └── local size:    436 bytes
        

      (Note the doc count of 2000000 in the majority bucket - each of the 1M items is stored twice on disk, as a completed Prepare plus the Committed document.) Interestingly, the fragmentation percentage (measured as (couch_docs_actual_data_size - couch_docs_data_size) / couch_docs_actual_data_size) is around 3%. However, if compaction is run manually on the "majority" bucket (via the UI), the disk space shrinks to almost half:

      $ couch_dbinfo --local majority/0.couch.3 
      DB Info (majority/0.couch.3) - total disk size: 3.894 GB
         crc: CRC-32C
       
      Header at file offset 4181381120
         file format version: 13
         update_seq: 2000000
         purge_seq: 0
         timestamp: 1970-01-19T14:30:05.854591+01:00
         doc count: 1000000
         deleted doc count: 0
         data size: 3.894 GB
         B-tree size:       68.07 MB
         └── by-id tree:    33.96 MB
         └── by-seqno tree: 34.11 MB
         └── local size:    434 bytes
      

      What appears to be happening here is that the fragmentation calculation is incorrect - the on-disk Prepares (which have all been Committed) are not counted as "overhead", and are instead treated as "valid" documents. This means auto-compaction hasn't run when it would be expected to. When it does run, however, these prepares can all be discarded, hence the file size after compaction is similar to the level=none case.
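
      Plugging the per-vBucket figures above into that formula (as a stand-in for the bucket-wide stats) makes the mismatch concrete:

        reported fragmentation ≈ (8.017 GB - 7.784 GB) / 8.017 GB ≈ 2.9%
        actually reclaimed by compaction = 8.017 GB - 3.894 GB = 4.123 GB ≈ 51% of the file

      i.e. roughly half the file was garbage, but the stats reported ~3% because the ~1M completed prepares were counted as valid data.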


          Activity

          drigby Dave Rigby created issue -
          drigby Dave Rigby made changes -
          Field Original Value New Value
          Link This issue causes CBSE-9118 [ CBSE-9118 ]
          drigby Dave Rigby added a comment -

          The reason why this happens is that couch_disk_data_size (size of "valid" data on disk) is calculated directly from couchstore's own count of how much data is in the current B-Tree root.

          However, completed Prepares are still conceptually "valid" data from couchstore's POV - they are just documents with a different key prefix which happen to have a seqno below the high_completed_seqno. As such, couch_disk_data_size includes all prepares, outstanding and completed.

          Addressing this with the current file format is likely to be difficult - the obvious (but expensive) method to accurately measure the size of completed prepares with the current couchstore schema would be to perform a B-Tree seqno scan from 0 to the highCompletedSeqno, accumulating the size of all prepares found. However, that's an O(N) operation where N = the number of completed prepares, so it's not really suitable for the ~1s polling which ns_server makes.
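
          To make the cost concrete, the scan would look roughly like this (a sketch against the couchstore public API; isPrepareKey() is a hypothetical stand-in for kv_engine's key-namespace check, and the DocInfo size field name is an assumption which may vary between couchstore versions):

            // Sketch of the O(N) measurement dismissed above: walk the
            // by-seqno B-Tree from seqno 0 and sum the on-disk size of every
            // completed Prepare.
            #include <libcouchstore/couch_db.h>
            #include <cstdint>

            struct ScanCtx {
                uint64_t highCompletedSeqno; // read from persisted vbucket_state
                uint64_t completedPrepareBytes = 0;
            };

            // Hypothetical helper: true if the key is in the Prepare namespace.
            bool isPrepareKey(const sized_buf& key);

            static int accumulate(Db*, DocInfo* info, void* ctx) {
                auto* sc = static_cast<ScanCtx*>(ctx);
                // Prepares above the HCS are still outstanding; don't count them.
                if (info->db_seq <= sc->highCompletedSeqno &&
                    isPrepareKey(info->id)) {
                    sc->completedPrepareBytes += info->physical_size; // assumed field name
                }
                return 0; // continue iterating
            }

            // Usage: couchstore_changes_since(db, /*since*/0, /*options*/0,
            //                                 accumulate, &ctx);
            // O(N) in the number of items scanned - too slow for ~1s polling.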

          drigby Dave Rigby made changes -
          Triage Untriaged [ 10351 ] Triaged [ 10350 ]
          drigby Dave Rigby made changes -
          Is this a Regression? Unknown [ 10452 ] No [ 10451 ]
          drigby Dave Rigby made changes -
          Affects Version/s 6.5.1 [ 16622 ]
          Affects Version/s 6.6.0 [ 16787 ]
          drigby Dave Rigby added a comment -

          One possible approach would be to ignore pending prepares entirely, and simply assume that all prepares are completed. This is based on the observation that prepares have a maximum timeout of 65s before they are aborted, and most will be Committed much sooner than that.

          KV-Engine would then subtract this new "total_prepare_size" from couch_disk_data_size to give an approximation of the live data in the vBucket.

          Details

          • Add a new field to vbucket_state - onDiskPrepareBytes: Total number of bytes of all on-disk prepares.
          • This field should be updated as part of commit - in line with where the existing onDiskPrepares field is updated.
          • Use the value of onDiskPrepareBytes when calculating couch_disk_data_size, subtracting it from the raw value read from the underlying KVStore.
          • On compaction - if a completed prepare is purged, decrement onDiskPrepareBytes by its size.

          Should be relatively simple to implement - expanding the existing tracking of the count of Prepares to also track their size; the same logic / update points will be needed (see the sketch below).
          The upgrade path shouldn't pose too many problems - if onDiskPrepareBytes doesn't exist then create it as zero; when decrementing during compaction, ensure it is clamped at zero.
          This is only an estimate, as it assumes all prepares are completed. For datasets which have been in existence for more than a few minutes, the percentage of on-disk prepares which are incomplete should be small, so hopefully it will be a reasonably accurate estimate; however, in edge cases (just after compaction has finished, or a large influx of SyncWrites which have not yet completed) the error could be larger.
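
          In code terms, the proposed adjustment amounts to something like the following (an illustrative sketch, not the final kv_engine change; field names follow the bullets above):

            #include <algorithm>
            #include <cstdint>

            // Proposed additions to vbucket_state (simplified).
            struct VBucketState {
                uint64_t onDiskPrepares = 0;     // existing: count of on-disk prepares
                uint64_t onDiskPrepareBytes = 0; // new: total bytes of on-disk prepares
            };

            // couch_disk_data_size as reported upwards: the raw couchstore
            // data size minus the (assumed-completed) prepare bytes. std::min
            // clamps the result at zero, covering upgrade (field absent ->
            // zero) and any over-decrement during compaction purge.
            uint64_t adjustedDataSize(uint64_t rawCouchstoreDataSize,
                                      const VBucketState& vbs) {
                return rawCouchstoreDataSize -
                       std::min(rawCouchstoreDataSize, vbs.onDiskPrepareBytes);
            }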

          lynn.straus Lynn Straus made changes -
          Fix Version/s 6.6.1 [ 17002 ]
          drigby Dave Rigby made changes -
          Due Date 06/Nov/20
          drigby Dave Rigby made changes -
          Assignee Daniel Owen [ owend ] Dave Rigby [ drigby ]
          drigby Dave Rigby made changes -
          Status Open [ 1 ] In Progress [ 3 ]
          wayne Wayne Siu made changes -
          Labels approved-for-6.6.1
          wayne Wayne Siu made changes -
          Link This issue blocks MB-40528 [ MB-40528 ]
          build-team Couchbase Build Team added a comment -

          Build couchbase-server-6.6.1-9165 contains kv_engine commit e0b181e with commit message:
          MB-42306 [1/2]: Add onDiskPrepareBytes to vbucket_state

          build-team Couchbase Build Team added a comment -

          Build couchbase-server-6.6.1-9167 contains couchstore commit 7f8c9b2 with commit message:
          MB-42306: Correctly decode V3 CouchbaseRevMeta

          drigby Dave Rigby made changes -
          Fix Version/s 6.6.2 [ 17103 ]
          Fix Version/s Cheshire-Cat [ 15915 ]
          Resolution Fixed [ 1 ]
          Status In Progress [ 3 ] Resolved [ 5 ]
          build-team Couchbase Build Team added a comment -

          Build couchbase-server-6.6.1-9173 contains kv_engine commit 58937d7 with commit message:
          MB-42306 [2/2]: Bias db_data_size by estimate of completed prepares

          ashwin.govindarajulu Ashwin Govindarajulu added a comment -

          Validated the fix on 6.6.1-9173.

          Closing this ticket.

          ashwin.govindarajulu Ashwin Govindarajulu made changes -
          Assignee Dave Rigby [ drigby ] Ashwin Govindarajulu [ ashwin.govindarajulu ]
          Status Resolved [ 5 ] Closed [ 6 ]
          build-team Couchbase Build Team added a comment -

          Build couchbase-server-7.0.0-3841 contains kv_engine commit e0b181e with commit message:
          MB-42306 [1/2]: Add onDiskPrepareBytes to vbucket_state

          build-team Couchbase Build Team added a comment -

          Build couchbase-server-7.0.0-3854 contains couchstore commit 7f8c9b2 with commit message:
          MB-42306: Correctly decode V3 CouchbaseRevMeta

          build-team Couchbase Build Team added a comment -

          Build couchbase-server-7.0.0-3859 contains kv_engine commit 58937d7 with commit message:
          MB-42306 [2/2]: Bias db_data_size by estimate of completed prepares


            People

            Assignee:
            ashwin.govindarajulu Ashwin Govindarajulu
            Reporter:
            drigby Dave Rigby
            Votes:
            0
            Watchers:
            9

              Dates

              Due:
              06/Nov/20
              Created:
              Updated:
              Resolved:

                Gerrit Reviews

                There are no open Gerrit changes
