  1. Couchbase Server
  2. MB-38295

Improve Rift's resilience to known edge cases when opening and rolling back to a consistent point


Details

    • Improvement
    • Resolution: Done
    • Major
    • 6.6.0
    • 6.6.0
    • tools

    Description

      Rollback in cbbackupmgr is a complicated beast; Rift complicates the issue slightly by having two separate files in which it can store data. In essence, we must ensure the two files are consistent before writing any data to them. When Rift was first introduced, we were still using the 'DataBackup' struct for storage coordination; this meant that at the time of creation we didn't know whether we were opening or creating the Rift store (we had to make educated assumptions based on the existence/validity of the existing files). This is no longer the case: we are now using the 'VBucketBackupReader' / 'VBucketBackupWriter' structs for storage coordination, so we know ahead of time whether we are reading or writing data.

      What's the problem?
      Although completely code covered, we lack significant unit testing for the Rift consistency check and rollback (e.g. opening existing data stores and moving to a common point). We should utilise the fact that we now know whether or not we are in read only mode to improve Rift's resilience to known edge cases.

      What's going to change?
      1) We will pass an argument to 'NewRiftDB' which will inform it whether we are in read only mode. It will use this to aid in the coordination of the index/data store to better handle known edge cases.
      2) A significant amount of testing will be added to validate that Rift behaves as expected in these known edge cases.
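
      As a rough illustration of change 1, the sketch below shows how a read only flag could drive the open-time decision instead of guessing from which files exist. It is not the cbbackupmgr code: 'decideOpen', 'openAction' and the three actions are hypothetical names, the real 'NewRiftDB' signature is not given in this MB, and treating "only one file exists" as a rollback is an assumption.

```go
package main

import (
	"errors"
	"fmt"
)

// openAction is a hypothetical description of what to do when opening a store.
type openAction int

const (
	actionOpenExisting openAction = iota // both files exist (a consistency check would still follow)
	actionCreate                         // neither file exists; start a fresh store
	actionRollback                       // only one file exists; move back to a common point (assumed)
)

var errMissingStore = errors.New("index or data store is missing in read only mode")

// decideOpen picks an action from the caller's declared mode rather than
// inferring the mode from the files that happen to be on disk.
func decideOpen(indexExists, dataExists, readOnly bool) (openAction, error) {
	switch {
	case indexExists && dataExists:
		return actionOpenExisting, nil
	case readOnly:
		// A reader must never create or repair files, so fail fast.
		return 0, errMissingStore
	case !indexExists && !dataExists:
		return actionCreate, nil
	default:
		// A writer found only one of the two files, e.g. after being killed mid-create.
		return actionRollback, nil
	}
}

func main() {
	_, err := decideOpen(true, false, true)
	fmt.Println(err)
}
```

      With the mode passed in explicitly, a missing file is an error for a reader but a recoverable state for a writer; with only file existence to go on, the two cases cannot be told apart.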

      What are these "known edge cases"?
      I'm not going to list them all, partially because I don't know them all off the top of my head...

      We received a sigkill/sigterm or panic and:
      1) The data store is missing some of the reserved versioning space
      2) The index has a user version but is missing the SQLite table or vice versa
      3) We are opening in read only mode and the index or data store doesn't exist
      4) We are opening in write only mode but cbbackupmgr has been upgraded (we shouldn't allow resuming writing to a Rift file for which the versions mismatch). In cbbackupmgr we only support creating files of the latest version (but retain the ability to read multiple versions). Basically, we shouldn't allow users to create Rift stores which contain mixed Rift versions.

      There are several more which I have written unit tests for and which have since slipped my mind but, needless to say, there are plenty of scenarios which should be handled correctly.
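
      Edge cases 2 and 4 above boil down to two small checks, sketched here under stated assumptions. 'checkIndex', 'checkVersion', the error values and the 'latestVersion' value of 2 are all hypothetical; the MB does not say how Rift stores its version or what it does on a mismatch.

```go
package main

import (
	"errors"
	"fmt"
)

// latestVersion is a placeholder for the newest on-disk format version.
const latestVersion uint32 = 2

var (
	errIndexInconsistent = errors.New("index user version and table disagree")
	errVersionMismatch   = errors.New("cannot resume writing to a store of a different version")
)

// checkIndex covers edge case 2: the user version and the table must either
// both be present or both be absent.
func checkIndex(hasUserVersion, hasTable bool) error {
	if hasUserVersion != hasTable {
		return errIndexInconsistent
	}
	return nil
}

// checkVersion covers edge case 4: readers accept any known version, writers
// only the latest, so a single store never mixes versions.
func checkVersion(onDisk uint32, readOnly bool) error {
	if onDisk == 0 || onDisk > latestVersion {
		return fmt.Errorf("unsupported store version %d", onDisk)
	}
	if !readOnly && onDisk != latestVersion {
		return errVersionMismatch
	}
	return nil
}

func main() {
	fmt.Println(checkIndex(true, false))
	fmt.Println(checkVersion(1, true))
	fmt.Println(checkVersion(1, false))
}
```

      Checks of this shape are easy to cover with table-driven unit tests, one row per edge case.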

      How are we going to prove that we are handling these edge cases properly?
      I'm writing this MB retroactively; I've already written a significant amount of unit testing for all the different scenarios that I can think of. I might have gone slightly overboard, but the aim is to ensure that Rift behaves as we expect it to and that it will continue to do so during active development.

      Why pass an argument to 'NewRiftDB' and not create a RiftReader and a RiftWriter?
      Creating a separate RiftReader and RiftWriter was the first option that I looked into; however, it's impractical to do so without introducing huge amounts of code duplication. Although it doesn't seem like it from an outside perspective, a fairly significant amount of the behavior is common between reading and writing a Rift store. As a quick example, the 'GetInfo' function returns the items/mutations/deletions in the index; this exact functionality is required for both read and write (info and stats file creation).
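
      To make the duplication argument concrete, here is a minimal sketch of one type serving both modes. Only the name 'GetInfo' and the fact that it returns items/mutations/deletions come from this MB; the 'store' and 'info' types, the 'put' method and the way 'Items' is counted are assumptions for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// info holds the counts that 'GetInfo' is described as returning.
// In this sketch Items counts every record written; the real semantics may differ.
type info struct {
	Items, Mutations, Deletions uint64
}

var errReadOnly = errors.New("store was opened in read only mode")

// store is a hypothetical single type used for both reading and writing.
type store struct {
	readOnly bool
	counts   info
}

// GetInfo is needed by readers and writers alike, so it exists exactly once.
func (s *store) GetInfo() info { return s.counts }

// put records a mutation or a deletion; the mode check is the only part that
// differs between the two modes.
func (s *store) put(deletion bool) error {
	if s.readOnly {
		return errReadOnly
	}
	s.counts.Items++
	if deletion {
		s.counts.Deletions++
	} else {
		s.counts.Mutations++
	}
	return nil
}

func main() {
	w := &store{}
	_ = w.put(false)
	_ = w.put(true)
	fmt.Printf("%+v\n", w.GetInfo())

	r := &store{readOnly: true, counts: w.GetInfo()}
	fmt.Println(r.put(false))
}
```

      Splitting this into two types would mean either copying 'GetInfo' (and everything like it) or introducing a shared base, which is the duplication described above.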

      All this, and it's also easy to forget that we have two other storage formats (SQLite/ForestDB) which I don't want to retroactively modify heavily since they are fine as they are...

      Object Store
      Although "technically" unrelated to this MB, the changes that I have made/will make for this MB will significantly reduce the "per-vBucket" performance penalty that I've seen during testing. Being able to definitively say that we are in read only/write only mode means we can dramatically reduce the number of requests we are making to AWS. For example, when opening an in-progress data store, we don't need to check if it exists in the cloud (we know it won't be there); we only need to check for an upload manifest.
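
      The saving can be pictured with a toy client that counts requests. Everything here is hypothetical ('countingClient', 'probeRemote', the key names), and no real AWS SDK call is used; the sketch only shows that knowing the store is in progress removes one existence check per store.

```go
package main

import "fmt"

// countingClient is a hypothetical stand-in for a cloud client; it only
// counts how many existence checks were issued.
type countingClient struct{ requests int }

func (c *countingClient) exists(key string) bool {
	c.requests++
	return false // nothing has been uploaded in this sketch
}

// probeRemote checks the remote state of one store. When the caller already
// knows the store is in progress, the data store cannot be in the cloud yet,
// so only the upload manifest is checked.
func probeRemote(c *countingClient, prefix string, inProgress bool) (dataUploaded, hasManifest bool) {
	if !inProgress {
		dataUploaded = c.exists(prefix + "/data")
	}
	hasManifest = c.exists(prefix + "/upload_manifest")
	return dataUploaded, hasManifest
}

func main() {
	unknown, known := &countingClient{}, &countingClient{}
	probeRemote(unknown, "vb_0", false)
	probeRemote(known, "vb_0", true)
	fmt.Println(unknown.requests, known.requests)
}
```

      One request saved per store adds up quickly when the same probe runs once per vBucket.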


          People

            james.lee James Lee
