Couchbase Server / MB-45622

[CBM] Improve resilience to power outage/kill -9 scenarios

Details

    • Improvement
    • Resolution: Unresolved
    • Major
    • feature-backlog
    • 6.0.0, 6.0.1, 6.0.2, 6.0.3, 6.0.4, 6.0.5, 6.5.1, 6.6.0, 6.6.1, 6.6.2, 6.5.2, 6.5.0, 6.6.3, 7.0.0, 7.0.1, 7.1.0
    • tools
    • None
    • 1

    Description

      What's the issue?
      Although we do have logic to handle unexpected failures such as power outages and a 'kill -9', it is currently built on unsafe assumptions:

      1. That SQLite will be able to recover from a power outage (with its default settings this should be the case; however, we disable journals and syncing)
      2. That the 'RiftBufferedWriter' will be able to recover from a power outage (we use the 'sync_file_range' syscall as a performance optimization; this isn't safe on some filesystems because file metadata will not be written out as it would be with 'fdatasync').
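
To make the first assumption concrete, here is a minimal Python sketch of the SQLite settings involved. It assumes that "disable journals and syncing" corresponds to 'PRAGMA journal_mode = OFF' and 'PRAGMA synchronous = OFF'; the ticket does not state the exact pragmas 'cbbackupmgr' uses, and the file name and the 'durability_settings' helper are hypothetical.

```python
import os
import sqlite3
import tempfile


def durability_settings(conn):
    """Return (journal_mode, synchronous) as reported by SQLite."""
    journal_mode = conn.execute("PRAGMA journal_mode").fetchone()[0]
    synchronous = conn.execute("PRAGMA synchronous").fetchone()[0]
    return journal_mode, synchronous


# Hypothetical database file used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "example.sqlite")
conn = sqlite3.connect(path)

# Assumed equivalent of "we disable journals and syncing": no rollback
# journal and no fsync calls, so a crash mid-write can corrupt the file.
conn.execute("PRAGMA journal_mode = OFF")
conn.execute("PRAGMA synchronous = OFF")
unsafe = durability_settings(conn)

# A more conservative configuration: rollback journal plus full syncing.
conn.execute("PRAGMA journal_mode = DELETE")
conn.execute("PRAGMA synchronous = FULL")
safe = durability_settings(conn)

conn.close()
print(unsafe, safe)
```

SQLite reports the journal mode as a lowercase string and the synchronous level as an integer (0 is OFF, 2 is FULL), so the two tuples make the difference between the configurations easy to check.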

      Example
      Some of our tests prematurely 'kill -9' 'cbbackupmgr' to exercise resume support; this has led to situations where an invalid/corrupt SQLite file is detected. See this case where one of our tests failed due to a 'database disk image malformed' error.

      What's the fix?
      Ideally, we should better handle these situations where possible:

      1. Move away from using the truncate-overwrite pattern (MB-46878)
      2. Enable syncing for SQLite (potentially enable journals, although we need to consider that not all journal types are supported on NFS)
      3. Periodically sync using 'fsync' or 'fdatasync' in the 'RiftBufferedWriter'
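
Items 1 and 3 can be sketched roughly as follows in Python. This is an illustration under assumptions, not the 'cbbackupmgr' implementation: write-to-temp-then-rename is one common alternative to truncate-overwrite (the ticket does not say which approach MB-46878 takes), 'PeriodicSyncWriter' is a hypothetical stand-in for the 'RiftBufferedWriter', and the 'sync_every' threshold is arbitrary. The directory 'fsync' assumes a POSIX system.

```python
import os
import tempfile


def atomic_write(path, data):
    """Replace the file at path with data without truncating it in place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # data must be durable before the rename
        os.replace(tmp_path, path)  # readers see the old or the new file
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    # Sync the directory so the rename itself survives a power loss (POSIX).
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)


class PeriodicSyncWriter:
    """Hypothetical append-only writer that fsyncs every sync_every bytes."""

    def __init__(self, path, sync_every=1 << 20):
        self._file = open(path, "ab")
        self._sync_every = sync_every
        self._unsynced = 0
        self.sync_count = 0  # exposed only to make the behaviour observable

    def write(self, data):
        self._file.write(data)
        self._unsynced += len(data)
        if self._unsynced >= self._sync_every:
            self.sync()

    def sync(self):
        self._file.flush()              # push Python's buffer to the kernel
        os.fsync(self._file.fileno())   # push kernel buffers to the device
        self._unsynced = 0
        self.sync_count += 1

    def close(self):
        self.sync()
        self._file.close()
```

The sketch uses 'os.fsync' for portability; on Linux, 'os.fdatasync' could be substituted where skipping non-essential metadata is acceptable. The sync interval trades throughput against the amount of data that may need to be rewritten after a crash.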

      Attachments

        1. cbbackupmgr-collectinfo-backups-2021-04-13T082352.zip
          142 kB
        2. collectinfo-2021-04-13T083120-n_0@cb.local.zip
          24.73 MB
        3. collectinfo-2021-04-13T083943-n_0@cb.local.zip
          9.96 MB
        4. log.html
          4.11 MB
        5. output.xml
          24.57 MB
        6. report.html
          245 kB

            People

              james.lee James Lee
              james.lee James Lee
              Votes: 0
              Watchers: 2
