Couchbase Server / MB-38482

Zero-sized couchstore files found after power outage

Details

    • Bug
    • Resolution: User Error
    • Critical
    • None
    • 5.0.1
    • couchbase-bucket
    • None
    • ext4, default mount flags
    • Triaged
    • Centos 64-bit
    • No

    Description

      Scenario

      A power outage was followed by a restart.
      The monitoring system showed a drop in bucket items.
      About 25% of the data was completely lost.
      We recovered that 25% by hand and forgot about the problem.

      Two days later we had to restart the same node manually.
      It was a planned restart with no crashes.
      Again about 20% of the data was completely lost: the item-count graph for a bucket of Couchbase type simply dropped.

      Careful investigation showed zero-sized files next to files with content:

      root@c5sdp5:/mnt/data/couchbase/DynamicProfile# ls -la 224.couch.*
      -rw-rw---- 1 couchbase couchbase        0 Mar 24 15:49 224.couch.238758
      -rw-rw---- 1 couchbase couchbase 10797147 Mar 26 23:43 224.couch.281
      root@c5sdp5:/mnt/data/couchbase/DynamicProfile#
      

      Note that the file with real content has the low version number, while the zero-sized file has the high one.

      On a test system we managed to reproduce the problem:

      for i in {0..1024}; do touch "$i.couch.333333"; chown couchbase:couchbase "$i.couch.333333"; done
      

      Restarting with such empty files added to a data folder that also holds good files leaves Couchbase with 0 items in the bucket.
      Evidently, Couchbase loads only the highest-numbered version of each file.
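
This state can be checked for before a restart. A minimal sketch, assuming files are named `<vbucket>.couch.<version>` as in the listing above; the function name `find_shadowed_vbuckets` is hypothetical and not part of Couchbase:

```shell
# Hypothetical helper: report vbuckets whose highest-numbered file is empty
# while a lower-numbered file for the same vbucket still holds data.
# Usage (path taken from the listing above):
#   find_shadowed_vbuckets /mnt/data/couchbase/DynamicProfile
find_shadowed_vbuckets() {
    dir=$1
    for f in "$dir"/*.couch.*; do
        # Only empty regular files are candidates (this also skips an unmatched glob).
        [ -f "$f" ] && [ ! -s "$f" ] || continue
        base=${f##*/}
        vb=${base%%.couch.*}       # text before ".couch."
        rev=${base##*.couch.}      # text after ".couch."
        case $vb in ''|*[!0-9]*) continue ;; esac
        case $rev in ''|*[!0-9]*) continue ;; esac

        newer=0      # set to 1 if some file for this vbucket has a higher number
        data_rev=    # highest lower-numbered file that is non-empty
        for g in "$dir/$vb".couch.*; do
            other=${g##*.couch.}
            case $other in ''|*[!0-9]*) continue ;; esac
            if [ "$other" -gt "$rev" ]; then
                newer=1
            fi
            if [ -s "$g" ] && [ "$other" -lt "$rev" ]; then
                if [ -z "$data_rev" ] || [ "$other" -gt "$data_rev" ]; then
                    data_rev=$other
                fi
            fi
        done

        if [ "$newer" -eq 0 ] && [ -n "$data_rev" ]; then
            echo "vbucket $vb: empty revision $rev shadows revision $data_rev"
        fi
    done
}
```

The function only reports; it deliberately does not delete or rename anything, since what to do with the empty file is exactly the open question here.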

      Our analysis suggests that ext4's default handling of metadata versus file data is somewhat strange.
      Linus claims it

      But ext4 is a popular file system and Couchbase should run reasonably well there.

      What happened on the first restart after the power outage:
      1. By default, ext4 flushes metadata every 5 seconds but data only every 30 seconds.
      2. The power outage hit after the metadata had been flushed but before the data: the new version of the vbucket file had no data on disk yet, and the old version had already been removed.
      3. After the reboot, Couchbase found just one empty file for the vbucket. It ignored the file and started afresh with version=1.
      4. Couchbase kept the zero-sized file on disk, leaving a time bomb for the next restart.
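
The two intervals in step 1 appear to correspond to ext4's `commit=` mount option (documented default: 5 seconds) and the kernel's `vm.dirty_expire_centisecs` setting (default 3000, i.e. 30 seconds). That mapping is our assumption. A small sketch for checking both on a node; the helper names are made up for illustration:

```shell
# Convert centiseconds (the unit of the vm.dirty_* sysctls) to whole seconds.
centisecs_to_seconds() {
    echo $(( $1 / 100 ))
}

# Extract an explicit commit=N value from a mount-options string,
# or print "default" when the option is absent.
commit_from_options() {
    opts=$1
    case ",$opts" in
        *,commit=*)
            rest=${opts##*commit=}
            echo "${rest%%,*}"
            ;;
        *)
            echo default
            ;;
    esac
}

# Report the age at which dirty data becomes eligible for writeback.
if [ -r /proc/sys/vm/dirty_expire_centisecs ]; then
    expire=$(cat /proc/sys/vm/dirty_expire_centisecs)
    echo "dirty data expires after $(centisecs_to_seconds "$expire") s"
fi

# Report the commit interval option of every mounted ext4 file system.
if [ -r /proc/mounts ]; then
    while read -r dev mnt fstype mnt_opts dump_pass; do
        if [ "$fstype" = ext4 ]; then
            echo "$mnt: commit=$(commit_from_options "$mnt_opts")"
        fi
    done < /proc/mounts
fi
```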

      Here are the problems I see:

      Results

      1. Couchbase can lose data after a sudden system restart even when a previous version of the data is still on disk.
      2. Couchbase could probably take steps to ensure that the file system holds the latest version of a just-compacted file, but it does not (we don't see fsync/fdatasync in the code).
      3. When Couchbase starts with a zero-sized vbucket file and no other versions of that file, it starts the vbucket afresh with version=1 and leaves the junk zero-sized file in place. Keeping the junk is wrong (see below). Perhaps starting at all is wrong (needs discussion).
      4. When Couchbase starts and the file system is in this state:

      root@c5sdp5:/mnt/data/couchbase/DynamicProfile# ls -la 224.couch.*
      -rw-rw---- 1 couchbase couchbase        0 Mar 24 15:49 224.couch.238758
      -rw-rw---- 1 couchbase couchbase 10797147 Mar 26 23:43 224.couch.281
      root@c5sdp5:/mnt/data/couchbase/DynamicProfile#
      

      Couchbase selects the file with the highest version (.238758 in this example) and uses only that file, causing silent data loss.
      We feel it is wrong to use an empty file and ignore files that hold at least a previous version of that vbucket.
      5. Couchbase writes nothing to the log file about situations 3 and 4. In the UI we see only "bucket loaded", with no note that some files were zero-sized, which is obviously a critical, noteworthy event.
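
For problem 2 in the list above, the usual way to make a freshly written file survive a power cut is to flush the file before renaming it into place, then flush the directory. The sketch below shows that general pattern in shell; it is not Couchbase's compaction code. `durable_replace` is a hypothetical name, and the sketch assumes GNU coreutils (`dd conv=fsync`, and `sync` accepting a path, which needs coreutils 8.24 or later):

```shell
# Hypothetical helper: replace DEST with the contents of SRC so that a crash
# leaves either the complete old file or the complete new file on disk.
durable_replace() {
    src=$1
    dest=$2
    tmp_file=$dest.tmp.$$

    # Write the new contents and fsync them before dd exits.
    dd if="$src" of="$tmp_file" conv=fsync 2>/dev/null || return 1

    # Swap the new file into place; rename is atomic within one file system.
    mv -f "$tmp_file" "$dest" || return 1

    # Flush the directory so the rename itself reaches the disk.
    # Fall back to a global sync if this sync does not accept a path.
    sync "$(dirname "$dest")" 2>/dev/null || sync
}
```

The ordering is the point: the old file is only replaced after the new data has been flushed, so metadata can never get ahead of the data in the way described in the scenario.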

            People

              drigby Dave Rigby (Inactive)
              paf Alexander Petrossian (PAF)