Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-61641

Append corrupts log when disk errors are present

Details

    • Untriaged
    • 0
    • Unknown
    • Plasma-April-29-2024, Plasma-May-6-2024

    Description

      The error occurs when attempting to write a block that is larger than the remaining size of the segment.
      1. If the first write and the growLog call were successful, a new segment was added.
      2. The attempt to write to this new segment failed.
       
      The whole Append operation is failed. Then we have the tailOffset remains the same and pointing to the previous segment, even though a new segment has already been created and added. When a retry occurs, it incorrectly writes to the new segment as though it were the previous one, leading to corruption of the log.
       
      This issue is caused by the assumption that the tailOffset always points to the last segment.
       
      The problem was simulated in GSI test with 90% of write errors injected. The recoveryCleaner panics due to checksum failure.
       

      panic: LSS /tmp/gsi_testdir/devel_0_0/docIndex/recovery(shard16603717930189774736) : fatal: LSS Block is corrupted
       
      goroutine 227 [running]:
      github.com/couchbase/plasma.(*Shard).raisePanic(...)
      	/Users/rvncerr/couchbase/goproj/src/github.com/couchbase/plasma/shard.go:1277
      github.com/couchbase/plasma.(*LSSCtx).raiseCorrupted(0xc00041c3c0, {0x100b6c9e0?, 0xc0003ea190})
      	/Users/rvncerr/couchbase/goproj/src/github.com/couchbase/plasma/lssctx.go:202 +0x9d
      github.com/couchbase/plasma.(*RecoveryCleaner).CleanLSS.func3(0x1, 0xd?, {0x100b6c9e0?, 0xc0003ea190})
      	/Users/rvncerr/couchbase/goproj/src/github.com/couchbase/plasma/recovery_cleaner.go:315 +0xd9
      github.com/couchbase/plasma.(*RecoveryCleaner).CleanLSS.func6(0x1fffff72, 0x1fffff76?, {0x134ae6f76, 0x2042, 0x2a08a}, 0x0)
      	/Users/rvncerr/couchbase/goproj/src/github.com/couchbase/plasma/recovery_cleaner.go:396 +0x15b7
      github.com/couchbase/plasma.(*lsStore).RunCleaner.func1(0xc00046e7e0?, {0x134ae6f76?, 0xc001194880?, 0x100012d48?}, 0x101a4e108?)
      	/Users/rvncerr/couchbase/goproj/src/github.com/couchbase/plasma/lss.go:616 +0x47
      github.com/couchbase/plasma.(*lsStore).Visitor2(0xc00053fb08?, 0xc0634be8c2?, 0x20711000, 0xc00053fb38, 0xc001194880)
      	/Users/rvncerr/couchbase/goproj/src/github.com/couchbase/plasma/lss.go:652 +0x11c
      github.com/couchbase/plasma.(*lsStore).RunCleaner(0xc00046e7e0, 0xc0017fc370, 0xc00053fd08?)
      	/Users/rvncerr/couchbase/goproj/src/github.com/couchbase/plasma/lss.go:630 +0x125
      github.com/couchbase/plasma.(*LSSCtx).LSSRunCleaner(...)
      	/Users/rvncerr/couchbase/goproj/src/github.com/couchbase/plasma/lssctx.go:1016
      github.com/couchbase/plasma.(*RecoveryCleaner).CleanLSS(0xc00061a100, 0xc0000b9f00)
      	/Users/rvncerr/couchbase/goproj/src/github.com/couchbase/plasma/recovery_cleaner.go:526 +0x84e
      github.com/couchbase/plasma.(*RecoveryCleaner).lssCleanerDaemon(0xc00061a100)
      	/Users/rvncerr/couchbase/goproj/src/github.com/couchbase/plasma/recovery_cleaner.go:651 +0x22e
      created by github.com/couchbase/plasma.(*RecoveryCleaner).Start
      	/Users/rvncerr/couchbase/goproj/src/github.com/couchbase/plasma/recovery_cleaner.go:755 +0x5a
      

       
      Head and tail offsets of corrupted block were additionally printed:
       

      RECOVERY CLEANER 536870770 536879032
      

       
      This block was located across two segments.
       

      16:27:17 plasma % ls -la /tmp/gsi_testdir/devel_0_0/docIndex/recovery
      total 1132512
      drwxr-xr-x  6 rvncerr  wheel        192 Apr 23 15:47 .
      drwxr-xr-x  6 rvncerr  wheel        192 Apr 23 15:30 ..
      -rwxr-xr-x  1 rvncerr  wheel        107 Apr 23 15:47 error
      -rwxr-xr-x  1 rvncerr  wheel       8192 Apr 23 15:47 header.data
      -rwxr-xr-x  1 rvncerr  wheel  536870912 Apr 23 15:41 log.00000000000000.data
      -rwxr-xr-x  1 rvncerr  wheel  536965120 Apr 23 15:47 log.00000000000001.data
      

      The issue is reproducible in the following unit-test.

      func testLogWriteFailAcrossFiles(t *testing.T, path string) {
          // blockSize < segmentSize < 2*blockSize
          segmentSize := int64(1024)
          blockSize := int64(1008)
       
          Env = env.NewUnstableEnvironment(0, 0, 100)
          defer func() {
             Env = &env.DefaultEnvironment
          }()
       
          l, _ := newLog(path, segmentSize, syncMode, ioMode, true, env.DefaultEnvironment)
       
          makeBlock := func(sz int64, val byte) []byte {
             rv := make([]byte, sz)
             for i := range rv {
                rv[i] = val
             }
             return rv
          }
          checkBlock := func(bs []byte, val byte) bool {
             for i := range bs {
                if bs[i] != val {
                   return false
                }
             }
             return true
          }
       
          bs0 := makeBlock(blockSize, 0)
          bs1 := makeBlock(blockSize, 1)
       
          Env.SetErrorRateW(99)
          for {
             err := l.Append(bs0)
             if err != nil {
                fmt.Printf("bs0 append err : %v\n", err)
             } else {
                break
             }
          }
          Env.SetErrorRateW(0)
          l.Commit()
       
          Env.SetErrorRateW(99)
          for {
             err := l.Append(bs1)
             if err != nil {
                fmt.Printf("bs1 append err : %v\n", err)
             } else {
                break
             }
          }
          Env.SetErrorRateW(0)
          l.Commit()
       
          rbs0 := make([]byte, blockSize)
          rbs1 := make([]byte, blockSize)
          l.Read(rbs0, 0)
          l.Read(rbs1, blockSize)
       
          if !checkBlock(rbs0, 0) {
             t.Fatalf("rbs0 is corrupted : %v", rbs0)
          }
          if !checkBlock(rbs1, 1) {
             t.Fatalf("rbs1 is corrupted : %v", rbs1)
          }
       
          l.Close()
      }  

      Attachments

        For Gerrit Dashboard: MB-61641
        # Subject Branch Project Status CR V

        Activity

          People

            Dmitriy.Kalugin-Balashov Dmitriy Kalugin-Balashov
            Dmitriy.Kalugin-Balashov Dmitriy Kalugin-Balashov
            Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

              Created:
              Updated:

              Gerrit Reviews

                There is 1 open Gerrit change

                PagerDuty