Couchbase Server
MB-25086: panic in plasma recovery (index out of range) - windows


Details


    Description

      Reported by a forum user:

      https://forums.couchbase.com/t/indexer-in-warmup-state-please-retry-the-request-later-from-127-0-0-1-9101/13258

       
      panic: runtime error: index out of range
       
      goroutine 74 [running]:
      panic(0xce6d60, 0xc042018030)
              C:/Users/vagrant/cbdepscache/exploded/amd64/go-1.7.3/go/src/runtime/panic.go:500 +0x1af fp=0xc0420e97a8 sp=0xc0420e9718
      runtime.panicindex()
              C:/Users/vagrant/cbdepscache/exploded/amd64/go-1.7.3/go/src/runtime/panic.go:27 +0x74 fp=0xc0420e97d8 sp=0xc0420e97a8
      github.com/couchbase/nitro/plasma.(*Plasma).doRecovery.func1(0x769115c97, 0xc0458ba000, 0x0, 0xd5c8, 0x0, 0x0, 0x0)
              C:/Jenkins/workspace/couchbase-server-windows/goproj/src/github.com/couchbase/nitro/plasma/plasma.go:420 +0xd7a fp=0xc0420e98e0 sp=0xc0420e97d8
      github.com/couchbase/nitro/plasma.(*lsStore).visitor(0xc0456a6000, 0x7690d1bb5, 0x76913e4d5, 0xc04534cea0, 0xc04534ce80, 0xd29be0, 0xc041e68201)
              C:/Jenkins/workspace/couchbase-server-windows/goproj/src/github.com/couchbase/nitro/plasma/lss.go:297 +0x16e fp=0xc0420e9950 sp=0xc0420e98e0
      github.com/couchbase/nitro/plasma.(*lsStore).Visitor(0xc0456a6000, 0xc04534cea0, 0xc04534ce80, 0xc0458b8000, 0x1000)
              C:/Jenkins/workspace/couchbase-server-windows/goproj/src/github.com/couchbase/nitro/plasma/lss.go:286 +0xa4 fp=0xc0420e99a0 sp=0xc0420e9950
      github.com/couchbase/nitro/plasma.(*Plasma).doRecovery(0xc042184900, 0xc0452fa690, 0x21)
              C:/Jenkins/workspace/couchbase-server-windows/goproj/src/github.com/couchbase/nitro/plasma/plasma.go:486 +0x23c fp=0xc0420e9a68 sp=0xc0420e99a0
      github.com/couchbase/nitro/plasma.New(0x1e, 0x12c, 0x5, 0x4, 0xeb4a20, 0xeb49e8, 0xeb49f0, 0xeb4a30, 0xeb4a38, 0xeb4a28, ...)
              C:/Jenkins/workspace/couchbase-server-windows/goproj/src/github.com/couchbase/nitro/plasma/plasma.go:320 +0xdb3 fp=0xc0420e9e40 sp=0xc0420e9a68
      github.com/couchbase/indexing/secondary/indexer.(*plasmaSlice).initStores.func2(0xc0456a2030, 0xc04532a600, 0xc0456a2020, 0xc04525aa90, 0xc0456a2050)
              C:/Jenkins/workspace/couchbase-server-windows/goproj/src/github.com/couchbase/indexing/secondary/indexer/plasma_slice.go:231 +0x98 fp=0xc0420e9f68 sp=0xc0420e9e40
      runtime.goexit()
              C:/Users/vagrant/cbdepscache/exploded/amd64/go-1.7.3/go/src/runtime/asm_amd64.s:2086 +0x1 fp=0xc0420e9f70 sp=0xc0420e9f68
      created by github.com/couchbase/indexing/secondary/indexer.(*plasmaSlice).initStores
              C:/Jenkins/workspace/couchbase-server-windows/goproj/src/github.com/couchbase/indexing/secondary/indexer/plasma_slice.go:241 +0x11a7
      
      


          Activity

            Sarath Lakshman added a comment - Assigning to Bala to try to reproduce the issue.
            Sarath Lakshman added a comment - edited

            The crash during plasma recovery after a power failure is due to data corruption in the underlying data files. Plasma relies on the fsync() system call for durability: the operating system asks the device driver to flush all the data, but some disk controllers with write-back internal caches may falsely report that the data has been flushed to the disk/SSD when it has only been written to the cache. If the disk controller/RAID is not battery-backed, this can lead to data corruption. I suspect a similar event here, as the user reports that the corruption does not occur after a graceful OS shutdown.
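
            For context, below is a minimal Go sketch of the durability pattern described above. The file name, the appendAndSync helper, and the block layout are illustrative assumptions, not Plasma's actual code; the point is only that durability hinges on fsync (File.Sync in Go), and a controller with a non-battery-backed write-back cache can acknowledge that flush before the data is truly persistent.

            package main

            import (
                "log"
                "os"
            )

            // appendAndSync writes one block and then fsyncs. After Sync
            // returns, the data is expected to survive a power failure --
            // unless the disk controller acknowledged the flush from its
            // volatile write-back cache.
            func appendAndSync(f *os.File, block []byte) error {
                if _, err := f.Write(block); err != nil {
                    return err
                }
                return f.Sync() // fsync(): the durability point Plasma relies on
            }

            func main() {
                // Hypothetical file name, for illustration only.
                f, err := os.OpenFile("plasma.data.example", os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0644)
                if err != nil {
                    log.Fatal(err)
                }
                defer f.Close()
                if err := appendAndSync(f, []byte("example block")); err != nil {
                    log.Fatal(err)
                }
            }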
            Deepkaran Salooja added a comment - edited

            Sarath Lakshman, is it possible to detect such a situation and do something similar to what we have for forestdb (if the open API returns NO_HEADERS, we consider the file corrupted and clean it up)? Right now the indexer goes into a crash/recover loop and the index cannot be dropped.
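
            A rough Go sketch of the suggested detect-and-cleanup behavior. The Store type, openStore function, and ErrCorrupt sentinel are hypothetical stand-ins, not Plasma's real API; the shape mirrors the forestdb NO_HEADERS handling mentioned above.

            package recovery

            import (
                "errors"
                "os"
            )

            // Store and openStore are placeholders for the real open/recovery
            // path; a real implementation would replay the log and return
            // ErrCorrupt (rather than panicking) on a bad block.
            type Store struct{ dir string }

            var ErrCorrupt = errors.New("store corrupt")

            func openStore(dir string) (*Store, error) {
                if err := os.MkdirAll(dir, 0755); err != nil {
                    return nil, err
                }
                return &Store{dir: dir}, nil
            }

            // openOrReset treats corruption like forestdb's NO_HEADERS case:
            // delete the slice files and recreate an empty store instead of
            // letting the indexer enter a crash/recover loop.
            func openOrReset(dir string) (*Store, error) {
                s, err := openStore(dir)
                if err == nil {
                    return s, nil
                }
                if errors.Is(err, ErrCorrupt) {
                    if rmErr := os.RemoveAll(dir); rmErr != nil {
                        return nil, rmErr
                    }
                    return openStore(dir)
                }
                return nil, err
            }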

            Sarath Lakshman added a comment - I am thinking along similar lines. We do not have checksums, so detecting corrupt blocks is difficult, and even if we add checksums, our blocks are variable-length, so the length field itself can still get corrupted. I am considering a simple alternative scheme to catch a corrupted length: in most of these crashes the decoded length is a huge int64 value, so adding an upper-bound check on the length should let us detect corruption and avoid the panic.

            I am trying to confirm whether there is an underlying bug in the persistor or in the durability mode we have used in Plasma before making that change.
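
            A sketch of that upper-bound check, in Go. The 8-byte little-endian length header, the maxBlockLen constant, and the nextBlock helper are illustrative assumptions about the on-disk layout, not Plasma's actual format.

            package recovery

            import (
                "encoding/binary"
                "errors"
            )

            // maxBlockLen is a hypothetical cap on a legal block size; any
            // decoded length beyond it is treated as corruption.
            const maxBlockLen = 1 << 26 // 64 MiB, illustrative

            var errCorruptLen = errors.New("corrupt block length")

            // nextBlock decodes one variable-length block. Instead of trusting
            // the length and indexing past the buffer (the "index out of range"
            // panic seen in doRecovery), it validates the length first and
            // fails recovery cleanly on a torn or corrupted header.
            func nextBlock(buf []byte) (block, rest []byte, err error) {
                if len(buf) < 8 {
                    return nil, nil, errCorruptLen
                }
                n := int64(binary.LittleEndian.Uint64(buf[:8]))
                if n < 0 || n > maxBlockLen || n > int64(len(buf))-8 {
                    return nil, nil, errCorruptLen
                }
                return buf[8 : 8+n], buf[8+n:], nil
            }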

            Balakumaran Gopal added a comment - Update:

            1) Run 1: Created a Windows 10 64-bit VM. While indexes were building, hard-rebooted the host machine. We were able to repro the bug. Logs collected and uploaded to the apple Bangalore machine.

            2) Run 2: Same VM as above, but waited a little longer during index building before rebooting the host machine. We were not able to repro the bug.

            3) Run 3: Used a physical Debian 8 machine. While indexes were building (with constant data being pumped), hard-rebooted the host machine. Tried this 5 times; unable to repro the bug.

            Jeelan Poola added a comment - This would be a duplicate or a subset of the fixes coming in for MB-25440.

            Raju Suravarjjala added a comment - Bulk closing invalid bugs (Won't Fix, Duplicate, and User Error). Please feel free to reopen.

