Couchbase Server / MB-29764

Indexer crashes with goroutine stack exceeds 1000000000-byte limit

Details

    • Triaged
    • Unknown
    • Storage-Sprint-End-Jun-1-2018, Storage-Sprint-End-Jun-15-2018

    Description

      The indexer process fails with "goroutine stack exceeds 1000000000-byte limit" followed by "fatal error: stack overflow", which causes the indexer to crash.

      This is the stack trace:

      StorageMgr::handleCreateSnapshot Added New Snapshot Index: 2954511192241179090 PartitionId: 0 SliceId: 0 Crc64: 3092221115143794419 (SnapshotInfo: count:10889206 committed:false) SnapCreateDur 62.255µs SnapOpenDur 1.078635ms
      runtime: goroutine stack exceeds 1000000000-byte limit
      fatal error: stack overflow
      runtime stack:
      runtime.throw(0xe730bc, 0xe)
      /home/couchbase/.cbdepscache/exploded/x86_64/go-1.7.3/go/src/runtime/panic.go:566 +0x95 fp=0x7f317bffeb88 sp=0x7f317bffeb68
      runtime.newstack()
      /home/couchbase/.cbdepscache/exploded/x86_64/go-1.7.3/go/src/runtime/stack.go:1061 +0x416 fp=0x7f317bffed08 sp=0x7f317bffeb88
      runtime.morestack()
      /home/couchbase/.cbdepscache/exploded/x86_64/go-1.7.3/go/src/runtime/asm_amd64.s:366 +0x7f fp=0x7f317bffed10 sp=0x7f317bffed08
      goroutine 10783 [running]:
      github.com/couchbase/plasma.(*item).getPtrKeyItem(0xc4a6b64003, 0x0)
      /home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/plasma/item.go:100 fp=0xc5894cc2b8 sp=0xc5894cc2b0
      github.com/couchbase/plasma.(*item).Size(0xc4a6b64003, 0x0)

       


          Activity

            srinath.duvuru Srinath Duvuru added a comment - This appears to be due to memory corruption in plasma, which sends a recursive call into an endless loop and overflows the stack. The root cause of the corruption is not yet clear.

            jliang John Liang added a comment - Tai, I am adding vulcan to fix version as well, since I suspect this will also happen in vulcan.

            tai.tran Tai Tran (Inactive) added a comment - Got it, Thanks John Liang

            srinath.duvuru Srinath Duvuru added a comment - Memory corruption is causing a thread to end up in an infinite loop of recursive calls. Because neither the root cause of the corruption nor the stack of the thread that runs into the issue is known, I am adding logic to exit (panic) the recursive call after a fixed number of recursions. This will give us a stack and hopefully provide some more clues to the issue.
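To illustrate the idea of bailing out after a fixed number of recursions, here is a minimal sketch. It is not the plasma change itself: the `node` type, the `maxDepth` value, and the function names are all hypothetical, chosen only to show how a depth counter turns an unbounded recursion into a panic with a usable stack.

```go
package main

import "fmt"

// maxDepth is a hypothetical limit; the ticket does not state the value used.
const maxDepth = 64

// node is a hypothetical stand-in for an item that may reference another item.
type node struct {
	size int
	next *node
}

// totalSize sums sizes along the chain and panics once the recursion
// goes deeper than maxDepth, instead of recursing until the stack overflows.
func totalSize(n *node, depth int) int {
	if depth > maxDepth {
		panic(fmt.Sprintf("recursion depth %d exceeds limit %d, possible corruption", depth, maxDepth))
	}
	if n == nil {
		return 0
	}
	return n.size + totalSize(n.next, depth+1)
}

// safeTotalSize recovers the panic so this demo can report it as an error.
// A real process might let the panic propagate to get the stack dump.
func safeTotalSize(n *node) (size int, err error) {
	defer func() {
		if r := recover(); r != nil {
			size, err = 0, fmt.Errorf("%v", r)
		}
	}()
	return totalSize(n, 0), nil
}

func main() {
	good := &node{size: 8, next: &node{size: 4}}
	size, err := safeTotalSize(good)
	fmt.Println(size, err)

	// A node that points at itself simulates a corrupted reference.
	bad := &node{size: 8}
	bad.next = bad
	_, err = safeTotalSize(bad)
	fmt.Println(err)
}
```

Unlike a stack overflow, which the Go runtime treats as a fatal error, a panic raised at a small fixed depth produces a short goroutine trace that shows which caller entered the recursion.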

            sundar Sundar Sridharan (Inactive) added a comment - The current idea is to panic when the item pointer flag (meant to detect corruption) is set.
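A flag-based check of this kind could look roughly like the sketch below. The header layout, the `corruptFlag` bit position, and all names are assumptions made for illustration; the ticket does not describe how plasma encodes the flag.

```go
package main

import "fmt"

// corruptFlag is a hypothetical reserved bit that a valid header never sets.
const corruptFlag uint32 = 1 << 31

// header is a hypothetical item header word holding the payload size.
type header uint32

// size returns the payload size, panicking if the reserved bit is set.
func (h header) size() int {
	if uint32(h)&corruptFlag != 0 {
		panic(fmt.Sprintf("item header %#x has reserved flag set, possible corruption", uint32(h)))
	}
	return int(h)
}

// checkedSize recovers the panic so this demo can report it as an error.
func checkedSize(h header) (n int, err error) {
	defer func() {
		if r := recover(); r != nil {
			n, err = 0, fmt.Errorf("%v", r)
		}
	}()
	return h.size(), nil
}

func main() {
	fmt.Println(checkedSize(header(16)))

	// Setting the reserved bit simulates a corrupted header.
	fmt.Println(checkedSize(header(16) | header(corruptFlag)))
}
```

The check is cheap (one mask and compare), so it can sit on every access path, at the cost of only catching corruption that happens to flip the watched bit.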

            sarath Sarath Lakshman added a comment - We should review the plasma codebase for potential areas where unsafe memory access is performed without holding the BeginTx() and EndTx() barriers. There could be cases where a caller function holds the barriers and the called function assumes that the caller has already acquired them.
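One way to make such an assumption explicit during a review is to assert it inside the called function. The sketch below is hypothetical throughout: `txCtx`, its `BeginTx`/`EndTx` methods, and `mustHoldTx` are illustrative names and signatures, not plasma's API, and the counter is single-goroutine only.

```go
package main

import "fmt"

// txCtx is a hypothetical per-writer context that tracks barrier nesting.
// It is not safe for concurrent use; a real barrier would be.
type txCtx struct{ depth int }

func (c *txCtx) BeginTx() { c.depth++ }
func (c *txCtx) EndTx()   { c.depth-- }

// mustHoldTx panics if called outside a BeginTx/EndTx pair.
func (c *txCtx) mustHoldTx(op string) {
	if c.depth <= 0 {
		panic(op + ": called without BeginTx barrier")
	}
}

// readAt is an internal helper that relies on its caller holding the barrier.
func readAt(c *txCtx, data []byte, i int) byte {
	c.mustHoldTx("readAt")
	return data[i]
}

// lookup is an entry point that acquires the barrier itself.
func lookup(c *txCtx, data []byte, i int) byte {
	c.BeginTx()
	defer c.EndTx()
	return readAt(c, data, i)
}

// try runs f and reports whether it panicked.
func try(f func()) (panicked bool) {
	defer func() {
		if recover() != nil {
			panicked = true
		}
	}()
	f()
	return false
}

func main() {
	c := &txCtx{}
	data := []byte{1, 2, 3}
	fmt.Println(lookup(c, data, 1))

	// Calling the helper directly, without the barrier, trips the assertion.
	fmt.Println(try(func() { readAt(c, data, 1) }))
}
```

With an assertion like this in place, a code path that reaches the helper without the barrier fails loudly in tests rather than silently reading memory that may already have been reclaimed.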

            tai.tran Tai Tran (Inactive) added a comment - code merged to unstable, should go to master tonight.

            build-team Couchbase Build Team added a comment - Build couchbase-server-5.5.0-2851 contains plasma commit 259ad19 with commit message: MB-29764 item: Add corruption check for item data

            build-team Couchbase Build Team added a comment - Build couchbase-server-6.0.0-1212 contains plasma commit 259ad19 with commit message: MB-29764 item: Add corruption check for item data

            srinath.duvuru Srinath Duvuru added a comment - This fix only avoids the stack overflow; the root cause of the memory corruption that caused the overflow is still not addressed and is hard to determine. With this fix it is still possible to run into a corrupted item, but instead of a stack overflow, a panic will occur and a stack will be dumped. That should provide more details and will hopefully lead to determining the root cause. Also, we have requested information on the document details, such as key types and lengths, and will try a reproduction with synthetic data. Another planned exercise is to look through the potential areas where unsafe memory access is performed.

            tai.tran Tai Tran (Inactive) added a comment - Adding due date as 6/11. We'll continue to try to catch it in Vulcan, but because this is a 5.1.2 maintenance ticket, we will determine next week whether this should block Vulcan.

            tai.tran Tai Tran (Inactive) added a comment - Sarath Lakshman thought that the various fixes in MB-29800 could remedy this problem; CBSS-74 will try to reproduce the problem.

            tai.tran Tai Tran (Inactive) added a comment - Resolve this ticket along with MB-29800 for Vulcan; Srinath will continue to track it down for 5.1.2 via MB-29952 (a clone of this ticket) and with unit tests for ticket CBSS-74.

            People

              srinath.duvuru Srinath Duvuru
              krishna.doddi Krishna Doddi