MB-29764 (Couchbase Server)

Indexer crashes with goroutine stack exceeds 1000000000-byte limit

Details

    • Triaged
    • Unknown
    • Storage-Sprint-End-Jun-1-2018, Storage-Sprint-End-Jun-15-2018

    Description

      The indexer process crashes with a fatal stack overflow: a goroutine stack exceeds the 1000000000-byte limit.

      This is the stack trace:

      StorageMgr::handleCreateSnapshot Added New Snapshot Index: 2954511192241179090 PartitionId: 0 SliceId: 0 Crc64: 3092221115143794419 (SnapshotInfo: count:10889206 committed:false) SnapCreateDur 62.255µs SnapOpenDur 1.078635ms
      runtime: goroutine stack exceeds 1000000000-byte limit
      fatal error: stack overflow
      runtime stack:
      runtime.throw(0xe730bc, 0xe)
      /home/couchbase/.cbdepscache/exploded/x86_64/go-1.7.3/go/src/runtime/panic.go:566 +0x95 fp=0x7f317bffeb88 sp=0x7f317bffeb68
      runtime.newstack()
      /home/couchbase/.cbdepscache/exploded/x86_64/go-1.7.3/go/src/runtime/stack.go:1061 +0x416 fp=0x7f317bffed08 sp=0x7f317bffeb88
      runtime.morestack()
      /home/couchbase/.cbdepscache/exploded/x86_64/go-1.7.3/go/src/runtime/asm_amd64.s:366 +0x7f fp=0x7f317bffed10 sp=0x7f317bffed08
      goroutine 10783 [running]:
      github.com/couchbase/plasma.(*item).getPtrKeyItem(0xc4a6b64003, 0x0)
      /home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/plasma/item.go:100 fp=0xc5894cc2b8 sp=0xc5894cc2b0
      github.com/couchbase/plasma.(*item).Size(0xc4a6b64003, 0x0)

       

          Activity

            krishna.doddi Krishna Doddi created issue -
            matt.carabine Matt Carabine made changes -
            Field Original Value New Value
            Link This issue blocks CBSE-5247 [ CBSE-5247 ]
            matt.carabine Matt Carabine made changes -
            Component/s storage-engine [ 10175 ]
            matt.carabine Matt Carabine made changes -
            Assignee Jeelan Poola [ jeelan.poola ] Srinath Duvuru [ srinath.duvuru ]
            jliang John Liang made changes -
            Component/s secondary-index [ 11211 ]

            srinath.duvuru Srinath Duvuru added a comment - This appears to be caused by memory corruption in plasma, which sends a recursive call into an infinite loop until the stack overflows. The root cause of the corruption is not yet clear.
            tai.tran Tai Tran (Inactive) made changes -
            Fix Version/s 5.1.2 [ 15204 ]
            tai.tran Tai Tran (Inactive) made changes -
            Sprint Storage-Sprint-End-Jun-1-2018 [ 588 ]
            tai.tran Tai Tran (Inactive) made changes -
            Rank Ranked higher
            jliang John Liang added a comment - Tai, I am adding vulcan to fix version as well, since I suspect this will also happen in vulcan.
            jliang John Liang made changes -
            Fix Version/s vulcan [ 14610 ]
            jliang John Liang made changes -
            Priority Major [ 3 ] Critical [ 2 ]

            tai.tran Tai Tran (Inactive) added a comment - Got it, Thanks John Liang

            srinath.duvuru Srinath Duvuru added a comment - A memory corruption is causing a thread to end up in an infinite loop of recursive calls. As neither the root cause of the corruption nor the stack of the thread that runs into the issue is known, I am adding logic to exit (panic) the recursive call after a fixed number of recursions. This will give us a stack and hopefully provide some more clues about the issue.
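
            A minimal Go sketch of the depth-limit idea described in this comment. It is not the plasma change itself: the `node` type, `maxDepth`, `sizeOf`, and `safeSizeOf` are hypothetical names chosen for illustration, and the bound of 64 is an assumed value. The point is that a Go stack overflow is a fatal error that cannot be recovered, whereas an explicit panic after a bounded number of recursions can be recovered and reported with a usable stack.

```go
package main

import "fmt"

// node is a hypothetical stand-in for an item that may refer to another item.
type node struct {
	size int
	next *node // nil for a terminal node
}

// maxDepth is an assumed bound; legitimate chains are expected to be far shorter.
const maxDepth = 64

// sizeOf follows the chain recursively and panics once the depth exceeds
// maxDepth, instead of recursing until the runtime aborts with a stack overflow.
func sizeOf(n *node, depth int) int {
	if depth > maxDepth {
		panic(fmt.Sprintf("sizeOf: recursion depth %d exceeds %d, possible corruption", depth, maxDepth))
	}
	if n.next == nil {
		return n.size
	}
	return sizeOf(n.next, depth+1)
}

// safeSizeOf converts the panic into an error to show that it is recoverable.
func safeSizeOf(n *node) (size int, err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("%v", r)
		}
	}()
	return sizeOf(n, 0), nil
}

func main() {
	good := &node{next: &node{size: 42}}
	size, err := safeSizeOf(good)
	fmt.Println(size, err)

	bad := &node{}
	bad.next = bad // simulated corruption: the node refers to itself
	size, err = safeSizeOf(bad)
	fmt.Println(size, err)
}
```

            In a real process the panic would normally be left unrecovered so that the runtime dumps the goroutine stack, which is the diagnostic the comment is after.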
            wayne Wayne Siu made changes -
            Triage Untriaged [ 10351 ] Triaged [ 10350 ]
            sarath Sarath Lakshman made changes -
            Labels customer secondary-index customer plasma secondary-index

            sundar Sundar Sridharan (Inactive) added a comment - The current idea is to panic when the item pointer flag (meant to detect corruption) is set.
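
            One possible reading of this idea, assumed here rather than taken from the plasma source: an item whose header carries a pointer flag refers to another item, and the referenced item is expected to be a plain one. If the referenced item also has the flag set, the data is treated as corrupted and the code panics instead of following the reference again. The `entry` type, the `ptrFlag` bit position, and the method names below are all hypothetical.

```go
package main

import "fmt"

// ptrFlag is a hypothetical header bit marking an entry that refers to another entry.
const ptrFlag uint32 = 1 << 31

// entry is a hypothetical stand-in for a stored item.
type entry struct {
	hdr    uint32 // flag bit plus payload length
	target *entry // only meaningful when ptrFlag is set
}

func (e *entry) isPtr() bool { return e.hdr&ptrFlag != 0 }

// resolve returns the entry that holds the data. A pointer entry must refer to
// a plain entry; anything else is treated as corruption and triggers a panic.
func (e *entry) resolve() *entry {
	if !e.isPtr() {
		return e
	}
	t := e.target
	if t == nil || t.isPtr() {
		panic("entry: pointer entry refers to nil or to another pointer entry, possible corruption")
	}
	return t
}

// size returns the payload length stored in the resolved entry's header.
func (e *entry) size() int {
	return int(e.resolve().hdr &^ ptrFlag)
}

// didPanic reports whether f panicked.
func didPanic(f func()) (panicked bool) {
	defer func() {
		if recover() != nil {
			panicked = true
		}
	}()
	f()
	return false
}

func main() {
	plain := &entry{hdr: 10}
	ptr := &entry{hdr: ptrFlag, target: plain}
	fmt.Println(ptr.size())

	bad := &entry{hdr: ptrFlag}
	bad.target = bad // simulated corruption: a pointer entry that refers to itself
	fmt.Println(didPanic(func() { bad.size() }))
}
```

            Checking the flag at the single point where a reference is followed keeps the check cheap and removes the need for recursion in this sketch.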
            tai.tran Tai Tran (Inactive) made changes -
            Due Date 04/Jun/18
            tai.tran Tai Tran (Inactive) made changes -
            Sprint Storage-Sprint-End-Jun-1-2018 [ 588 ] Storage-Sprint-End-Jun-1-2018, Storage-Sprint-End-Jun-29-2019 [ 588, 591 ]
            tai.tran Tai Tran (Inactive) made changes -
            Rank Ranked higher
            tai.tran Tai Tran (Inactive) made changes -
            Sprint Storage-Sprint-End-Jun-1-2018, Storage-Sprint-End-Jun-29-2019 [ 588, 591 ] Storage-Sprint-End-Jun-1-2018, Storage-Sprint-End-Jun-15-2018 [ 588, 590 ]
            tai.tran Tai Tran (Inactive) made changes -
            Rank Ranked lower
            tai.tran Tai Tran (Inactive) made changes -
            Sprint Storage-Sprint-End-Jun-1-2018, Storage-Sprint-End-Jun-15-2018 [ 588, 590 ] Storage-Sprint-End-Jun-1-2018, Storage-Sprint-End-Jun-29-2019 [ 588, 591 ]
            tai.tran Tai Tran (Inactive) made changes -
            Rank Ranked higher
            tai.tran Tai Tran (Inactive) made changes -
            Sprint Storage-Sprint-End-Jun-1-2018, Storage-Sprint-End-Jun-29-2019 [ 588, 591 ] Storage-Sprint-End-Jun-1-2018, Storage-Sprint-End-Jun-15-2018 [ 588, 590 ]
            tai.tran Tai Tran (Inactive) made changes -
            Rank Ranked lower

            sarath Sarath Lakshman added a comment - We should review the plasma codebase for potential areas where unsafe memory access is performed without holding the BeginTx() and EndTx() barriers. There could be cases where a caller function holds the barriers and the called function assumes that the caller has already acquired the barrier.
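
            A hedged sketch of one way to make the "callee assumes the caller holds the barrier" convention explicit and checkable. This does not reflect the actual plasma API beyond the names BeginTx and EndTx mentioned above: the `store` and `txToken` types, the token-returning signature, and `read` and `lookupTwice` are assumptions for illustration, and the sketch does not model memory reclamation at all. The callee takes a token that only BeginTx hands out and panics if the token is missing, already ended, or from another store.

```go
package main

import "fmt"

// store is a hypothetical structure whose reads must happen inside a
// BeginTx/EndTx window.
type store struct {
	data map[string]int
}

// txToken is handed out by BeginTx and invalidated by EndTx.
type txToken struct {
	owner  *store
	active bool
}

// BeginTx opens the window and returns the token that accessors require.
func (s *store) BeginTx() *txToken { return &txToken{owner: s, active: true} }

// EndTx closes the window; any later use of the token is rejected.
func (s *store) EndTx(t *txToken) { t.active = false }

// read checks the token instead of silently assuming the caller holds the barrier.
func (s *store) read(t *txToken, key string) int {
	if t == nil || !t.active || t.owner != s {
		panic("store.read: called outside a BeginTx/EndTx window")
	}
	return s.data[key]
}

// lookupTwice is a called function: it receives the caller's token explicitly,
// so its dependence on the barrier is visible in its signature.
func lookupTwice(s *store, t *txToken, key string) int {
	return s.read(t, key) + s.read(t, key)
}

func main() {
	s := &store{data: map[string]int{"k": 21}}
	tok := s.BeginTx()
	fmt.Println(lookupTwice(s, tok, "k"))
	s.EndTx(tok)
}
```

            With this convention, a review for unguarded access reduces to finding accessors that do not take a token, and a misuse at run time fails loudly at the access site rather than corrupting memory later.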

            tai.tran Tai Tran (Inactive) added a comment - code merged to unstable, should go to master tonight.
            tai.tran Tai Tran (Inactive) made changes -
            Link This issue relates to MB-29952 [ MB-29952 ]
            tai.tran Tai Tran (Inactive) made changes -
            Fix Version/s 5.1.2 [ 15204 ]

            build-team Couchbase Build Team added a comment - Build couchbase-server-5.5.0-2851 contains plasma commit 259ad19 with commit message: MB-29764 item: Add corruption check for item data

            build-team Couchbase Build Team added a comment - Build couchbase-server-6.0.0-1212 contains plasma commit 259ad19 with commit message: MB-29764 item: Add corruption check for item data

            srinath.duvuru Srinath Duvuru added a comment - This fix only avoids the stack overflow; the root cause of the memory corruption that led to the overflow is still not addressed and is hard to determine. With this fix it is still possible to run into a corrupted item, but instead of a stack overflow, a panic will occur and a stack will be dumped. That should provide more details and will hopefully lead to the root cause. We have also requested information on the document details, such as key types and lengths, and will try to reproduce the problem with synthetic data. Another planned exercise is to look through the potential areas where unsafe memory access is performed.
            tai.tran Tai Tran (Inactive) made changes -
            Due Date 04/Jun/18 11/Jun/18

            tai.tran Tai Tran (Inactive) added a comment - Adding a due date of 6/11. We'll continue to try to catch it in Vulcan, but because this is a 5.1.2 maintenance ticket, we will decide next week whether it should block Vulcan.
            wayne Wayne Siu made changes -
            Link This issue blocks MB-29253 [ MB-29253 ]

            tai.tran Tai Tran (Inactive) added a comment - Sarath Lakshman thought that the various fixes in MB-29800 could remedy this problem. CBSS-74 will try to reproduce it.

            tai.tran Tai Tran (Inactive) added a comment - Resolving this ticket along with MB-29800 for Vulcan. Srinath will continue to track it down for 5.1.2 via MB-29952 (a clone of this ticket) and with unit tests for ticket CBSS-74.
            tai.tran Tai Tran (Inactive) made changes -
            Resolution Fixed [ 1 ]
            Status Open [ 1 ] Resolved [ 5 ]
            ritam.sharma Ritam Sharma made changes -
            VERIFICATION STEPS http://qa.sc.couchbase.com/job/centos-systest-launcher/1524/parameters/
            A longevity test was run for 3 days on build 2941. The issue was not seen, hence closing the issue.
            Status Resolved [ 5 ] Closed [ 6 ]

            People

              Assignee: srinath.duvuru Srinath Duvuru
              Reporter: krishna.doddi Krishna Doddi