Couchbase Server / MB-45459

[System Test] : Memcached crashes seen on 1 KV node - ThrowExceptionUnderflowPolicy current:0 arg:-143

    Description

      Build : 7.0.0-4857
      Test : -test tests/integration/cheshirecat/test_cheshirecat_kv_gsi_coll_xdcr_backup_sgw_fts_itemct_txns_eventing_cbas_scale3.yml -scope tests/integration/cheshirecat/scope_cheshirecat_with_backup.yml
      Scale : 3
      Iteration : 1st

      On 172.23.120.86, between 2021-04-03T23:05:36 and 2021-04-03T23:08:50, there are 11 occurrences of memcached crashes with the following stack trace. The one shown below is the first.

      2021-04-03T23:05:36.212700-07:00  *** Fatal error encountered during exception handling ***
      2021-04-03T23:05:36.212757-07:00 CRITICAL Caught unhandled std::exception-derived exception. what(): ThrowExceptionUnderflowPolicy current:0 arg:-143
      2021-04-03T23:05:36.212761-07:00 CRITICAL Exception thrown from:
      2021-04-03T23:05:36.212792-07:00 CRITICAL     #0  /opt/couchbase/bin/../lib/libep.so() [0x7fcfbeb84000+0x80730]
      2021-04-03T23:05:36.212802-07:00 CRITICAL     #1  /opt/couchbase/bin/../lib/libep.so() [0x7fcfbeb84000+0x8093c]
      2021-04-03T23:05:36.212814-07:00 CRITICAL     #2  /opt/couchbase/bin/../lib/libep.so() [0x7fcfbeb84000+0x267722]
      2021-04-03T23:05:36.212846-07:00 CRITICAL     #3  /opt/couchbase/bin/../lib/libep.so() [0x7fcfbeb84000+0x2320df]
      2021-04-03T23:05:36.212856-07:00 CRITICAL     #4  /opt/couchbase/bin/../lib/libep.so() [0x7fcfbeb84000+0x232e11]
      2021-04-03T23:05:36.212865-07:00 CRITICAL     #5  /opt/couchbase/bin/../lib/libep.so() [0x7fcfbeb84000+0x111b42]
      2021-04-03T23:05:36.212873-07:00 CRITICAL     #6  /opt/couchbase/bin/../lib/libep.so() [0x7fcfbeb84000+0x116437]
      2021-04-03T23:05:36.212880-07:00 CRITICAL     #7  /opt/couchbase/bin/../lib/libep.so() [0x7fcfbeb84000+0x116b77]
      2021-04-03T23:05:36.212889-07:00 CRITICAL     #8  /opt/couchbase/bin/../lib/libep.so() [0x7fcfbeb84000+0x16d767]
      2021-04-03T23:05:36.212896-07:00 CRITICAL     #9  /opt/couchbase/bin/../lib/libep.so() [0x7fcfbeb84000+0x16e6b9]
      2021-04-03T23:05:36.212904-07:00 CRITICAL     #10 /opt/couchbase/bin/../lib/libep.so() [0x7fcfbeb84000+0x170fc3]
      2021-04-03T23:05:36.212910-07:00 CRITICAL     #11 /opt/couchbase/bin/../lib/libep.so() [0x7fcfbeb84000+0x16b2bf]
      2021-04-03T23:05:36.212920-07:00 CRITICAL     #12 /opt/couchbase/bin/../lib/libep.so() [0x7fcfbeb84000+0x2e6a30]
      2021-04-03T23:05:36.212930-07:00 CRITICAL     #13 /opt/couchbase/bin/../lib/libep.so() [0x7fcfbeb84000+0x2cef0a]
      2021-04-03T23:05:36.212939-07:00 CRITICAL     #14 /opt/couchbase/bin/../lib/libep.so() [0x7fcfbeb84000+0x2e99e9]
      2021-04-03T23:05:36.212948-07:00 CRITICAL     #15 /opt/couchbase/bin/../lib/libep.so() [0x7fcfbeb84000+0x169828]
      2021-04-03T23:05:36.212955-07:00 CRITICAL     #16 /opt/couchbase/bin/../lib/libep.so() [0x7fcfbeb84000+0x169703]
      2021-04-03T23:05:36.212996-07:00 CRITICAL     #17 /opt/couchbase/bin/../lib/libstdc++.so.6() [0x7fcfbb42f000+0xb9dcf]
      2021-04-03T23:05:36.213002-07:00 CRITICAL     #18 /lib64/libpthread.so.0() [0x7fcfbacfa000+0x7e65]
      2021-04-03T23:05:36.213038-07:00 CRITICAL     #19 /lib64/libc.so.6(clone+0x6d) [0x7fcfba92c000+0xfe88d]
      2021-04-03T23:05:36.472075-07:00 WARNING (No Engine) Slow runtime for 'DurabilityTimeoutVisitor on vb:251' on thread NonIoPool1: 214 ms
      2021-04-03T23:05:36.473734-07:00 WARNING (No Engine) Slow runtime for 'DurabilityTimeoutVisitor on vb:874' on thread NonIoPool0: 216 ms
      2021-04-03T23:05:36.530701-07:00 CRITICAL Breakpad caught a crash (Couchbase version 7.0.0-4857). Writing crash dump to /opt/couchbase/var/lib/couchbase/crash/f8012fcb-fe5b-4485-7ec5a98e-a983712c.dmp before terminating.
      2021-04-03T23:05:36.530733-07:00 CRITICAL Stack backtrace of crashed thread:
      2021-04-03T23:05:36.530902-07:00 CRITICAL     #0  /opt/couchbase/bin/memcached() [0x400000+0x14b8dd]
      2021-04-03T23:05:36.530910-07:00 CRITICAL     #1  /opt/couchbase/bin/../lib/libdefault_engine.so(_ZN15google_breakpad16ExceptionHandler12GenerateDumpEPNS0_12CrashContextE+0x3ea) [0x7fcfbf249000+0x3090a]
      2021-04-03T23:05:36.530917-07:00 CRITICAL     #2  /opt/couchbase/bin/../lib/libdefault_engine.so(_ZN15google_breakpad16ExceptionHandler13SignalHandlerEiP9siginfo_tPv+0xb8) [0x7fcfbf249000+0x30c48]
      2021-04-03T23:05:36.530925-07:00 CRITICAL     #3  /lib64/libpthread.so.0() [0x7fcfbacfa000+0xf5f0]
      2021-04-03T23:05:36.530948-07:00 CRITICAL     #4  /lib64/libc.so.6(gsignal+0x37) [0x7fcfba92c000+0x36337]
      2021-04-03T23:05:36.531093-07:00 CRITICAL     #5  /lib64/libc.so.6(abort+0x148) [0x7fcfba92c000+0x37a28]
      2021-04-03T23:05:36.531140-07:00 CRITICAL     #6  /opt/couchbase/bin/../lib/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x125) [0x7fcfbb42f000+0x91195]
      2021-04-03T23:05:36.531166-07:00 CRITICAL     #7  /opt/couchbase/bin/memcached() [0x400000+0x15a9d2]
      2021-04-03T23:05:36.531193-07:00 CRITICAL     #8  /opt/couchbase/bin/../lib/libstdc++.so.6() [0x7fcfbb42f000+0x8ef86]
      2021-04-03T23:05:36.531215-07:00 CRITICAL     #9  /opt/couchbase/bin/../lib/libstdc++.so.6() [0x7fcfbb42f000+0x8efd1]
      2021-04-03T23:05:36.531241-07:00 CRITICAL     #10 /opt/couchbase/bin/../lib/libstdc++.so.6() [0x7fcfbb42f000+0xb9dfe]
      2021-04-03T23:05:36.531248-07:00 CRITICAL     #11 /lib64/libpthread.so.0() [0x7fcfbacfa000+0x7e65]
      2021-04-03T23:05:36.531334-07:00 CRITICAL     #12 /lib64/libc.so.6(clone+0x6d) [0x7fcfba92c000+0xfe88d]
      

          Activity

            Ben Huddleston added a comment:
            Suspect there's an intersection of the two fixes that Jim Walker listed above that affects this stat and wasn't accounted for. It looks like a "SuccessExistingItem" result from Checkpoint::queueDirty() will account the wrong value if the flush eventually fails, as we use a pre-adjusted value in persistenceFailureStatOvercounts. Same for queue time. Will write a unit test tomorrow and verify.

            Ben Huddleston added a comment:
            The issue here occurs when we update the size of the item while a flush is running and that flush subsequently fails. Code was added in http://review.couchbase.org/c/kv_engine/+/148286 to deal with the queue time changing for this item, but it does not consider the size (which contributes towards dirtyQueuePendingWrites).
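
To make the failure mode concrete, here is a minimal sketch of how a non-negative stat can underflow when an item's size changes between the increment and a later correction. Everything in it is an assumption for illustration: `NonNegativeStat`, `replayMismatchedAccounting`, the sizes 100 and 243, and the order of operations are invented, and only the message format is copied from the log line in this ticket. The real kv_engine sequence may well differ.

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for a stat counter that refuses to go negative.
// The message format mirrors the log line in this ticket; the class itself
// is an illustration, not the kv_engine implementation.
class NonNegativeStat {
public:
    void add(std::int64_t delta) {
        if (delta < 0) {
            const auto dec = static_cast<std::uint64_t>(-delta);
            if (dec > value) {
                throw std::underflow_error(
                        "ThrowExceptionUnderflowPolicy current:" +
                        std::to_string(value) +
                        " arg:" + std::to_string(delta));
            }
            value -= dec;
        } else {
            value += static_cast<std::uint64_t>(delta);
        }
    }

    std::uint64_t get() const { return value; }

private:
    std::uint64_t value = 0;
};

// Replays an invented sequence. Returns what() of the exception, or an
// empty string if nothing was thrown.
std::string replayMismatchedAccounting() {
    NonNegativeStat pendingWrites;
    const std::int64_t sizeWhenQueued = 100;   // illustrative value
    const std::int64_t sizeAfterUpdate = 243;  // illustrative value

    pendingWrites.add(sizeWhenQueued);   // item queued: stat = 100
    // The item grows while the flush is in flight, but in this model the
    // stat is never adjusted for the new size.
    pendingWrites.add(-sizeWhenQueued);  // in-flight flush accounted: stat = 0
    try {
        // Failure handling applies a correction based on the size
        // difference, which was never added to the stat: delta = -143.
        pendingWrites.add(sizeWhenQueued - sizeAfterUpdate);
    } catch (const std::underflow_error& e) {
        return e.what();
    }
    return {};
}
```

Calling `replayMismatchedAccounting()` in this model returns a message of the same shape as the one in the crash log, because the correction subtracts bytes that were never counted.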

            Couchbase Build Team added a comment:
            Build couchbase-server-7.0.0-4934 contains kv_engine commit 0c84c55 with commit message:
            MB-45459: Remove unused param from VBucket::accountItem()

            Couchbase Build Team added a comment:
            Build couchbase-server-7.0.0-4934 contains kv_engine commit ff66224 with commit message:
            MB-45459: Pass old item to persistenceFailureStatOvercounts
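
The commit itself is not reproduced in this ticket, so the following is only a general sketch of why having the superseded item available can matter for this kind of stat. `QueuedItem`, `PendingWrites`, and `replayWithOldItem` are hypothetical names, and the sizes are illustrative. The idea is that each decrement uses the size that was actually added for that version of the item.

```cpp
#include <cstddef>
#include <stdexcept>

// Hypothetical item carrying only the size that feeds the stat.
struct QueuedItem {
    std::size_t size;
};

// Hypothetical pending-writes stat, not the kv_engine implementation.
class PendingWrites {
public:
    void onQueue(const QueuedItem& item) {
        bytes += item.size;
    }

    // Replace an already-queued item: remove exactly what the old version
    // contributed, then add the new version's size.
    void onReplace(const QueuedItem& oldItem, const QueuedItem& newItem) {
        remove(oldItem.size);
        bytes += newItem.size;
    }

    void onPersisted(const QueuedItem& item) {
        remove(item.size);
    }

    std::size_t get() const { return bytes; }

private:
    void remove(std::size_t n) {
        if (n > bytes) {
            throw std::underflow_error("pending-writes underflow");
        }
        bytes -= n;
    }

    std::size_t bytes = 0;
};

// Replays queue -> replace -> persist and returns the final stat value.
std::size_t replayWithOldItem() {
    PendingWrites stat;
    const QueuedItem oldItem{100};  // illustrative size
    const QueuedItem newItem{243};  // illustrative size

    stat.onQueue(oldItem);             // stat = 100
    stat.onReplace(oldItem, newItem);  // stat = 243
    stat.onPersisted(newItem);         // stat = 0
    return stat.get();
}
```

In this model the stat returns to zero once the newest version is persisted, since every subtraction is paired with the size that was previously added.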

            Mihir Kamdar (Inactive) added a comment:
            Not seeing this issue reproduce in the longevity test run on 7.0.0-4955. The test completed the first iteration.
