  1. Couchbase Server
  2. MB-55271

CDC: deadlock in magma (rebalance test)

Details

    • Bug
    • Resolution: Fixed
    • Major
    • None
    • None
    • storage-engine
    • None
    • Untriaged
    • 0
    • Unknown

    Description

      Issue occurs in a development sandbox and requires uncommitted changes.

      Bug noted when doing a 2->3 node rebalance with history retention enabled. The rebalance made some progress but ultimately stopped. At least one node appeared to be deadlocked, with writer (KV flush) and AuxIO (DCP backfill) threads possibly locked.

      This was noted on cluster_run (macOS) with the following magma change.

      I've captured lldb backtraces for all threads: a complete one (bt_all.txt) and a trimmed one (bt_all_trimmed.txt) which hopefully shows the interesting paths.

      For example, the AuxIO thread below is blocked acquiring a mutex, while the writer thread is in a timed wait on a condition variable (see the attached files for the full trace output).

      thread #38, name = 'AuxIoPool2'
       frame #0: 0x00007ff81745fbd2 libsystem_kernel.dylib`__psynch_mutexwait + 10
       frame #1: 0x00007ff817497e7e libsystem_pthread.dylib`_pthread_mutex_firstfit_lock_wait + 76
       frame #2: 0x00007ff817495cbb libsystem_pthread.dylib`_pthread_mutex_firstfit_lock_slow + 205
       frame #3: 0x00007ff8173fa719 libc++.1.dylib`std::__1::mutex::lock() + 9
       frame #4: 0x000000010a891081 memcached`magma::WALOffset::String() [inlined] std::__1::basic_string<char, std:
       frame #5: 0x000000010a89225a memcached`magma::WALOffset::String(this=<unavailable>) at wal_offset.cc:25 [opt]
       frame #6: 0x000000010a80631a memcached`magma::Magma::Impl::syncKVStore(this=<unavailable>, kvID=<unavailable>
       frame #7: 0x000000010a806122 memcached`magma::Magma::Impl::SyncKVStore(this=<unavailable>, kvID=<unavailable>
       frame #8: 0x000000010a806451 memcached`magma::Magma::SyncKVStore(this=<unavailable>, kvID=<unavailable>) at d
       frame #9: 0x000000010a58206f memcached`MagmaMemoryTrackingProxy::SyncKVStore(this=0x0000000110c33058, kvID=<u
       frame #10: 0x000000010a5db4b9 memcached`MagmaKVStore::makeFileHandle(this=0x0000000110c0d400, vbid=(vbid = 9)
       frame #11: 0x000000010a5d2a0f memcached`MagmaKVStore::initBySeqnoScanContext(this=0x0000000110c0d400, cb=Stat
       frame #12: 0x000000010a791eaa memcached`DCPBackfillBySeqnoDisk::create(this=0x0000000110ecbec0) at backfill_b
       frame #13: 0x000000010a794ac2 memcached`DCPBackfillDisk::run(this=0x0000000110ecbec0) at backfill_disk.cc:151
       frame #14: 0x000000010a795a9f memcached`BackfillManager::backfill(this=0x00000001110c8998) at backfill-manage
       frame #15: 0x000000010a795609 memcached`BackfillManagerTask::run(this=0x000000010b934738) at backfill-manager
       frame #16: 0x000000010a9048bf memcached`GlobalTask::execute(this=0x000000010b934738, threadName="AuxIoPool2")
       ...
       frame #35: 0x00007ff81749a4e1 libsystem_pthread.dylib`_pthread_start + 125
       frame #36: 0x00007ff817495f6b libsystem_pthread.dylib`thread_start + 15
       
      thread #32, name = 'WriterPool0'
       frame #0: 0x00007ff8174603ea libsystem_kernel.dylib`__psynch_cvwait + 10
       frame #1: 0x00007ff81749aa6f libsystem_pthread.dylib`_pthread_cond_wait + 1249
       frame #2: 0x00007ff8173f8d93 libc++.1.dylib`std::__1::condition_variable::__do_timed_wait(std::__1::unique_
       frame #3: 0x000000010a87d433 memcached`magma::WALOffset::String() [inlined] std::__1::basic_stringbuf<char,
       frame #4: 0x000000010a887c1e memcached`magma::WALOffset::String() [inlined] std::__1::basic_stringbuf<char,
       frame #5: 0x000000010a890658 memcached`magma::WALOffset::String() [inlined] std::__1::basic_ios<char, std::
       frame #6: 0x000000010a896575 memcached`nlohmann::basic_json<std::__1::map, std::__1::vector, std::__1::basi
       frame #7: 0x000000010a8238ef memcached`magma::Magma::Impl::WriteDocs(this=<unavailable>, kvID=<unavailable>
       frame #8: 0x000000010a823c2e memcached`magma::Magma::WriteDocs(this=<unavailable>, kvID=<unavailable>, docO
       frame #9: 0x000000010a582263 memcached`MagmaMemoryTrackingProxy::WriteDocs(this=<unavailable>, kvID=<unavai
       frame #10: 0x000000010a5d22b9 memcached`MagmaKVStore::saveDocs(this=0x000000010d153b00, txnCtx=<unavailable
       frame #11: 0x000000010a5ce3f2 memcached`MagmaKVStore::commit(this=0x000000010d153b00, txnCtx=TransactionCon
       frame #12: 0x000000010a6f5e49 memcached`EPBucket::commit(this=0x0000000110783000, kvstore=0x000000010d153b0
       frame #13: 0x000000010a6f4d85 memcached`EPBucket::flushVBucket_UNLOCKED(this=0x0000000110783000, vb=<unavai
       frame #14: 0x000000010a6f3bff memcached`EPBucket::flushVBucket(this=0x0000000110783000, vbid=<unavailable>)
       frame #15: 0x000000010a6d0e94 memcached`Flusher::flushVB(this=0x0000000110e56800) at flusher.cc:293:29 [opt
       frame #16: 0x000000010a6d0a47 memcached`Flusher::step(this=0x0000000110e56800, task=0x0000000110ef4538) at 
       frame #17: 0x000000010a9048bf memcached`GlobalTask::execute(this=0x0000000110ef4538, threadName="WriterPool
       ...
       frame #36: 0x00007ff81749a4e1 libsystem_pthread.dylib`_pthread_start + 125
       frame #37: 0x00007ff817495f6b libsystem_pthread.dylib`thread_start + 15
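
As a generic illustration only (the traces above do not establish the root cause), one way a mutex waiter and a timed condvar waiter can stall each other is when the thread waiting on the condition variable still holds a lock that the would-be notifier needs. The sketch below is not magma or kv_engine code: `storeMutex`, `ready`, `StallResult` and `demonstrateStall` are hypothetical names, and timeouts are used so that the example terminates instead of hanging.

```cpp
#include <chrono>
#include <condition_variable>
#include <future>
#include <mutex>
#include <thread>

using namespace std::chrono_literals;

// Hypothetical names throughout; this is an assumption-laden sketch,
// not the locking scheme used by magma.
struct StallResult {
    bool writerTimedOut = false;   // condvar wait expired without a signal
    bool backfillTimedOut = false; // could not acquire storeMutex in time
};

StallResult demonstrateStall() {
    std::timed_mutex storeMutex; // stands in for some per-store lock
    std::mutex cvMutex;          // protects 'ready'
    std::condition_variable cv;
    bool ready = false;
    std::promise<void> writerHoldsStore;
    std::future<void> storeHeld = writerHoldsStore.get_future();
    StallResult result;

    // "Writer": takes storeMutex, then waits for 'ready' while still holding it.
    std::thread writer([&] {
        std::lock_guard<std::timed_mutex> hold(storeMutex);
        writerHoldsStore.set_value();
        std::unique_lock<std::mutex> lk(cvMutex);
        // wait_for returns false if the predicate is still false at timeout.
        result.writerTimedOut = !cv.wait_for(lk, 500ms, [&] { return ready; });
    });

    // "Backfill": would set 'ready', but needs storeMutex before it can do so.
    std::thread backfill([&] {
        storeHeld.wait(); // start only once the writer holds storeMutex
        if (!storeMutex.try_lock_for(50ms)) {
            result.backfillTimedOut = true;
            return; // never reaches the notify below
        }
        {
            std::lock_guard<std::mutex> lk(cvMutex);
            ready = true;
        }
        cv.notify_one();
        storeMutex.unlock();
    });

    writer.join();
    backfill.join();
    return result;
}
```

Calling `demonstrateStall()` from a `main` is expected to report both threads timing out. With an untimed mutex lock and a wait that is retried after each timeout, the same arrangement would hang indefinitely, which is the shape of stall the two backtraces are at least consistent with.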
      
      

      Attachments

        1. bt_all_trimmed.txt
          235 kB
        2. bt_all.txt
          488 kB
        3. logs.tgz
          25.54 MB
          People

            apaar.gupta Apaar Gupta
            jwalker Jim Walker