Details
Bug
Resolution: Fixed
Major
None
None
None
Untriaged
0
Unknown
Description
The issue occurs in a development sandbox and requires uncommitted changes.
The bug was noted during a 2->3 node rebalance with history retention enabled. The rebalance made some progress but ultimately stopped. At least one node appeared to be deadlocked, with writer (KV flush) and AuxIO (DCP backfill) threads possibly blocked.
This was seen on cluster_run (macOS) with the following changes:
- Patch 19 -> https://review.couchbase.org/c/magma/+/184395/19
- The kv_engine changes are not yet in Gerrit, but this commit (and its siblings) made up the KV side: https://github.com/jimwwalker/kv_engine/commit/9ba68fb2acb96e34bb7f512eba1edfee9486a2ca
I've captured the lldb backtrace and have both a complete backtrace and a trimmed one, which hopefully shows the interesting paths.
For example, the following AuxIO thread is blocked in a mutex lock and the writer thread is waiting on a condition variable (check the uploaded files for the full trace output):
thread #38, name = 'AuxIoPool2'
frame #0: 0x00007ff81745fbd2 libsystem_kernel.dylib`__psynch_mutexwait + 10
frame #1: 0x00007ff817497e7e libsystem_pthread.dylib`_pthread_mutex_firstfit_lock_wait + 76
frame #2: 0x00007ff817495cbb libsystem_pthread.dylib`_pthread_mutex_firstfit_lock_slow + 205
frame #3: 0x00007ff8173fa719 libc++.1.dylib`std::__1::mutex::lock() + 9
frame #4: 0x000000010a891081 memcached`magma::WALOffset::String() [inlined] std::__1::basic_string<char, std:
frame #5: 0x000000010a89225a memcached`magma::WALOffset::String(this=<unavailable>) at wal_offset.cc:25 [opt]
frame #6: 0x000000010a80631a memcached`magma::Magma::Impl::syncKVStore(this=<unavailable>, kvID=<unavailable>
frame #7: 0x000000010a806122 memcached`magma::Magma::Impl::SyncKVStore(this=<unavailable>, kvID=<unavailable>
frame #8: 0x000000010a806451 memcached`magma::Magma::SyncKVStore(this=<unavailable>, kvID=<unavailable>) at d
frame #9: 0x000000010a58206f memcached`MagmaMemoryTrackingProxy::SyncKVStore(this=0x0000000110c33058, kvID=<u
frame #10: 0x000000010a5db4b9 memcached`MagmaKVStore::makeFileHandle(this=0x0000000110c0d400, vbid=(vbid = 9)
frame #11: 0x000000010a5d2a0f memcached`MagmaKVStore::initBySeqnoScanContext(this=0x0000000110c0d400, cb=Stat
frame #12: 0x000000010a791eaa memcached`DCPBackfillBySeqnoDisk::create(this=0x0000000110ecbec0) at backfill_b
frame #13: 0x000000010a794ac2 memcached`DCPBackfillDisk::run(this=0x0000000110ecbec0) at backfill_disk.cc:151
frame #14: 0x000000010a795a9f memcached`BackfillManager::backfill(this=0x00000001110c8998) at backfill-manage
frame #15: 0x000000010a795609 memcached`BackfillManagerTask::run(this=0x000000010b934738) at backfill-manager
frame #16: 0x000000010a9048bf memcached`GlobalTask::execute(this=0x000000010b934738, threadName="AuxIoPool2")
...
frame #35: 0x00007ff81749a4e1 libsystem_pthread.dylib`_pthread_start + 125
frame #36: 0x00007ff817495f6b libsystem_pthread.dylib`thread_start + 15

thread #32, name = 'WriterPool0'
frame #0: 0x00007ff8174603ea libsystem_kernel.dylib`__psynch_cvwait + 10
frame #1: 0x00007ff81749aa6f libsystem_pthread.dylib`_pthread_cond_wait + 1249
frame #2: 0x00007ff8173f8d93 libc++.1.dylib`std::__1::condition_variable::__do_timed_wait(std::__1::unique_
frame #3: 0x000000010a87d433 memcached`magma::WALOffset::String() [inlined] std::__1::basic_stringbuf<char,
frame #4: 0x000000010a887c1e memcached`magma::WALOffset::String() [inlined] std::__1::basic_stringbuf<char,
frame #5: 0x000000010a890658 memcached`magma::WALOffset::String() [inlined] std::__1::basic_ios<char, std::
frame #6: 0x000000010a896575 memcached`nlohmann::basic_json<std::__1::map, std::__1::vector, std::__1::basi
frame #7: 0x000000010a8238ef memcached`magma::Magma::Impl::WriteDocs(this=<unavailable>, kvID=<unavailable>
frame #8: 0x000000010a823c2e memcached`magma::Magma::WriteDocs(this=<unavailable>, kvID=<unavailable>, docO
frame #9: 0x000000010a582263 memcached`MagmaMemoryTrackingProxy::WriteDocs(this=<unavailable>, kvID=<unavai
frame #10: 0x000000010a5d22b9 memcached`MagmaKVStore::saveDocs(this=0x000000010d153b00, txnCtx=<unavailable
frame #11: 0x000000010a5ce3f2 memcached`MagmaKVStore::commit(this=0x000000010d153b00, txnCtx=TransactionCon
frame #12: 0x000000010a6f5e49 memcached`EPBucket::commit(this=0x0000000110783000, kvstore=0x000000010d153b0
frame #13: 0x000000010a6f4d85 memcached`EPBucket::flushVBucket_UNLOCKED(this=0x0000000110783000, vb=<unavai
frame #14: 0x000000010a6f3bff memcached`EPBucket::flushVBucket(this=0x0000000110783000, vbid=<unavailable>)
frame #15: 0x000000010a6d0e94 memcached`Flusher::flushVB(this=0x0000000110e56800) at flusher.cc:293:29 [opt
frame #16: 0x000000010a6d0a47 memcached`Flusher::step(this=0x0000000110e56800, task=0x0000000110ef4538) at
frame #17: 0x000000010a9048bf memcached`GlobalTask::execute(this=0x0000000110ef4538, threadName="WriterPool
...
frame #36: 0x00007ff81749a4e1 libsystem_pthread.dylib`_pthread_start + 125
frame #37: 0x00007ff817495f6b libsystem_pthread.dylib`thread_start + 15
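When going through the complete backtrace, it can help to list only the threads whose top frame is a mutex or condition-variable wait, as in the two stacks above. The following is a minimal sketch of such a filter, not part of the original report: the function name `blocked_threads` and the `WAIT_SYMBOLS` mapping are hypothetical, and the regular expressions assume the lldb text layout seen in this ticket.

```python
import re

# Top-of-stack symbols seen in the two stacks above. Treating them as
# "mutex" / "condvar" waits is an assumption of this sketch.
WAIT_SYMBOLS = {
    "__psynch_mutexwait": "mutex",
    "__psynch_cvwait": "condvar",
}

# Layout assumed from the trace in this ticket, e.g.
#   thread #38, name = 'AuxIoPool2'
#   frame #0: 0x00007ff81745fbd2 libsystem_kernel.dylib`__psynch_mutexwait + 10
FRAME_RE = re.compile(r"frame #(\d+): 0x[0-9a-fA-F]+ ([^`\s]+)`(.*)")
THREAD_RE = re.compile(r"thread #(\d+)")
NAME_RE = re.compile(r"name = '([^']*)'")


def blocked_threads(trace):
    """Map each thread whose frame #0 is a known wait symbol to the wait kind."""
    result = {}
    current = None
    for line in trace.splitlines():
        frame = FRAME_RE.search(line)
        if frame:
            # Only the innermost frame says what the thread is blocked on.
            if current is not None and frame.group(1) == "0":
                symbol = frame.group(3).split(" + ")[0].strip()
                kind = WAIT_SYMBOLS.get(symbol)
                if kind is not None:
                    result[current] = kind
            continue
        thread = THREAD_RE.search(line)
        if thread:
            # Fall back to the thread number when the header carries no name.
            name = NAME_RE.search(line)
            current = name.group(1) if name else "thread #" + thread.group(1)
    return result
```

Passing the text of the two stacks above to `blocked_threads` should report `AuxIoPool2` as a `mutex` wait and `WriterPool0` as a `condvar` wait. The filter only shows which threads are parked on which kind of primitive; it does not establish which lock or condition each one is waiting for.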