Couchbase Server / MB-56644

Memcached crashed in CheckpointManager::expelUnreferencedCheckpointItems() during rollback

Details

    Description

      Steps To Recreate:

      1. Create a 4-node cluster
      2. Create a magma bucket with (bucket_history_retention_seconds=600, bucket_history_retention_bytes=6000000000)
      3. Create 5000000 items (doc size = 256)
      4. Start new doc ops (update:expiry)
      5. Trigger compaction
      6. SIGKILL memcached once
      7. Observe that memcached crashed in CheckpointManager::expelUnreferencedCheckpointItems (this=0x7f6bcc52de40)

      Note:
      The actual test is a crash-recovery test: it keeps killing memcached while data loading is in progress. Between two SIGKILLs, the test waits for cluster warmup to finish and then waits a further 30 to 60 seconds before the next kill, so the total time between two SIGKILLs is warmup_time + 30 to 60 seconds. In this case, however, the crash was observed after the first kill, so memcached was killed only once.

      A core dump was found on node 172.23.121.115.

      BackTrace:

      (gdb) bt full
      #0  0x00007f6befeac8eb in raise () from /lib/x86_64-linux-gnu/libc.so.6
      No symbol table info available.
      #1  0x00007f6befe97535 in abort () from /lib/x86_64-linux-gnu/libc.so.6
      No symbol table info available.
      #2  0x00007f6bf046b63c in __gnu_cxx::__verbose_terminate_handler () at /tmp/deploy/objdir/../gcc-10.2.0/libstdc++-v3/libsupc++/vterminate.cc:95
              terminating = false
              t = <optimized out>
      #3  0x0000000000b4d71b in backtrace_terminate_handler ()
          at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/utilities/terminate_handler.cc:88
      No locals.
      #4  0x00007f6bf04768f6 in __cxxabiv1::__terminate (handler=<optimized out>)
          at /tmp/deploy/objdir/../gcc-10.2.0/libstdc++-v3/libsupc++/eh_terminate.cc:48
      No locals.
      #5  0x00007f6bf0476961 in std::terminate () at /tmp/deploy/objdir/../gcc-10.2.0/libstdc++-v3/libsupc++/eh_terminate.cc:58
      No locals.
      #6  0x00007f6bf0476bf4 in __cxxabiv1::__cxa_throw (obj=obj@entry=0x7f6b980033b0, tinfo=tinfo@entry=0xc5fdc8 <typeinfo for gsl::fail_fast>,
          dest=dest@entry=0x59b9e0 <gsl::fail_fast::~fail_fast()>) at /tmp/deploy/objdir/../gcc-10.2.0/libstdc++-v3/libsupc++/eh_throw.cc:95
              globals = <optimized out>
              header = 0x7f6b98003330
      #7  0x00000000004506c3 in gsl::detail::fail_fast_throw (
          message=0xc8e3a8 "GSL: Precondition failure: 'extractRes.getExpelCursor().getCheckpoint()->get() == checkpoint' at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/checkpoint_manager.cc:"...)
          at /home/couchbase/jenkins/workspace/couchbase-server-unix/third_party/gsl-lite/include/gsl/gsl-lite.hpp:1769
      No locals.
      #8  0x00000000004c2498 in CheckpointManager::expelUnreferencedCheckpointItems (this=0x7f6bcc52de40)
          at /opt/gcc-10.2.0/include/c++/10.2.0/bits/std_function.h:248
              lh = {_M_device = @0x7f6bcc52ded0}
              checkpoint = <optimized out>
              overheadCheck = <optimized out>
              extractRes = {
                items = {<boost::container::dtl::node_alloc_holder<MemoryTrackingAllocator<SingleThreadedRCPtr<Item, Item*, std::default_delete<Item> >, cb::NonNegativeCounter<unsigned long, cb::ClampAtZeroUnderflowPolicy> >, boost::intrusive::list_impl<boost::intrusive::bhtraits<boost::container::dtl::list_node<SingleThreadedRCPtr<Item, Item*, std::default_delete<Item> >, void*>, boost::intrusive::list_node_traits<void*>, (boost::intrusive::link_mode_type)0, boost::intrusive::dft_tag, 1>, unsigned long, true, void> >> = {<MemoryTrackingAllocator<boost::container::dtl::list_node<SingleThreadedRCPtr<Item, Item*, std::default_delete<Item> >, void*>, cb::NonNegativeCounter<unsigned long, cb::ClampAtZeroUnderflowPolicy> >> = {
                      baseAllocator = {<__gnu_cxx::new_allocator<boost::container::dtl::list_node<SingleThreadedRCPtr<Item, Item*, std::default_delete<Item> >, void*> >> = {<No data fields>}, <No data fields>},
                      bytesAllocated = {<std::__shared_ptr<cb::NonNegativeCounter<unsigned long, cb::ClampAtZeroUnderflowPolicy>, (__gnu_cxx::_Lock_policy)2>> = {<std::__shared_ptr_access<cb::NonNegativeCounter<unsigned long, cb::ClampAtZeroUnderflowPolicy>, (__gnu_cxx::_Lock_policy)2, false, false>> = {<No data fields>}, _M_ptr = 0x7f6af5f5d550, _M_refcount = {_M_pi = 0x7f6af5f5d540}}, <No data fields>}}, m_icont = {static constant_time_size = true,
                      static stateful_value_traits = <optimized out>, static has_container_from_iterator = <optimized out>,
                      static safemode_or_autounlink = false,
                      data_ = {<boost::intrusive::bhtraits<boost::container::dtl::list_node<SingleThreadedRCPtr<Item, Item*, std::default_delete<Item> >, void*>, boost::intrusive::list_node_traits<void*>, (boost::intrusive::link_mode_type)0, boost::intrusive::dft_tag, 1>> = {<boost::intrusive::bhtraits_base<boost::container::dtl::list_node<SingleThreadedRCPtr<Item, Item*, std::default_delete<Item> >, void*>, boost::intrusive::list_node<void*>*, boost::intrusive::dft_tag, 1>> = {<No data fields>}, static link_mode = boost::intrusive::normal_link},
                        root_plus_size_ = {<boost::intrusive::detail::size_holder<true, unsigned long, void>> = {
                            static constant_time_size = <optimized out>, size_ = 0}, m_header = {<boost::intrusive::list_node<void*>> = {
                              next_ = 0x7f6bce7ea140, prev_ = 0x7f6bce7ea140}, <No data fields>}}}}}, <No data fields>}, manager = 0x7f6bcc52de40,
                expelCursor = {<std::__shared_ptr<CheckpointCursor, (__gnu_cxx::_Lock_policy)2>> = {<std::__shared_ptr_access<CheckpointCursor, (__gnu_cxx::_Lock_policy)2, false, false>> = {<No data fields>}, _M_ptr = 0x7f6b84531930, _M_refcount = {_M_pi = 0x7f6b84531920}}, <No data fields>},
                checkpoint = 0x7f6b224c0400}
              numItemsExpelled = 7326
              queuedItemsMemReleased = 1093838
              estimatedMemRecovered = <optimized out>
      #9  0x00000000007e604c in CheckpointMemRecoveryTask::attemptItemExpelling (this=<optimized out>)
          at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/checkpoint_remover.cc:123
              vbid = {vbid = 514}
              vb = {<std::__shared_ptr<VBucket, (__gnu_cxx::_Lock_policy)2>> = {<std::__shared_ptr_access<VBucket, (__gnu_cxx::_Lock_policy)2, false, false>> = {<No data fields>}, _M_ptr = 0x7f6b72dd6f00, _M_refcount = {_M_pi = 0x7f6b72fccca0}}, <No data fields>}
              expelResult = <optimized out>
              it = <error reading variable>
              __for_range = @0x7f6bce7ea230: {<std::_Vector_base<std::pair<Vbid, unsigned long>, std::allocator<std::pair<Vbid, unsigned long> > >> = {
                  _M_impl = {<std::allocator<std::pair<Vbid, unsigned long> >> = {<__gnu_cxx::new_allocator<std::pair<Vbid, unsigned long> >> = {<No data fields>}, <No data fields>}, <std::_Vector_base<std::pair<Vbid, unsigned long>, std::allocator<std::pair<Vbid, unsigned long> > >::_Vector_impl_data> = {
                      _M_start = 0x7f6af5ec9000, _M_finish = 0x7f6af5ecb000, _M_end_of_storage = 0x7f6af5ecb000}, <No data fields>}}, <No data fields>}
              __for_begin = <optimized out>
              __for_end = <optimized out>
       
              bucket = <error reading variable>
              vbuckets = {<std::_Vector_base<std::pair<Vbid, unsigned long>, std::allocator<std::pair<Vbid, unsigned long> > >> = {
                  _M_impl = {<std::allocator<std::pair<Vbid, unsigned long> >> = {<__gnu_cxx::new_allocator<std::pair<Vbid, unsigned long> >> = {<No data fields>}, <No data fields>}, <std::_Vector_base<std::pair<Vbid, unsigned long>, std::allocator<std::pair<Vbid, unsigned long> > >::_Vector_impl_data> = {
                      _M_start = 0x7f6af5ec9000, _M_finish = 0x7f6af5ecb000, _M_end_of_storage = 0x7f6af5ecb000}, <No data fields>}}, <No data fields>}
      #10 0x00000000007e6e18 in CheckpointMemRecoveryTask::runInner (this=0x7f6b75c1f3d0)
          at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/checkpoint_remover.cc:265
              phosphor_internal_category_enabled_205 = {_M_b = {_M_p = 0x0}, static is_always_lock_free = <optimized out>}
              phosphor_internal_category_enabled_temp_205 = <optimized out>
              phosphor_internal_tpi_205 = {category = 0x0, name = 0x0, type = phosphor::TraceEventType::AsyncStart, argument_names = {_M_elems = {0x0,
                    0x0}}, argument_types = {_M_elems = {phosphor::TraceArgumentType::is_bool, phosphor::TraceArgumentType::is_bool}}}
              phosphor_internal_guard_205 = {tpi = 0x1081a80 <CheckpointMemRecoveryTask::runInner()::phosphor_internal_tpi_205>, enabled = true,
                arg1 = {<No data fields>}, arg2 = {<No data fields>}, start = {__d = {__r = 3155132988588011}}}
              bucket = <error reading variable>
              wasAboveBackfillThreshold = false
              onReturn = <optimized out>
              bytesToFree = 302308378
      #11 0x0000000000abbd79 in GlobalTask::execute (this=0x7f6b75c1f3d0, threadName=...)
          at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/executor/globaltask.cc:98
              guard = {previous = 0x0}
              executedAt = <optimized out>
              scheduleOverhead = <optimized out>
              start = <optimized out>
              runAgain = <optimized out>
              end = <optimized out>
              runtime = <optimized out>
      #12 0x0000000000ab543a in FollyExecutorPool::TaskProxy::scheduleViaCPUPool()::{lambda()#2}::operator()() const (__closure=0x7f6bce7ea630)
          at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/executor/folly_executorpool.cc:309
              runAgain = <optimized out>
              proxy = <error reading variable>
      #13 0x0000000000abd12e in folly::detail::function::FunctionTraits<void ()>::operator()() (this=0x7f6bce7ea630)
          at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/executor/cancellable_cpu_executor.cc:42
              fn = @0x7f6bce7ea630: {<folly::detail::function::FunctionTraits<void()>> = {<No data fields>}, data_ = {big = 0x7f6b774e3950, tiny = {
                    __data = "P9Nwk\177\000\000 \247~\316k\177\000\000\000\000\000\000\000\000\000\000@\346\326\356k\177\000\000\001\000\000\000\000\000\000\000\000\035\024\002\000\000\000", __align = {<No data fields>}}},
                call_ = 0xab5970 <folly::detail::function::FunctionTraits<void ()>::callSmall<FollyExecutorPool::TaskProxy::scheduleViaCPUPool()::{lambda()#2}>(folly::detail::function::Data&)>,
                exec_ = 0xab3e60 <folly::detail::function::execSmall<FollyExecutorPool::TaskProxy::scheduleViaCPUPool()::{lambda()#2}>(folly::detail::function::Op, folly::detail::function::Data*, folly::detail::function::Data)>}
      #14 operator() (__closure=<optimized out>)
          at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/executor/cancellable_cpu_executor.cc:42
              task = {storage_ = {{emptyState = -48 '\320', value = {task = 0x7f6b75c1f3d0,
                      func = {<folly::detail::function::FunctionTraits<void()>> = {<No data fields>}, data_ = {big = 0x7f6b774e3950, tiny = {
                            __data = "P9Nwk\177\000\000 \247~\316k\177\000\000\000\000\000\000\000\000\000\000@\346\326\356k\177\000\000\001\000\000\000\000\000\000\000\000\035\024\002\000\000\000", __align = {<No data fields>}}},
                        call_ = 0xab5970 <folly::detail::function::FunctionTraits<void ()>::callSmall<FollyExecutorPool::TaskProxy::scheduleViaCPUPool()::{lambda()#2}>(folly::detail::function::Data&)>,
                        exec_ = 0xab3e60 <folly::detail::function::execSmall<FollyExecutorPool::TaskProxy::scheduleViaCPUPool()::{lambda()#2}>(folly::detail::function::Op, folly::detail::function::Data*, folly::detail::function::Data)>}}}, hasValue = true}}
              this = <optimized out>
      #15 0x0000000000c1b240 in folly::detail::function::FunctionTraits<void ()>::operator()() (this=0x7f6bce7ea820)
          at /home/couchbase/jenkins/cbdeps-ws/deps/packages/build/folly/folly-prefix/src/folly/folly/Function.h:416
              fn = @0x7f6bce7ea820: {<folly::detail::function::FunctionTraits<void()>> = {<No data fields>}, data_ = {big = 0x7f6beed0a800, tiny = {
                    __data = "\000\250\320\356k\177\000\000\320\367.\362k\177\000\000\060\000\000\000\000\000\000\000\301\223\000\000\000\000\000\000H\000\000\000\000\000\000\000\360\250~\316k\177\000", __align = {<No data fields>}}},
                call_ = 0xabd4b0 <folly::detail::function::FunctionTraits<void()>::callSmall<CancellableCPUExecutor::add(GlobalTask*, folly::Func)::<lambda()> >(folly::detail::function::Data &)>,
                exec_ = 0xabca60 <folly::detail::function::execSmall<CancellableCPUExecutor::add(GlobalTask*, folly::Func)::<lambda()> >(folly::detail::function::Op, folly::detail::function::Data *, folly::detail::function::Data *)>}
              fn = <optimized out>
       
      #16 folly::ThreadPoolExecutor::runTask (this=this@entry=0x7f6beed0a900, thread=..., task=...)
          at /home/couchbase/jenkins/cbdeps-ws/deps/packages/build/folly/folly-prefix/src/folly/folly/executors/ThreadPoolExecutor.cpp:97
              rctx = {
                prev_ = {<std::__shared_ptr<folly::RequestContext, (__gnu_cxx::_Lock_policy)2>> = {<std::__shared_ptr_access<folly::RequestContext, (__gnu_cxx::_Lock_policy)2, false, false>> = {<No data fields>}, _M_ptr = 0x0, _M_refcount = {_M_pi = 0x0}}, <No data fields>}}
              startTime = {__d = {__r = 3155132988580722}}
              stats = {expired = false, waitTime = {__r = 10766854}, runTime = {__r = 0}, enqueueTime = {__d = {__r = 3155132977813868}}, requestId = 0}
      #17 0x0000000000c05cda in folly::CPUThreadPoolExecutor::threadRun (this=0x7f6beed0a900, thread=...)
          at /home/couchbase/jenkins/cbdeps-ws/deps/packages/build/folly/folly-prefix/src/folly/folly/executors/CPUThreadPoolExecutor.cpp:265
              task = {storage_ = {{emptyState = 0 '\000', value = {<folly::ThreadPoolExecutor::Task> = {
                        func_ = {<folly::detail::function::FunctionTraits<void()>> = {<No data fields>}, data_ = {big = 0x7f6beed0a800, tiny = {
                              __data = "\000\250\320\356k\177\000\000\320\367.\362k\177\000\000\060\000\000\000\000\000\000\000\301\223\000\000\000\000\000\000H\000\000\000\000\000\000\000\360\250~\316k\177\000", __align = {<No data fields>}}},
                          call_ = 0xabd4b0 <folly::detail::function::FunctionTraits<void()>::callSmall<CancellableCPUExecutor::add(GlobalTask*, folly::Func)::<lambda()> >(folly::detail::function::Data &)>,
                          exec_ = 0xabca60 <folly::detail::function::execSmall<CancellableCPUExecutor::add(GlobalTask*, folly::Func)::<lambda()> >(folly::detail::function::Op, folly::detail::function::Data *, folly::detail::function::Data *)>}, enqueueTime_ = {__d = {__r = 3155132977813868}},
                        expiration_ = {__r = 0}, expireCallback_ = {<folly::detail::function::FunctionTraits<void()>> = {<No data fields>}, data_ = {
                            big = 0x93c1, tiny = {
                              __data = "\301\223\000\000\000\000\000\000K\301\246", '\000' <repeats 13 times>, "_>\016\362k\177\000\000p\332\376\316k\177\000\000@\326.\362k\177\000", __align = {<No data fields>}}}, call_ = 0x466c57 <folly::detail::function::FunctionTraits<void ()>::uninitCall(folly::detail::function::Data&)>, exec_ = 0x0},
                        context_ = {<std::__shared_ptr<folly::RequestContext, (__gnu_cxx::_Lock_policy)2>> = {<std::__shared_ptr_access<folly::RequestContext, (__gnu_cxx::_Lock_policy)2, false, false>> = {<No data fields>}, _M_ptr = 0x0, _M_refcount = {_M_pi = 0x0}}, <No data fields>}}, poison = false,
                      priority_ = 0 '\000', queueObserverPayload_ = 140101544187152}}, hasValue = true}}
              guard = {list_ = {forbid = true, prev = 0x0, curr = {name = {static npos = <optimized out>, b_ = 0xce1613 "CPUThreadPoolExecutor", e_ = 0xce1628 ""}}}}
      #18 0x0000000000c1e1f9 in std::__invoke_impl<void, void (folly::ThreadPoolExecutor::*&)(std::shared_ptr<folly::ThreadPoolExecutor::Thread>), folly::ThreadPoolExecutor*&, std::shared_ptr<folly::ThreadPoolExecutor::Thread>&> (__t=<optimized out>, __f=<optimized out>)
          at /usr/local/include/c++/7.3.0/bits/invoke.h:73
      No locals.
      #19 std::__invoke<void (folly::ThreadPoolExecutor::*&)(std::shared_ptr<folly::ThreadPoolExecutor::Thread>), folly::ThreadPoolExecutor*&, std::shared_ptr<folly::ThreadPoolExecutor::Thread>&> (__fn=<optimized out>) at /usr/local/include/c++/7.3.0/bits/invoke.h:95
      No locals.
      #20 std::_Bind<void (folly::ThreadPoolExecutor::*(folly::ThreadPoolExecutor*, std::shared_ptr<folly::ThreadPoolExecutor::Thread>))(std::shared_ptr<folly::ThreadPoolExecutor::Thread>)>::__call<void, , 0ul, 1ul>(std::tuple<>&&, std::_Index_tuple<0ul, 1ul>) (__args=..., this=<optimized out>)
          at /usr/local/include/c++/7.3.0/functional:467
      No locals.
      #21 std::_Bind<void (folly::ThreadPoolExecutor::*(folly::ThreadPoolExecutor*, std::shared_ptr<folly::ThreadPoolExecutor::Thread>))(std::shared_ptr<folly::ThreadPoolExecutor::Thread>)>::operator()<, void>() (this=<optimized out>) at /usr/local/include/c++/7.3.0/functional:551
      No locals.
      #22 folly::detail::function::FunctionTraits<void ()>::callBig<std::_Bind<void (folly::ThreadPoolExecutor::*(folly::ThreadPoolExecutor*, std::shared_ptr<folly::ThreadPoolExecutor::Thread>))(std::shared_ptr<folly::ThreadPoolExecutor::Thread>)> >(folly::detail::function::Data&) (p=...)
          at /home/couchbase/jenkins/cbdeps-ws/deps/packages/build/folly/folly-prefix/src/folly/folly/Function.h:401
              fn = <optimized out>
      #23 0x0000000000ab5134 in folly::detail::function::FunctionTraits<void ()>::operator()() (this=0x7f6beecd3c80)
          at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/executor/folly_executorpool.cc:49
              fn = <error reading variable>
      #24 CBRegisteredThreadFactory::newThread(folly::Function<void ()>&&)::{lambda()#1}::operator()() (__closure=0x7f6beecd3c80)
          at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/executor/folly_executorpool.cc:49
              threadNameOpt = {storage_ = {{emptyState = -128 '\200', value = {static npos = 18446744073709551615,
                      _M_dataplus = {<std::allocator<char>> = {<__gnu_cxx::new_allocator<char>> = {<No data fields>}, <No data fields>},
                        _M_p = 0x7f6bce7ea980 "NonIoPool1"}, _M_string_length = 10, {_M_local_buf = "NonIoPool1\000\000\000\000\000",
                        _M_allocated_capacity = 8029725099528449870}}}, hasValue = true}}
       
              func = <error reading variable func (Cannot access memory at address 0x7f6beecd3c80)>
              func = <optimized out>
              threadNameOpt = <optimized out>
      #25 folly::detail::function::FunctionTraits<void ()>::callBig<CBRegisteredThreadFactory::newThread(folly::Function<void ()>&&)::{lambda()#1}>(folly::detail::function::Data&) (p=...)
          at /home/couchbase/jenkins/workspace/couchbase-server-unix/server_build/tlm/deps/folly.exploded/include/folly/Function.h:401
              fn = <error reading variable>
      #26 0x00007f6bf049fd40 in std::execute_native_thread_routine (__p=0x7f6beec293c0)
          at /tmp/deploy/objdir/../gcc-10.2.0/libstdc++-v3/src/c++11/thread.cc:80
              __t = <optimized out>
      #27 0x00007f6bf20a5fa3 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
      No symbol table info available.
      #28 0x00007f6beff6e06f in clone () from /lib/x86_64-linux-gnu/libc.so.6
      No symbol table info available.
      

      QE-TEST:

      guides/gradlew --refresh-dependencies testrunner -P jython=/opt/jython/bin/jython -P 'args=-i /tmp/testexec.86484.ini bucket_storage=magma,rerun=false,GROUP=P0;kill,randomize_value=true,doc_size=256,bucket_eviction_policy=fullEviction,replicas=1,nodes_init=4,enable_dp=false,collect_pcaps=True,get-cbcollect-info=True,autoCompactionDefined=true,bucket_history_retention_seconds=600,bucket_history_retention_bytes=6000000000,upgrade_version=7.2.0-5318 -t storage.magma.magma_compaction.MagmaCompactionTests.test_crash_during_compaction,num_items=30000000,doc_size=256,graceful=False,doc_ops=update:expiry,replicas=1,GROUP=P0;kill'
      

      Job: http://qe-jenkins1.sc.couchbase.com/job/test_suite_executor-TAF/24359/consoleFull

       

      Issue Resolution
      In rare cases, after a failover or memcached restart, a replica rollback while under memory pressure might have caused a crash in the Data Service. Memory pressure recovery logic (Item expelling) is now skipped when replica rollback is in progress.
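
      The resolution above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual kv_engine change: the types VBucketSketch and ExpelResultSketch, the flag rollbackInProgress, and the function attemptItemExpellingSketch are all hypothetical names invented for this example. It only shows the idea of skipping item expelling for a vBucket while a rollback is in progress.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the per-vBucket state touched by item expelling.
// None of these names come from kv_engine.
struct VBucketSketch {
    // Assumed flag: set while a replica rollback is running on this vBucket.
    std::atomic<bool> rollbackInProgress{false};
    // Number of checkpoint items that could be expelled to recover memory.
    std::size_t expellableItems = 0;
};

// Hypothetical result type: how many items were expelled, and whether the
// attempt was skipped because a rollback was running.
struct ExpelResultSketch {
    std::size_t count = 0;
    bool skipped = false;
};

// Expel items only when no rollback is in progress on the vBucket.
// A real implementation would also have to make the check and the expel
// atomic with respect to rollback (for example under a shared lock); this
// sketch only models the skip decision.
inline ExpelResultSketch attemptItemExpellingSketch(VBucketSketch& vb) {
    ExpelResultSketch res;
    if (vb.rollbackInProgress.load()) {
        res.skipped = true; // leave the checkpoints untouched during rollback
        return res;
    }
    res.count = vb.expellableItems;
    vb.expellableItems = 0;
    return res;
}
```

      With rollbackInProgress set, the call returns skipped == true and leaves expellableItems unchanged; once the flag is cleared, the same call expels the items as usual.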

          People

            ankush.sharma Ankush Sharma