Couchbase Server / MB-51608

Memcached crashes in 20 bucket throughput test due to exception ThreadLocalData::getTCacheID: tcache.create failed rv:14


Details

    Description

      Several crashes are observed in a magma 20 bucket throughput test. A 10 bucket variant of the test doesn't crash.

      The test creates 20 buckets, loads 50M docs into each, then overwrites 50M docs, and finally runs an access phase of 50:50 reads and writes. The crashes are observed during the access phase.

      je_mallctl("tcache.create",..) is returning 14 (EFAULT).
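
      For reference, the failing call has this general shape (a minimal sketch of the jemalloc mallctl API, not the actual platform code in je_arena_malloc.cc):

      #include <jemalloc/jemalloc.h>
      #include <stdexcept>
      #include <string>

      // Ask jemalloc for a new explicit thread cache and return its id.
      unsigned createTCache() {
          unsigned tcacheId;
          size_t sz = sizeof(tcacheId);
          int rv = je_mallctl("tcache.create", &tcacheId, &sz, nullptr, 0);
          if (rv != 0) {
              // jemalloc reuses errno values for its own failures; rv == 14
              // (EFAULT) here does not indicate a kernel fault.
              throw std::runtime_error(
                      "ThreadLocalData::getTCacheID: tcache.create failed rv:" +
                      std::to_string(rv));
          }
          return tcacheId;
      }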

      Test run:
      http://perf.jenkins.couchbase.com/job/rhea-dev2/130/console

      BackTrace:

      (gdb) bt
      #0  0x00007fe73722b3d7 in raise () from /lib64/libc.so.6
      #1  0x00007fe73722cac8 in abort () from /lib64/libc.so.6
      #2  0x00007fe737b7663c in __gnu_cxx::__verbose_terminate_handler () at /tmp/deploy/objdir/../gcc-10.2.0/libstdc++-v3/libsupc++/vterminate.cc:95
      #3  0x0000000000b3255b in backtrace_terminate_handler() () at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/utilities/terminate_handler.cc:88
      #4  0x00007fe737b818f6 in __cxxabiv1::__terminate(void (*)()) () at /tmp/deploy/objdir/../gcc-10.2.0/libstdc++-v3/libsupc++/eh_terminate.cc:48
      #5  0x00007fe737b81961 in std::terminate () at /tmp/deploy/objdir/../gcc-10.2.0/libstdc++-v3/libsupc++/eh_terminate.cc:58
      #6  0x00007fe737b81bf4 in __cxxabiv1::__cxa_throw (obj=<optimized out>, tinfo=0x107f4e0 <typeinfo for std::runtime_error>, dest=0x444ef0 <_ZNSt13runtime_errorD1Ev@plt>)
          at /tmp/deploy/objdir/../gcc-10.2.0/libstdc++-v3/libsupc++/eh_throw.cc:95
      #7  0x000000000053c64f in cb::ThreadLocalData::getTCacheID (this=0x7fe4aefedac0, client=...) at /home/couchbase/jenkins/workspace/couchbase-server-unix/platform/src/je_arena_malloc.cc:373
      #8  0x0000000000bab3cb in cb::_JEArenaMalloc<cb::JEArenaCoreLocalTracker>::switchToClient(cb::ArenaMallocClient const&, cb::MemoryDomain, bool) ()
          at /home/couchbase/jenkins/workspace/couchbase-server-unix/platform/src/je_arena_malloc.cc:225
      #9  0x000000000074f2b0 in switchToClient (tcache=true, domain=cb::Primary, client=...) at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/ep_engine.h:855
      #10 switchToEngine (domain=cb::Primary, want_old_thread_local=true, engine=<optimized out>) at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/objectregistry.cc:123
      #11 onSwitchThread (domain=cb::Primary, want_old_thread_local=true, engine=<optimized out>) at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/objectregistry.cc:136
      #12 BucketAllocationGuard::BucketAllocationGuard (this=0x7fe4aefeb3a0, engine=<optimized out>) at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/objectregistry.cc:148
      #13 0x0000000000aa0f85 in GlobalTask::execute(std::basic_string_view<char, std::char_traits<char> >) () at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/executor/globaltask.cc:72
      #14 0x0000000000a9a6aa in FollyExecutorPool::TaskProxy::scheduleViaCPUPool()::{lambda()#2}::operator()() const (__closure=0x7fe4aefeb650)
          at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/executor/folly_executorpool.cc:309
      #15 0x0000000000aa238e in operator() (this=0x7fe4aefeb650) at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/executor/cancellable_cpu_executor.cc:42
      #16 CancellableCPUExecutor::add(GlobalTask*, folly::Function<void ()>)::{lambda()#1}::operator()() const () at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/executor/cancellable_cpu_executor.cc:42
      #17 0x0000000000bf9a90 in operator() (this=0x7fe4aefeb840) at /home/couchbase/jenkins/workspace/cbdeps-platform-build-old/deps/packages/build/folly/folly-prefix/src/folly/folly/Function.h:416
      #18 folly::ThreadPoolExecutor::runTask(std::shared_ptr<folly::ThreadPoolExecutor::Thread> const&, folly::ThreadPoolExecutor::Task&&) (this=this@entry=0x7fe734ddf900, thread=...,
          task=task@entry=<unknown type in /usr/lib/debug/opt/couchbase/bin/memcached-7.1.0-2506.x86_64.debug, CU 0xa40c0ed, DIE 0xa490022>)
          at /home/couchbase/jenkins/workspace/cbdeps-platform-build-old/deps/packages/build/folly/folly-prefix/src/folly/folly/executors/ThreadPoolExecutor.cpp:97
      #19 0x0000000000be452a in folly::CPUThreadPoolExecutor::threadRun (this=0x7fe734ddf900, thread=...)
          at /home/couchbase/jenkins/workspace/cbdeps-platform-build-old/deps/packages/build/folly/folly-prefix/src/folly/folly/executors/CPUThreadPoolExecutor.cpp:265
      #20 0x0000000000bfca49 in __invoke_impl<void, void (folly::ThreadPoolExecutor::*&)(std::shared_ptr<folly::ThreadPoolExecutor::Thread>), folly::ThreadPoolExecutor*&, std::shared_ptr<folly::ThreadPoolExecutor::Thread>&> (
          __t=<optimized out>, __f=<optimized out>) at /usr/local/include/c++/7.3.0/bits/invoke.h:73
      #21 __invoke<void (folly::ThreadPoolExecutor::*&)(std::shared_ptr<folly::ThreadPoolExecutor::Thread>), folly::ThreadPoolExecutor*&, std::shared_ptr<folly::ThreadPoolExecutor::Thread>&> (__fn=<optimized out>)
          at /usr/local/include/c++/7.3.0/bits/invoke.h:95
      #22 __call<void, 0, 1> (__args=<optimized out>, this=<optimized out>) at /usr/local/include/c++/7.3.0/functional:467
      #23 operator()<> (this=<optimized out>) at /usr/local/include/c++/7.3.0/functional:551
      #24 folly::detail::function::FunctionTraits<void ()>::callBig<std::_Bind<void (folly::ThreadPoolExecutor::*(folly::ThreadPoolExecutor*, std::shared_ptr<folly::ThreadPoolExecutor::Thread>))(std::shared_ptr<folly::ThreadPoolExecutor::Thread>)> >(folly::detail::function::Data&) (p=...) at /home/couchbase/jenkins/workspace/cbdeps-platform-build-old/deps/packages/build/folly/folly-prefix/src/folly/folly/Function.h:401
      #25 0x0000000000a9a3a4 in operator() (this=0x7fe730144b80) at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/executor/folly_executorpool.cc:49
      #26 operator() (__closure=0x7fe730144b80) at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/executor/folly_executorpool.cc:49
      #27 folly::detail::function::FunctionTraits<void ()>::callBig<CBRegisteredThreadFactory::newThread(folly::Function<void ()>&&)::{lambda()#1}>(folly::detail::function::Data&) (p=...)
          at /home/couchbase/jenkins/workspace/couchbase-server-unix/server_build/tlm/deps/folly.exploded/include/folly/Function.h:401
      #28 0x00007fe737baad40 in execute_native_thread_routine () at /tmp/deploy/objdir/../gcc-10.2.0/libstdc++-v3/src/c++11/thread.cc:80
      #29 0x00007fe7399b2ea5 in start_thread () from /lib64/libpthread.so.0
      #30 0x00007fe7372f39fd in clone () from /lib64/libc.so.6
      


        Activity

          sarath Sarath Lakshman created issue -
          sarath Sarath Lakshman made changes -
          Environment: 7.1.0-2506
          owend Daniel Owen made changes -
          Assignee Daniel Owen [ owend ] Dave Rigby [ drigby ]
          sarath Sarath Lakshman made changes -
          Is this a Regression? Unknown [ 10452 ] No [ 10451 ]
          owend Daniel Owen made changes -
          Priority Major [ 3 ] Critical [ 2 ]
          owend Daniel Owen added a comment -

          From a Google search, it looks like 14 (EFAULT) means

          "An invalid user space address was specified for an argument."

          jwalker Jim Walker added a comment - - edited

          Note that 14 (EFAULT) is not a system call failing; it is just jemalloc reusing the system error codes for its own errors.

          In this case it looks very much like we've run out of thread caches. This has occurred because in this test (from a Slack discussion) 164 threads were created, and we have 20 buckets.

          • Each bucket requires its own thread cache for each thread, so 20 * 164 thread caches will be required.
          • It's not clear what the current limit is, but jemalloc itself looks like it has a hard maximum of 4093 (see the sketch after this list).
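
          A quick back-of-the-envelope check of those numbers (my sketch; the constants come from this ticket and from jemalloc's headers):

          #include <cstdio>

          int main() {
              constexpr unsigned buckets = 20;
              constexpr unsigned threads = 164;              // threads created in this test
              constexpr unsigned tcacheMax = (1u << 12) - 3; // jemalloc MALLOCX_TCACHE_MAX
              std::printf("%u * %u = %u tcaches needed (hard max %u)\n",
                          buckets, threads, buckets * threads, tcacheMax);
              // Prints: 20 * 164 = 3280 tcaches needed (hard max 4093)
              return 0;
          }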
          drigby Dave Rigby added a comment - - edited

          jemalloc only supports up to 4093 explicitly-created tcaches. This is not explicitly documented in the man page (that I can see), but poking in the source we can see that the tcaches_create_prep function checks that the index of the created tcache does not exceed MALLOCX_TCACHE_MAX:

          static bool
          tcaches_create_prep(tsd_t *tsd) {
          	bool err;
           
          	malloc_mutex_lock(tsd_tsdn(tsd), &tcaches_mtx);
           
          	if (tcaches == NULL) {
          		tcaches = base_alloc(tsd_tsdn(tsd), b0get(), sizeof(tcache_t *)
          		    * (MALLOCX_TCACHE_MAX+1), CACHELINE);
          		if (tcaches == NULL) {
          			err = true;
          			goto label_return;
          		}
          	}
           
          	if (tcaches_avail == NULL && tcaches_past > MALLOCX_TCACHE_MAX) {
          		err = true;
          		goto label_return;
          	}
          

          MALLOCX_TCACHE_MAX is defined as:

          #define MALLOCX_TCACHE_BITS	12
          ...
          #define MALLOCX_TCACHE_MAX	((1 << MALLOCX_TCACHE_BITS) - 3)
          

          So 12 bits of tcache ID minus a couple of reserved IDs gives 4093 possible tcaches.

          Hypothesis is that we are exceeding this limit, given that we create a tcache per Bucket per thread which allocates / deallocates memory for that bucket. In this case there are 20 buckets, and the core dump shows 164 threads - currently 3280 tcaches. Not sure if there are also memory limits we are hitting, but we certainly appear to be within the ballpark of the maximum...

          drigby Dave Rigby added a comment -

          Assigning to Jim Walker as he's the most familiar with our tcache usage.

          drigby Dave Rigby made changes -
          Assignee Dave Rigby [ drigby ] Jim Walker [ jwalker ]
          owend Daniel Owen added a comment -

          The machines are 56-core, so I suspect the issue is the combination of a high CPU core count and a high bucket count.

          jwalker Jim Walker added a comment -

          I've found one minor issue (at least it appears minor at the moment): we waste 1 tcache. From some debugging with our unit tests, it appears that jemalloc does make use of id:0 for a valid tcache, whereas we treat id:0 as "no-tcache".

          This appears to mean that the very first tcache we create gets id:0; we store that as the current tcache, and then later we think no tcache exists and allocate one more.

          Not sure if this is an oversight in the original implementation, or a change of behaviour in jemalloc - I suspect an oversight stemming from the fact that arena:0 is a reserved value, so the same was assumed to apply to tcaches. A sketch of the suspected pattern follows.
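
          A hedged illustration of that pattern (identifiers here are mine, not the real je_arena_malloc.cc code):

          #include <jemalloc/jemalloc.h>
          #include <stdexcept>

          struct ThreadTCache {
              unsigned id = 0; // 0 doubles as the "no tcache yet" sentinel

              unsigned get() {
                  if (id == 0) { // bug: jemalloc may hand out 0 as a valid tcache id
                      unsigned newId;
                      size_t sz = sizeof(newId);
                      if (je_mallctl("tcache.create", &newId, &sz, nullptr, 0) != 0) {
                          throw std::runtime_error("tcache.create failed");
                      }
                      // If newId == 0 the sentinel check fires again on the next
                      // call and a second tcache is created, wasting the first id.
                      id = newId;
                  }
                  return id;
              }
          };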

          jwalker Jim Walker added a comment - - edited

          From couchbase.log I see memcached with 227 threads, which explains the exhaustion.

          See NLWP column (trimmed output from ps below)

          USER        PID    LWP   PPID NLWP %CPU  MAJFL  MINFL PRI  NI    VSZ   RSS TT       STAT WCHAN         STARTED   TIME COMMAND         COMMAND
          couchba+ 173559 173559   7501  227  2.4     61  14541  19   0 37355348 16959844 ?   SLsl ep_poll      06:49:29   1:45 memcached       /opt/couchbase/bin/memcached...
          

          Note that only threads which will execute a bucket (via some form of ArenaMalloc::switchToClient) will allocate a tcache for that bucket, so threads like mc:check_stdin won't consume a tcache, but threads like mc:worker_47 will.

          A quick grep -v to remove the threads which shouldn't execute a bucket leaves about 217 threads, which will eventually try to use 217 * 20 = 4340 tcaches, well over the 4093 limit we have. See the attached threads.txt.

          For 4 of the threads I can't be sure, without further checks, whether they would create a tcache.

          It looks like we will have to limit the total number of threads we create to keep under the limit - to support 30 buckets we'll have to cap the total threads at about 130 (4093 / 30 ≈ 136).

          jwalker Jim Walker made changes -
          Attachment threads.txt [ 180509 ]
          owend Daniel Owen made changes -
          Sprint KV March-22 [ 2050 ]
          owend Daniel Owen made changes -
          Rank Ranked higher
          owend Daniel Owen made changes -
          Due Date 01/Apr/22
          jwalker Jim Walker added a comment - - edited

          The crash is fairly simple to reproduce. The system I used only has 24 cores, so I needed to set the "Reader Thread Settings" and "Writer Thread Settings" to 64. If anyone has a system with 64 cores, setting these to "disk i/o optimised" will also work.

          • Reproduction only requires a single node.
          • Go into "Settings" -> "Advanced Data Settings" and change "Reader Thread Settings" and "Writer Thread Settings".
            • Note that threads won't be created until KV-engine does some work; optionally restart KV here to see how many threads get created.
            • You can validate the KV-engine thread count via ps, e.g. ps -AwwL -o nlwp,command | grep memcached shows 178 on my test system.
            • Whatever NLWP says, 30 * NLWP must exceed 4093 to repro the bug.
          • Create 30 full-eviction buckets (I used 256 MB RAM each); all were couchstore for this test.
          • Next, loop pillowfight over each bucket; here 10k documents are mutated/read using 64 threads/connections:

          for X in $(seq 0 29); do ./cbc-pillowfight -U "couchbase://localhost:12202/b${X}" -u Administrator -P asdasd -m 104 -M 104  -I 10000  -t 64 -c 100    ; done
          

          Shortly afterwards, KV crashes.

          ritam.sharma Ritam Sharma added a comment -

          Based on the conversation with the scrum team - PM is okay with this defect being release-noted for customers and with the bug being fixed in 7.1.1.

          Changing fixVersion to 7.1.1 and adding the release-notes label while the team continues to review the fixes.

          CC - John Liang Raju Suravarjjala Mary Roth Lynn Straus

          owend Daniel Owen made changes -
          Labels magma magma releasenote
          owend Daniel Owen made changes -
          Fix Version/s 7.1.1 [ 18320 ]
          Fix Version/s Neo [ 17615 ]

          build-team Couchbase Build Team added a comment -

          Build couchbase-server-7.2.0-1038 contains platform commit c5a6a3b with commit message:
          MB-51608: Use automatic jemalloc tcache selection
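
          For context, a minimal sketch of what automatic tcache selection could look like (an assumption about the direction of the change, not a reproduction of commit c5a6a3b): instead of allocating one explicit tcache per bucket per thread from the bounded 4093-entry table, bind the thread to the bucket's arena and let jemalloc use its automatic per-thread tcache.

          #include <jemalloc/jemalloc.h>

          // Bind the calling thread's default arena to the bucket's arena;
          // "thread.arena" is a standard jemalloc mallctl. Subsequent
          // allocations then use the thread's automatic tcache, which
          // needs no explicit id from the bounded tcache table.
          void bindThreadToBucketArena(unsigned arenaId) {
              je_mallctl("thread.arena", nullptr, nullptr, &arenaId, sizeof(arenaId));
          }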
          drigby Dave Rigby made changes -
          Link This issue blocks MB-51648 [ MB-51648 ]
          drigby Dave Rigby made changes -
          Sprint KV March-22 [ 2050 ] KV Post-Neo, KV March-22 [ 2016, 2050 ]
          drigby Dave Rigby made changes -
          Rank Ranked higher
          wayne Wayne Siu made changes -
          Labels magma releasenote approved-for-7.1.1 magma releasenote
          drigby Dave Rigby made changes -
          Sprint KV Post-Neo (April), KV March-22 [ 2016, 2050 ] KV March-22, KV May 22 [ 2050, 2128 ]

          build-team Couchbase Build Team added a comment -

          Build couchbase-server-7.2.0-1158 contains platform commit 63e7c8a with commit message:
          MB-51608: Remove dead tcache code

          build-team Couchbase Build Team added a comment -

          Build couchbase-server-7.1.1-3073 contains platform commit ebae7a1 with commit message:
          MB-51608: [BP] Use automatic jemalloc tcache selection

          build-team Couchbase Build Team added a comment -

          Build couchbase-server-7.2.0-1232 contains platform commit ebae7a1 with commit message:
          MB-51608: [BP] Use automatic jemalloc tcache selection

          build-team Couchbase Build Team added a comment -

          Build couchbase-server-8.0.0-1004 contains platform commit ebae7a1 with commit message:
          MB-51608: [BP] Use automatic jemalloc tcache selection
          wayne Wayne Siu made changes -
          Link This issue blocks MB-52510 [ MB-52510 ]
          lynn.straus Lynn Straus made changes -
          Fix Version/s 7.1.2 [ 18414 ]
          Fix Version/s 7.1.1 [ 18320 ]
          lynn.straus Lynn Straus made changes -
          Labels approved-for-7.1.1 magma releasenote approved-for-7.1.1 approved-for-7.1.2 magma releasenote
          lynn.straus Lynn Straus made changes -
          Link This issue blocks MB-52510 [ MB-52510 ]
          jwalker Jim Walker made changes -
          Assignee Jim Walker [ jwalker ] Sarath Lakshman [ sarath ]
          Resolution Fixed [ 1 ]
          Status Open [ 1 ] Resolved [ 5 ]
          ritam.sharma Ritam Sharma made changes -
          Assignee Sarath Lakshman [ sarath ] Wayne Siu [ wayne ]
          ritam.sharma Ritam Sharma made changes -
          Labels approved-for-7.1.1 approved-for-7.1.2 magma releasenote approved-for-7.1.1 approved-for-7.1.2 magma perf releasenote
          wayne Wayne Siu made changes -
          Assignee Wayne Siu [ wayne ] Bo-Chun Wang [ bo-chun.wang ]
          wayne Wayne Siu added a comment -

          Bo-Chun Wang
          Can you help verify the fix? Thanks.


          I have a good run on build 7.1.2-3419, so I am closing this ticket.

          http://perf.jenkins.couchbase.com/job/rhea-dev2/183/
          bo-chun.wang Bo-Chun Wang made changes -
          Status Resolved [ 5 ] Closed [ 6 ]

          People

            Assignee: bo-chun.wang Bo-Chun Wang
            Reporter: sarath Sarath Lakshman
