  Couchbase Server
  MB-39692

[Collections] Memcached crashes seen during rebalance-in op + durability data load


Details

    Description

      Summary:

      Memcached crashes seen during rebalance-in op with durability (persist_to_majority) data load

      Script to Repro:

      ./testrunner -i /tmp/durability_volume.ini sdk_client_pool=True,rerun=False,get-cbcollect-info=True -t bucket_collections.collections_rebalance.CollectionsRebalance.test_data_load_collections_with_rebalance_in,nodes_init=3,nodes_in=2,override_spec_params=durability;replicas,durability=PERSIST_TO_MAJORITY,replicas=Bucket.ReplicaNum.TWO,bucket_spec=multi_bucket.buckets_all_membase_for_rebalance_tests,data_load_stage=before,GROUP=durability_persist_to_majority

      Steps to reproduce:
      1. Create a 3 node cluster
      +----------------+----------+--------------+
      | Nodes          | Services | Status       |
      +----------------+----------+--------------+
      | 172.23.105.211 | kv       | Cluster node |
      | 172.23.105.212 | None     | <--- IN ---  |
      | 172.23.105.213 | None     | <--- IN ---  |
      +----------------+----------+--------------+

      2.  Create buckets + initial data load
      +---------+---------+----------+-----+--------+-------------------+------------------+-------------------+
      | Bucket  | Type    | Replicas | TTL | Items  | RAM Quota (bytes) | RAM Used (bytes) | Disk Used (bytes) |
      +---------+---------+----------+-----+--------+-------------------+------------------+-------------------+
      | bucket1 | membase | 2        | 0   | 30000  | 314572800         | 111739840        | 238620240         |
      | bucket2 | membase | 2        | 0   | 30000  | 314572800         | 101204560        | 386682612         |
      | default | membase | 2        | 0   | 500000 | 4718592000        | 406302544        | 384156314         |
      +---------+---------+----------+-----+--------+-------------------+------------------+-------------------+

      3.  Start data load again 
      2020-06-01 17:46:19,364 | test | INFO | MainProcess | MainThread | [collections_rebalance:load_collections_with_rebalance:528] Doing collection data load before rebalance_in

      4. Start rebalance-in operation

      2020-06-01 17:47:24,970 | test | INFO | MainProcess | pool-23-thread-21 | [table_view:display:72] Rebalance Overview
      +----------------+----------+--------------+
      | Nodes          | Services | Status       |
      +----------------+----------+--------------+
      | 172.23.105.212 | kv       | Cluster node |
      | 172.23.105.213 | kv       | Cluster node |
      | 172.23.105.211 | kv       | Cluster node |
      | 172.23.105.215 | None     | <--- IN ---  |
      | 172.23.105.217 | None     | <--- IN ---  |
      +----------------+----------+--------------+

      This rebalance operation fails.

      A total of 8 coredumps are seen. All except the one on .211 are the same as those seen in https://issues.couchbase.com/browse/MB-39272. The coredump on .211 looks different:

      (gdb) bt full
      #0  __GI___pthread_mutex_lock (mutex=0x2e65746972776063) at ../nptl/pthread_mutex_lock.c:65
              type = <optimized out>
              id = <optimized out>
      #1  0x000000000046a1b8 in __gthread_mutex_lock (__mutex=0x2e65746972776063) at /usr/local/include/c++/7.3.0/x86_64-pc-linux-gnu/bits/gthr-default.h:748
      No locals.
      #2  lock (this=<optimized out>) at /usr/local/include/c++/7.3.0/bits/std_mutex.h:103
      No locals.
      #3  lock_guard (__m=..., this=<synthetic pointer>) at /usr/local/include/c++/7.3.0/bits/std_mutex.h:162
      No locals.
      #4  add_conn_to_pending_io_list (c=0x7f59c695f100, cookie=cookie@entry=0x7f5978524c00, status=ENGINE_SUCCESS) at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/daemon/thread.cc:483
      No locals.
      #5  0x000000000046a91f in notify_io_complete (void_cookie=..., status=<optimized out>) at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/daemon/thread.cc:349
              ccookie = <optimized out>
              cookie = <optimized out>
      #6  0x00007f59cbfd05e8 in EventuallyPersistentEngine::notifyIOComplete (this=0x7f5988100000, cookie=0x7f5978524c00, status=status@entry=ENGINE_SUCCESS)
          at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/ep_engine.cc:6255
              bt = {dest = 0x7f5988100470, start = {__d = {__r = 3431165626269461}}, name = 0x0, out = 0x0}
              guard = {engine = 0x7f5988100000}
      #7  0x00007f59cbf2fbb0 in ConnMap::processPendingNotifications (this=<optimized out>) at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/connmap.cc:174
              conn = {<std::__shared_ptr<ConnHandler, (__gnu_cxx::_Lock_policy)2>> = {<std::__shared_ptr_access<ConnHandler, (__gnu_cxx::_Lock_policy)2, false, false>> = {<No data fields>}, _M_ptr = <optimized out>, _M_refcount = {
                    _M_pi = 0x7f5979721700}}, <No data fields>}
              queue = {c = {<std::_Deque_base<std::weak_ptr<ConnHandler>, std::allocator<std::weak_ptr<ConnHandler> > >> = {
                    _M_impl = {<std::allocator<std::weak_ptr<ConnHandler> >> = {<__gnu_cxx::new_allocator<std::weak_ptr<ConnHandler> >> = {<No data fields>}, <No data fields>}, _M_map = 0x7f597887b340, _M_map_size = 8, _M_start = {
                        _M_cur = 0x7f5978d77e00, _M_first = 0x7f5978d77e00, _M_last = 0x7f5978d78000, _M_node = 0x7f597887b358}, _M_finish = {_M_cur = 0x7f5978d77e10, _M_first = 0x7f5978d77e00, _M_last = 0x7f5978d78000, 
                        _M_node = 0x7f597887b358}}}, <No data fields>}}
              phosphor_internal_category_enabled_164 = {_M_b = {_M_p = 0x0}, static is_always_lock_free = <error reading variable: No global symbol "std::atomic<std::atomic<phosphor::CategoryStatus> const*>::is_always_lock_free".>}
              phosphor_internal_category_enabled_temp_164 = <optimized out>
              phosphor_internal_tpi_164 = {category = 0x29639f <Address 0x29639f out of bounds>, name = 0x2963bc <Address 0x2963bc out of bounds>, type = phosphor::Complete, argument_names = {_M_elems = {
                    0x2963d8 <Address 0x2963d8 out of bounds>, 0x2bf97b <Address 0x2bf97b out of bounds>}}, argument_types = {_M_elems = {phosphor::is_uint, phosphor::is_none}}}
              phosphor_internal_guard_164 = {tpi = 0x7f59cc40fda0 <ConnMap::processPendingNotifications()::phosphor_internal_tpi_164>, enabled = true, arg1 = 1, arg2 = {<No data fields>}, start = {__d = {__r = 3431165626266788}}}
              phosphor_internal_category_enabled_169 = {_M_b = {_M_p = 0x0}, static is_always_lock_free = <error reading variable: No global symbol "std::atomic<std::atomic<phosphor::CategoryStatus> const*>::is_always_lock_free".>}
              phosphor_internal_category_enabled_temp_169 = <optimized out>
              phosphor_internal_tpi_wait_169 = {category = 0x2963b1 <Address 0x2963b1 out of bounds>, name = 0x296368 <Address 0x296368 out of bounds>, type = phosphor::Complete, argument_names = {_M_elems = {
                    0x2963b7 <Address 0x2963b7 out of bounds>, 0x2bf97b <Address 0x2bf97b out of bounds>}}, argument_types = {_M_elems = {phosphor::is_pointer, phosphor::is_none}}}
              phosphor_internal_tpi_held_169 = {category = 0x2963b1 <Address 0x2963b1 out of bounds>, name = 0x296330 <Address 0x296330 out of bounds>, type = phosphor::Complete, argument_names = {_M_elems = {
                    0x2bf97b <Address 0x2bf97b out of bounds>, 0x2bf97b <Address 0x2bf97b out of bounds>}}, argument_types = {_M_elems = {phosphor::is_pointer, phosphor::is_none}}}
              phosphor_internal_guard_169 = {tpiWait = 0x7f59cc40fd60 <ConnMap::processPendingNotifications()::phosphor_internal_tpi_wait_169>, tpiHeld = 0x7f59cc40fd20 <ConnMap::processPendingNotifications()::phosphor_internal_tpi_held_169>, 
                enabled = true, mutex = @0x7f598816f008, threshold = {__r = 10000000}, start = {__d = {__r = 3431165626267584}}, lockedAt = {__d = {__r = 3431165626268456}}, releasedAt = {__d = {__r = 0}}}
      #8  0x00007f59cbf2ca77 in notifyConnections (this=0x7f59882ff290) at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/conn_notifier.cc:92
              inverse = false
      #9  ConnNotifierCallback::run (this=<optimized out>) at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/conn_notifier.cc:39
              connNotifier = {<std::__shared_ptr<ConnNotifier, (__gnu_cxx::_Lock_policy)2>> = {<std::__shared_ptr_access<ConnNotifier, (__gnu_cxx::_Lock_policy)2, false, false>> = {<No data fields>}, _M_ptr = <optimized out>, _M_refcount = {
                    _M_pi = 0x7f59882ff280}}, <No data fields>}
      #10 0x00007f59cc006be3 in GlobalTask::execute (this=0x7f59881548b0) at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/globaltask.cc:73
              guard = {previous = 0x0}
      #11 0x00007f59cbfff48f in ExecutorThread::run (this=0x7f59c69bb960) at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/executorthread.cc:188
              curTaskDescr = {static npos = 18446744073709551615, _M_dataplus = {<std::allocator<char>> = {<__gnu_cxx::new_allocator<char>> = {<No data fields>}, <No data fields>}, 
                  _M_p = 0x7f59c6887c60 <Address 0x7f59c6887c60 out of bounds>}, _M_string_length = 23, {_M_local_buf = "\027\000\000\000\000\000\000\000pressor", _M_allocated_capacity = 23}}
              woketime = <optimized out>
              scheduleOverhead = <optimized out>
              again = <optimized out>
              runtime = <optimized out>
              q = <optimized out>
              tick = 198 '\306'
              guard = {engine = 0x0}
      #12 0x00007f59caa10777 in run (this=0x7f59c764c0d0) at /home/couchbase/jenkins/workspace/couchbase-server-unix/platform/src/cb_pthreads.cc:58
      No locals.
      #13 platform_thread_wrap (arg=0x7f59c764c0d0) at /home/couchbase/jenkins/workspace/couchbase-server-unix/platform/src/cb_pthreads.cc:71
              context = {_M_t = {
                  _M_t = {<std::_Tuple_impl<0, CouchbaseThread*, std::default_delete<CouchbaseThread> >> = {<std::_Tuple_impl<1, std::default_delete<CouchbaseThread> >> = {<std::_Head_base<1, std::default_delete<CouchbaseThread>, true>> = {<std::default_delete<CouchbaseThread>> = {<No data fields>}, <No data fields>}, <No data fields>}, <std::_Head_base<0, CouchbaseThread*, false>> = {_M_head_impl = 0x7f59c764c0d0}, <No data fields>}, <No data fields>}}}
      #14 0x00007f59c804dea5 in start_thread (arg=0x7f598aff5700) at pthread_create.c:307
              __res = <optimized out>
              pd = 0x7f598aff5700
              now = <optimized out>
              unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140022560806656, 1896906296006368233, 0, 8392704, 0, 140022560806656, -1954492124063245335, -1954357521255539735}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {
                    prev = 0x0, cleanup = 0x0, canceltype = 0}}}
              not_first_call = <optimized out>
              pagesize_m1 = <optimized out>
              sp = <optimized out>
              freesize = <optimized out>
      #15 0x00007f59c7d768dd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
      No locals.
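
      Note the "mutex" address in frame #0: 0x2e65746972776063 is not a plausible heap pointer; all eight bytes are printable ASCII, which suggests the memory holding the connection's mutex had already been freed and reused (overwritten by string data) by the time the notifier thread tried to lock it. A quick standalone check (illustrative snippet, not kv_engine code):

      #include <cstdint>
      #include <cstdio>

      int main() {
          // The "mutex" pointer from frame #0 of the .211 backtrace.
          const uint64_t addr = 0x2e65746972776063;
          // On little-endian x86-64 the low-order byte sits first in memory.
          for (int i = 0; i < 8; ++i) {
              std::putchar(static_cast<char>((addr >> (8 * i)) & 0xff));
          }
          std::putchar('\n'); // prints: c`write.
          return 0;
      }

      The bytes spell out "c`write.", i.e. a fragment of text, which is the classic signature of a use-after-free rather than a valid std::mutex.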
      

      From the memcached log on node .211:

      grep CRITICAL memcached.log
      2020-06-01T17:47:46.501335-07:00 CRITICAL Breakpad caught a crash (Couchbase version 7.0.0-2217). Writing crash dump to /opt/couchbase/var/lib/couchbase/crash/ca6dfb17-512c-458f-30425c8a-f9c3cde3.dmp before terminating.
      2020-06-01T17:47:46.501373-07:00 CRITICAL Stack backtrace of crashed thread:
      2020-06-01T17:47:46.502284-07:00 CRITICAL     /opt/couchbase/bin/memcached() [0x400000+0x1397ad]
      2020-06-01T17:47:46.502308-07:00 CRITICAL     /opt/couchbase/bin/memcached(_ZN15google_breakpad16ExceptionHandler12GenerateDumpEPNS0_12CrashContextE+0x3ea) [0x400000+0x14f4fa]
      2020-06-01T17:47:46.502318-07:00 CRITICAL     /opt/couchbase/bin/memcached(_ZN15google_breakpad16ExceptionHandler13SignalHandlerEiP9siginfo_tPv+0xb8) [0x400000+0x14f838]
      2020-06-01T17:47:46.502325-07:00 CRITICAL     /lib64/libpthread.so.0() [0x7f59c8046000+0xf630]
      2020-06-01T17:47:46.502332-07:00 CRITICAL     /lib64/libpthread.so.0(pthread_mutex_lock+0) [0x7f59c8046000+0x9d00]
      2020-06-01T17:47:46.502342-07:00 CRITICAL     /opt/couchbase/bin/memcached() [0x400000+0x6a1b8]
      2020-06-01T17:47:46.502350-07:00 CRITICAL     /opt/couchbase/bin/memcached() [0x400000+0x6a91f]
      2020-06-01T17:47:46.502362-07:00 CRITICAL     /opt/couchbase/bin/../lib/libep.so() [0x7f59cbec1000+0x10f5e8]
      2020-06-01T17:47:46.502371-07:00 CRITICAL     /opt/couchbase/bin/../lib/libep.so() [0x7f59cbec1000+0x6ebb0]
      2020-06-01T17:47:46.502379-07:00 CRITICAL     /opt/couchbase/bin/../lib/libep.so() [0x7f59cbec1000+0x6ba77]
      2020-06-01T17:47:46.502388-07:00 CRITICAL     /opt/couchbase/bin/../lib/libep.so() [0x7f59cbec1000+0x145be3]
      2020-06-01T17:47:46.502394-07:00 CRITICAL     /opt/couchbase/bin/../lib/libep.so() [0x7f59cbec1000+0x13e48f]
      2020-06-01T17:47:46.502400-07:00 CRITICAL     /opt/couchbase/bin/../lib/libplatform_so.so.0.1.0() [0x7f59caa00000+0x10777]
      2020-06-01T17:47:46.502406-07:00 CRITICAL     /lib64/libpthread.so.0() [0x7f59c8046000+0x7ea5]
      2020-06-01T17:47:46.502438-07:00 CRITICAL     /lib64/libc.so.6(clone+0x6d) [0x7f59c7c78000+0xfe8dd]
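
      For what it's worth, frame #7 shows that ConnMap::processPendingNotifications already guards ConnHandler lifetime with a queue of std::weak_ptr<ConnHandler>, but the cookie that reaches add_conn_to_pending_io_list (frames #4-#6) is a raw pointer into the front-end connection, so nothing equivalent keeps that object alive. A minimal sketch of the weak_ptr pattern that avoids locking a destroyed mutex follows; all names are hypothetical, and this is not the actual kv_engine code or a proposed fix:

      #include <iostream>
      #include <memory>
      #include <mutex>
      #include <queue>

      // Hypothetical per-connection state: the mutex lives inside the
      // object, so it is destroyed together with the connection.
      struct Connection {
          std::mutex pendingIoMutex;
          int pendingIo = 0;
      };

      // Pending notifications hold weak_ptrs, never raw pointers.
      std::queue<std::weak_ptr<Connection>> pendingNotifications;

      void processPendingNotifications() {
          while (!pendingNotifications.empty()) {
              // lock() either extends the connection's lifetime for this
              // iteration or reports that it is already gone; we never
              // touch a mutex that has been destructed.
              if (auto conn = pendingNotifications.front().lock()) {
                  std::lock_guard<std::mutex> guard(conn->pendingIoMutex);
                  ++conn->pendingIo;
              } else {
                  std::cout << "connection already destroyed; skipping\n";
              }
              pendingNotifications.pop();
          }
      }

      int main() {
          auto conn = std::make_shared<Connection>();
          pendingNotifications.push(conn);
          conn.reset(); // connection torn down before the notifier runs
          processPendingNotifications(); // safely skips the dead connection
      }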
      
