Details
- Type: Bug
- Resolution: Cannot Reproduce
- Priority: Critical
- Fix Version/s: Cheshire-Cat
- Environment: Centos-7 64 bit; Couchbase Enterprise Build 7.0.0-2217
- Triage: Triaged
- Operating System: Centos 64-bit
- 1
- Is this a Regression?: No
Description
Summary:
Memcached crashes seen during a rebalance-in operation with durability (persist_to_majority) data load
Script to Reproduce:
./testrunner -i /tmp/durability_volume.ini sdk_client_pool=True,rerun=False,get-cbcollect-info=True -t bucket_collections.collections_rebalance.CollectionsRebalance.test_data_load_collections_with_rebalance_in,nodes_init=3,nodes_in=2,override_spec_params=durability;replicas,durability=PERSIST_TO_MAJORITY,replicas=Bucket.ReplicaNum.TWO,bucket_spec=multi_bucket.buckets_all_membase_for_rebalance_tests,data_load_stage=before,GROUP=durability_persist_to_majority
Steps to reproduce:
1. Create a 3-node cluster (see the sketch after the table)
+----------------+----------+--------------+
| Nodes          | Services | Status       |
+----------------+----------+--------------+
| 172.23.105.211 | kv       | Cluster node |
| 172.23.105.212 | None     | <--- IN ---  |
| 172.23.105.213 | None     | <--- IN ---  |
+----------------+----------+--------------+
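For reference, a minimal sketch of scripting step 1 against the Couchbase cluster REST API (credentials are placeholders and the memory-quota/services setup is abbreviated; this is not the testrunner code itself):
{code:python}
import requests

BASE = 'http://172.23.105.211:8091'      # first node of the cluster
ADMIN = ('Administrator', 'password')    # placeholder credentials

# Set the admin credentials on the first node, making it a one-node cluster.
requests.post(BASE + '/settings/web',
              data={'username': ADMIN[0], 'password': ADMIN[1],
                    'port': 'SAME'}).raise_for_status()

# Join the two remaining nodes as KV (data) nodes; the actual data movement
# happens on the subsequent rebalance (same call as shown under step 4).
for ip in ('172.23.105.212', '172.23.105.213'):
    requests.post(BASE + '/controller/addNode', auth=ADMIN,
                  data={'hostname': ip, 'user': ADMIN[0],
                        'password': ADMIN[1],
                        'services': 'kv'}).raise_for_status()
{code}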
2. Create buckets + initial data load (see the sketch after the table)
+---------+---------+----------+-----+--------+------------+-----------+-----------+
| Bucket  | Type    | Replicas | TTL | Items  | RAM Quota  | RAM Used  | Disk Used |
+---------+---------+----------+-----+--------+------------+-----------+-----------+
| bucket1 | membase | 2        | 0   | 30000  | 314572800  | 111739840 | 238620240 |
| bucket2 | membase | 2        | 0   | 30000  | 314572800  | 101204560 | 386682612 |
| default | membase | 2        | 0   | 500000 | 4718592000 | 406302544 | 384156314 |
+---------+---------+----------+-----+--------+------------+-----------+-----------+
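The three buckets can be created with the bucket REST endpoint; a sketch (RAM quotas converted from the byte values in the table to per-node MB; credentials are placeholders):
{code:python}
import requests

BASE = 'http://172.23.105.211:8091'
ADMIN = ('Administrator', 'password')   # placeholder credentials

# Couchbase ("membase") buckets with 2 replicas, matching the table above:
# 314572800 B = 300 MB, 4718592000 B = 4500 MB (ramQuotaMB is per node).
for name, quota_mb in (('bucket1', 300), ('bucket2', 300), ('default', 4500)):
    requests.post(BASE + '/pools/default/buckets', auth=ADMIN,
                  data={'name': name,
                        'bucketType': 'membase',
                        'ramQuotaMB': quota_mb,
                        'replicaNumber': 2}).raise_for_status()
{code}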
3. Start the data load again (see the durability sketch below the log line)
2020-06-01 17:46:19,364 | test | INFO | MainProcess | MainThread | [collections_rebalance:load_collections_with_rebalance:528] Doing collection data load before rebalance_in
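The load here uses synchronous durability. As a rough sketch, an equivalent durable write with the Couchbase Python SDK 3.x looks like the following (connection details and the document are placeholders; exact module paths vary slightly between SDK releases):
{code:python}
from couchbase.auth import PasswordAuthenticator
from couchbase.cluster import Cluster
from couchbase.durability import Durability, ServerDurability
from couchbase.options import ClusterOptions, UpsertOptions

cluster = Cluster('couchbase://172.23.105.211',
                  ClusterOptions(PasswordAuthenticator('Administrator',
                                                       'password')))
collection = cluster.bucket('default').default_collection()

# PERSIST_TO_MAJORITY: the server acknowledges the mutation only after a
# majority of the replica set has persisted it to disk.
collection.upsert('doc-1', {'value': 1},
                  UpsertOptions(durability=ServerDurability(
                      Durability.PERSIST_TO_MAJORITY)))
{code}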
4. Start the rebalance-in operation (see the sketch after the table)
2020-06-01 17:47:24,970 | test | INFO | MainProcess | pool-23-thread-21 | [table_view:display:72] Rebalance Overview
+----------------+----------+--------------+
| Nodes          | Services | Status       |
+----------------+----------+--------------+
| 172.23.105.212 | kv       | Cluster node |
| 172.23.105.213 | kv       | Cluster node |
| 172.23.105.211 | kv       | Cluster node |
| 172.23.105.215 | None     | <--- IN ---  |
| 172.23.105.217 | None     | <--- IN ---  |
+----------------+----------+--------------+
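A sketch of triggering the rebalance-in over REST, assuming .215 and .217 were already added with /controller/addNode (credentials are placeholders):
{code:python}
import requests

BASE = 'http://172.23.105.211:8091'
ADMIN = ('Administrator', 'password')   # placeholder credentials

# knownNodes must list every node by otpNode name (e.g. ns_1@172.23.105.211);
# ejectedNodes stays empty for a pure rebalance-in.
statuses = requests.get(BASE + '/nodeStatuses', auth=ADMIN).json()
known = ','.join(v['otpNode'] for v in statuses.values())
requests.post(BASE + '/controller/rebalance', auth=ADMIN,
              data={'knownNodes': known,
                    'ejectedNodes': ''}).raise_for_status()

# Poll the task list to observe progress / failure.
for task in requests.get(BASE + '/pools/default/tasks', auth=ADMIN).json():
    if task['type'] == 'rebalance':
        print(task.get('status'), task.get('progress'))
{code}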
This rebalance operation fails.
A total of 8 coredumps are seen. All except the coredump on .211 are the same as the ones seen in https://issues.couchbase.com/browse/MB-39272.
The coredump on .211 looks different:
(gdb) bt full
#0  __GI___pthread_mutex_lock (mutex=0x2e65746972776063) at ../nptl/pthread_mutex_lock.c:65
        type = <optimized out>
        id = <optimized out>
#1  0x000000000046a1b8 in __gthread_mutex_lock (__mutex=0x2e65746972776063) at /usr/local/include/c++/7.3.0/x86_64-pc-linux-gnu/bits/gthr-default.h:748
No locals.
#2  lock (this=<optimized out>) at /usr/local/include/c++/7.3.0/bits/std_mutex.h:103
No locals.
#3  lock_guard (__m=..., this=<synthetic pointer>) at /usr/local/include/c++/7.3.0/bits/std_mutex.h:162
No locals.
#4  add_conn_to_pending_io_list (c=0x7f59c695f100, cookie=cookie@entry=0x7f5978524c00, status=ENGINE_SUCCESS) at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/daemon/thread.cc:483
No locals.
#5  0x000000000046a91f in notify_io_complete (void_cookie=..., status=<optimized out>) at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/daemon/thread.cc:349
        ccookie = <optimized out>
        cookie = <optimized out>
#6  0x00007f59cbfd05e8 in EventuallyPersistentEngine::notifyIOComplete (this=0x7f5988100000, cookie=0x7f5978524c00, status=status@entry=ENGINE_SUCCESS)
    at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/ep_engine.cc:6255
        bt = {dest = 0x7f5988100470, start = {__d = {__r = 3431165626269461}}, name = 0x0, out = 0x0}
        guard = {engine = 0x7f5988100000}
#7  0x00007f59cbf2fbb0 in ConnMap::processPendingNotifications (this=<optimized out>) at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/connmap.cc:174
        conn = {<std::__shared_ptr<ConnHandler, (__gnu_cxx::_Lock_policy)2>> = {<std::__shared_ptr_access<ConnHandler, (__gnu_cxx::_Lock_policy)2, false, false>> = {<No data fields>}, _M_ptr = <optimized out>, _M_refcount = {_M_pi = 0x7f5979721700}}, <No data fields>}
        queue = {c = {<std::_Deque_base<std::weak_ptr<ConnHandler>, std::allocator<std::weak_ptr<ConnHandler> > >> = {_M_impl = {<std::allocator<std::weak_ptr<ConnHandler> >> = {<__gnu_cxx::new_allocator<std::weak_ptr<ConnHandler> >> = {<No data fields>}, <No data fields>}, _M_map = 0x7f597887b340, _M_map_size = 8, _M_start = {_M_cur = 0x7f5978d77e00, _M_first = 0x7f5978d77e00, _M_last = 0x7f5978d78000, _M_node = 0x7f597887b358}, _M_finish = {_M_cur = 0x7f5978d77e10, _M_first = 0x7f5978d77e00, _M_last = 0x7f5978d78000, _M_node = 0x7f597887b358}}}, <No data fields>}}
        phosphor_internal_category_enabled_164 = {_M_b = {_M_p = 0x0}, static is_always_lock_free = <error reading variable: No global symbol "std::atomic<std::atomic<phosphor::CategoryStatus> const*>::is_always_lock_free".>}
        phosphor_internal_category_enabled_temp_164 = <optimized out>
        phosphor_internal_tpi_164 = {category = 0x29639f <Address 0x29639f out of bounds>, name = 0x2963bc <Address 0x2963bc out of bounds>, type = phosphor::Complete, argument_names = {_M_elems = {0x2963d8 <Address 0x2963d8 out of bounds>, 0x2bf97b <Address 0x2bf97b out of bounds>}}, argument_types = {_M_elems = {phosphor::is_uint, phosphor::is_none}}}
        phosphor_internal_guard_164 = {tpi = 0x7f59cc40fda0 <ConnMap::processPendingNotifications()::phosphor_internal_tpi_164>, enabled = true, arg1 = 1, arg2 = {<No data fields>}, start = {__d = {__r = 3431165626266788}}}
        phosphor_internal_category_enabled_169 = {_M_b = {_M_p = 0x0}, static is_always_lock_free = <error reading variable: No global symbol "std::atomic<std::atomic<phosphor::CategoryStatus> const*>::is_always_lock_free".>}
        phosphor_internal_category_enabled_temp_169 = <optimized out>
        phosphor_internal_tpi_wait_169 = {category = 0x2963b1 <Address 0x2963b1 out of bounds>, name = 0x296368 <Address 0x296368 out of bounds>, type = phosphor::Complete, argument_names = {_M_elems = {0x2963b7 <Address 0x2963b7 out of bounds>, 0x2bf97b <Address 0x2bf97b out of bounds>}}, argument_types = {_M_elems = {phosphor::is_pointer, phosphor::is_none}}}
        phosphor_internal_tpi_held_169 = {category = 0x2963b1 <Address 0x2963b1 out of bounds>, name = 0x296330 <Address 0x296330 out of bounds>, type = phosphor::Complete, argument_names = {_M_elems = {0x2bf97b <Address 0x2bf97b out of bounds>, 0x2bf97b <Address 0x2bf97b out of bounds>}}, argument_types = {_M_elems = {phosphor::is_pointer, phosphor::is_none}}}
        phosphor_internal_guard_169 = {tpiWait = 0x7f59cc40fd60 <ConnMap::processPendingNotifications()::phosphor_internal_tpi_wait_169>, tpiHeld = 0x7f59cc40fd20 <ConnMap::processPendingNotifications()::phosphor_internal_tpi_held_169>, enabled = true, mutex = @0x7f598816f008, threshold = {__r = 10000000}, start = {__d = {__r = 3431165626267584}}, lockedAt = {__d = {__r = 3431165626268456}}, releasedAt = {__d = {__r = 0}}}
#8  0x00007f59cbf2ca77 in notifyConnections (this=0x7f59882ff290) at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/conn_notifier.cc:92
        inverse = false
#9  ConnNotifierCallback::run (this=<optimized out>) at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/conn_notifier.cc:39
        connNotifier = {<std::__shared_ptr<ConnNotifier, (__gnu_cxx::_Lock_policy)2>> = {<std::__shared_ptr_access<ConnNotifier, (__gnu_cxx::_Lock_policy)2, false, false>> = {<No data fields>}, _M_ptr = <optimized out>, _M_refcount = {_M_pi = 0x7f59882ff280}}, <No data fields>}
#10 0x00007f59cc006be3 in GlobalTask::execute (this=0x7f59881548b0) at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/globaltask.cc:73
        guard = {previous = 0x0}
#11 0x00007f59cbfff48f in ExecutorThread::run (this=0x7f59c69bb960) at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/executorthread.cc:188
        curTaskDescr = {static npos = 18446744073709551615, _M_dataplus = {<std::allocator<char>> = {<__gnu_cxx::new_allocator<char>> = {<No data fields>}, <No data fields>}, _M_p = 0x7f59c6887c60 <Address 0x7f59c6887c60 out of bounds>}, _M_string_length = 23, {_M_local_buf = "\027\000\000\000\000\000\000\000pressor", _M_allocated_capacity = 23}}
        woketime = <optimized out>
        scheduleOverhead = <optimized out>
        again = <optimized out>
        runtime = <optimized out>
        q = <optimized out>
        tick = 198 '\306'
        guard = {engine = 0x0}
#12 0x00007f59caa10777 in run (this=0x7f59c764c0d0) at /home/couchbase/jenkins/workspace/couchbase-server-unix/platform/src/cb_pthreads.cc:58
No locals.
#13 platform_thread_wrap (arg=0x7f59c764c0d0) at /home/couchbase/jenkins/workspace/couchbase-server-unix/platform/src/cb_pthreads.cc:71
        context = {_M_t = {_M_t = {<std::_Tuple_impl<0, CouchbaseThread*, std::default_delete<CouchbaseThread> >> = {<std::_Tuple_impl<1, std::default_delete<CouchbaseThread> >> = {<std::_Head_base<1, std::default_delete<CouchbaseThread>, true>> = {<std::default_delete<CouchbaseThread>> = {<No data fields>}, <No data fields>}, <No data fields>}, <std::_Head_base<0, CouchbaseThread*, false>> = {_M_head_impl = 0x7f59c764c0d0}, <No data fields>}, <No data fields>}}}
#14 0x00007f59c804dea5 in start_thread (arg=0x7f598aff5700) at pthread_create.c:307
        __res = <optimized out>
        pd = 0x7f598aff5700
        now = <optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140022560806656, 1896906296006368233, 0, 8392704, 0, 140022560806656, -1954492124063245335, -1954357521255539735}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = <optimized out>
        pagesize_m1 = <optimized out>
        sp = <optimized out>
        freesize = <optimized out>
#15 0x00007f59c7d768dd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
No locals.
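One note on frame #0 (an observation, not a confirmed root cause): the faulting mutex address 0x2e65746972776063 is not a plausible heap pointer. Read back as the little-endian bytes it would occupy in memory, it is printable ASCII, which suggests the lock word had been overwritten with string data before the lock attempt:
{code:python}
import struct

# Frame #0 crashed in pthread_mutex_lock(mutex=0x2e65746972776063).
# Packed little-endian (as the value would sit in memory), it is ASCII text:
addr = 0x2e65746972776063
print(struct.pack('<Q', addr))  # b'c`write.'
{code}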
From the memcached log on node .211:
grep CRITICAL memcached.log
2020-06-01T17:47:46.501335-07:00 CRITICAL Breakpad caught a crash (Couchbase version 7.0.0-2217). Writing crash dump to /opt/couchbase/var/lib/couchbase/crash/ca6dfb17-512c-458f-30425c8a-f9c3cde3.dmp before terminating.
2020-06-01T17:47:46.501373-07:00 CRITICAL Stack backtrace of crashed thread:
2020-06-01T17:47:46.502284-07:00 CRITICAL /opt/couchbase/bin/memcached() [0x400000+0x1397ad]
2020-06-01T17:47:46.502308-07:00 CRITICAL /opt/couchbase/bin/memcached(_ZN15google_breakpad16ExceptionHandler12GenerateDumpEPNS0_12CrashContextE+0x3ea) [0x400000+0x14f4fa]
2020-06-01T17:47:46.502318-07:00 CRITICAL /opt/couchbase/bin/memcached(_ZN15google_breakpad16ExceptionHandler13SignalHandlerEiP9siginfo_tPv+0xb8) [0x400000+0x14f838]
2020-06-01T17:47:46.502325-07:00 CRITICAL /lib64/libpthread.so.0() [0x7f59c8046000+0xf630]
2020-06-01T17:47:46.502332-07:00 CRITICAL /lib64/libpthread.so.0(pthread_mutex_lock+0) [0x7f59c8046000+0x9d00]
2020-06-01T17:47:46.502342-07:00 CRITICAL /opt/couchbase/bin/memcached() [0x400000+0x6a1b8]
2020-06-01T17:47:46.502350-07:00 CRITICAL /opt/couchbase/bin/memcached() [0x400000+0x6a91f]
2020-06-01T17:47:46.502362-07:00 CRITICAL /opt/couchbase/bin/../lib/libep.so() [0x7f59cbec1000+0x10f5e8]
2020-06-01T17:47:46.502371-07:00 CRITICAL /opt/couchbase/bin/../lib/libep.so() [0x7f59cbec1000+0x6ebb0]
2020-06-01T17:47:46.502379-07:00 CRITICAL /opt/couchbase/bin/../lib/libep.so() [0x7f59cbec1000+0x6ba77]
2020-06-01T17:47:46.502388-07:00 CRITICAL /opt/couchbase/bin/../lib/libep.so() [0x7f59cbec1000+0x145be3]
2020-06-01T17:47:46.502394-07:00 CRITICAL /opt/couchbase/bin/../lib/libep.so() [0x7f59cbec1000+0x13e48f]
2020-06-01T17:47:46.502400-07:00 CRITICAL /opt/couchbase/bin/../lib/libplatform_so.so.0.1.0() [0x7f59caa00000+0x10777]
2020-06-01T17:47:46.502406-07:00 CRITICAL /lib64/libpthread.so.0() [0x7f59c8046000+0x7ea5]
2020-06-01T17:47:46.502438-07:00 CRITICAL /lib64/libc.so.6(clone+0x6d) [0x7f59c7c78000+0xfe8dd]