During kv-engine-jepsen-nightly-352 we observed a and crash during the following test: lein trampoline run test --nodes-file ./nodes --username vagrant --ssh-private-key ./resources/vagrantkey --workload=kill --scenario=hard-reboot --replicas=2 --no-autofailover --durability=0:0:0:100 --eviction-policy=full --cas
Crash Message:
The earilist minidump in the logs is 29e7f66c-7f2c-8814-34d3a1da-481d982d.dmp on node .174, from:
2019-12-08T02:18:05.289332+00:00 CRITICAL Breakpad caught a crash (Couchbase version 6.5.1-6001). Writing crash dump to /opt/couchbase/var/lib/couchbase/crash/29e7f66c-7f2c-8814-34d3a1da-481d982d.dmp before terminating.
Backtrace of crashing thread:
(gdb) bt
#0 0x00007f2d6f4cf428 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
#1 0x00007f2d6f4d102a in __GI_abort () at abort.c:89
#2 0x00007f2d70030ced in __gnu_cxx::__verbose_terminate_handler () at /tmp/deploy/objdir/../gcc-7.3.0/libstdc++-v3/libsupc++/vterminate.cc:50
#3 0x0000000000545f12 in backtrace_terminate_handler () at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/utilities/terminate_handler.cc:86
#4 0x00007f2d7002eb16 in __cxxabiv1::__terminate (handler=<optimized out>) at /tmp/deploy/objdir/../gcc-7.3.0/libstdc++-v3/libsupc++/eh_terminate.cc:47
#5 0x00007f2d7002eb61 in std::terminate () at /tmp/deploy/objdir/../gcc-7.3.0/libstdc++-v3/libsupc++/eh_terminate.cc:57
#6 0x00007f2d7002eda3 in __cxxabiv1::__cxa_throw (obj=obj@entry=0x7f2d1c000b40, tinfo=tinfo@entry=0x7f2d6af9d010 <typeinfo for gsl::fail_fast>, dest=dest@entry=0x7f2d6aadebf0 <gsl::fail_fast::~fail_fast()>)
at /tmp/deploy/objdir/../gcc-7.3.0/libstdc++-v3/libsupc++/eh_throw.cc:93
#7 0x00007f2d6ab57dd8 in gsl::fail_fast_assert (cond=<optimized out>, message=0x7f2d6acfa710 "GSL: Precondition failure at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/ep_bucket.cc: 620")
at /home/couchbase/jenkins/workspace/couchbase-server-unix/third_party/gsl-lite/include/gsl/gsl-lite.h:473
#8 0x00007f2d6ab6b39f in EPBucket::flushVBucket (this=0x7f2d6bdef300, vbid=...) at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/ep_bucket.cc:620
#9 0x00007f2d6abbc6e0 in Flusher::flushVB (this=this@entry=0x7f2d6bf11000) at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/flusher.cc:306
#10 0x00007f2d6abbd5b1 in Flusher::step (this=0x7f2d6bf11000, task=0x7f2d6ef6d970) at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/flusher.cc:215
#11 0x00007f2d6abb7154 in ExecutorThread::run (this=0x7f2d6be547a0) at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/executorthread.cc:190
#12 0x00007f2d71e4ce27 in CouchbaseThread::run (this=0x7f2d3fd123f0) at /home/couchbase/jenkins/workspace/couchbase-server-unix/platform/src/cb_pthreads.cc:58
#13 platform_thread_wrap (arg=0x7f2d3fd123f0) at /home/couchbase/jenkins/workspace/couchbase-server-unix/platform/src/cb_pthreads.cc:71
#14 0x00007f2d6f86b6ba in start_thread (arg=0x7f2d3cdfa700) at pthread_create.c:333
#15 0x00007f2d6f5a141d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
Frame #8 has most of the interesting state - we crashed when attempting to flush vb:134 (replica). hcs is not available (optimized out), but vbstate.persistedCompletedSeqno is available and is equal to 9; so hcs must have been set and had value 9 or lower:
#8 0x00007f2d6ab6b39f in EPBucket::flushVBucket (this=0x7f2d6bdef300, vbid=...) at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/ep_bucket.cc:620
From the logs, the history of vb:134 is as follows:
Clean shutdown of bucket (from ns_server).
On restart, created as a replica during warmup, no prepares loaded from disk; then it is populated from node .15 then throws an exception during flushing:
2019-12-08T02:18:03.505480+00:00 INFO (default) VBucket: created vb:134 with state:replica initialState:replica lastSeqno:10 persistedRange:{10,10} max_cas:1575771443439534080 uuid:27948380356516 topology:null
2019-12-08T02:18:03.790250+00:00 INFO (default) EPVBucket::loadOutstandingPrepares: (vb:134) created DM with PCS:9, PPS:9, HPS:9, number of prepares loaded:0
2019-12-08T02:18:03.810857+00:00 DEBUG (default) Mapped new vb:134 in state replica
Problem identified - issue is with fix for MB-37161 (http://review.couchbase.org/#/c/119024/). The assumption that the DCP Consumer would have fully negotiated SyncReplication when the PassiveStream object is created is invalid. We must instead wait for DCP Control negotiation to complete.
Fix should be relatively straightforward- should have fix ready by end of tomorrow (Tuesday 10th).
Dave Rigby
added a comment - Problem identified - issue is with fix for MB-37161 ( http://review.couchbase.org/#/c/119024/ ). The assumption that the DCP Consumer would have fully negotiated SyncReplication when the PassiveStream object is created is invalid. We must instead wait for DCP Control negotiation to complete.
Fix should be relatively straightforward- should have fix ready by end of tomorrow (Tuesday 10th).
This issue is caused by an incomplete fix for MB-37161 - the initial fix for MB-37161 incorrectly assumed DCP features were negotiated before PassiveStream objects would be created.
Dave Rigby
added a comment - This issue is caused by an incomplete fix for MB-37161 - the initial fix for MB-37161 incorrectly assumed DCP features were negotiated before PassiveStream objects would be created.
Re-opening MB-37161 and amending the fix there.
The earilist minidump in the logs is 29e7f66c-7f2c-8814-34d3a1da-481d982d.dmp on node .174, from:
2019-12-08T02:18:05.289332+00:00 CRITICAL Breakpad caught a crash (Couchbase version 6.5.1-6001). Writing crash dump to /opt/couchbase/var/lib/couchbase/crash/29e7f66c-7f2c-8814-34d3a1da-481d982d.dmp before terminating.
Backtrace of crashing thread:
(gdb) bt
#0 0x00007f2d6f4cf428 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
#1 0x00007f2d6f4d102a in __GI_abort () at abort.c:89
#2 0x00007f2d70030ced in __gnu_cxx::__verbose_terminate_handler () at /tmp/deploy/objdir/../gcc-7.3.0/libstdc++-v3/libsupc++/vterminate.cc:50
#3 0x0000000000545f12 in backtrace_terminate_handler () at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/utilities/terminate_handler.cc:86
#4 0x00007f2d7002eb16 in __cxxabiv1::__terminate (handler=<optimized out>) at /tmp/deploy/objdir/../gcc-7.3.0/libstdc++-v3/libsupc++/eh_terminate.cc:47
#5 0x00007f2d7002eb61 in std::terminate () at /tmp/deploy/objdir/../gcc-7.3.0/libstdc++-v3/libsupc++/eh_terminate.cc:57
#6 0x00007f2d7002eda3 in __cxxabiv1::__cxa_throw (obj=obj@entry=0x7f2d1c000b40, tinfo=tinfo@entry=0x7f2d6af9d010 <typeinfo for gsl::fail_fast>, dest=dest@entry=0x7f2d6aadebf0 <gsl::fail_fast::~fail_fast()>)
at /tmp/deploy/objdir/../gcc-7.3.0/libstdc++-v3/libsupc++/eh_throw.cc:93
#7 0x00007f2d6ab57dd8 in gsl::fail_fast_assert (cond=<optimized out>, message=0x7f2d6acfa710 "GSL: Precondition failure at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/ep_bucket.cc: 620")
at /home/couchbase/jenkins/workspace/couchbase-server-unix/third_party/gsl-lite/include/gsl/gsl-lite.h:473
#8 0x00007f2d6ab6b39f in EPBucket::flushVBucket (this=0x7f2d6bdef300, vbid=...) at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/ep_bucket.cc:620
#9 0x00007f2d6abbc6e0 in Flusher::flushVB (this=this@entry=0x7f2d6bf11000) at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/flusher.cc:306
#10 0x00007f2d6abbd5b1 in Flusher::step (this=0x7f2d6bf11000, task=0x7f2d6ef6d970) at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/flusher.cc:215
#11 0x00007f2d6abb7154 in ExecutorThread::run (this=0x7f2d6be547a0) at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/executorthread.cc:190
#12 0x00007f2d71e4ce27 in CouchbaseThread::run (this=0x7f2d3fd123f0) at /home/couchbase/jenkins/workspace/couchbase-server-unix/platform/src/cb_pthreads.cc:58
#13 platform_thread_wrap (arg=0x7f2d3fd123f0) at /home/couchbase/jenkins/workspace/couchbase-server-unix/platform/src/cb_pthreads.cc:71
#14 0x00007f2d6f86b6ba in start_thread (arg=0x7f2d3cdfa700) at pthread_create.c:333
#15 0x00007f2d6f5a141d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
Frame #8 has most of the interesting state - we crashed when attempting to flush vb:134 (replica). hcs is not available (optimized out), but vbstate.persistedCompletedSeqno is available and is equal to 9; so hcs must have been set and had value 9 or lower:
#8 0x00007f2d6ab6b39f in EPBucket::flushVBucket (this=0x7f2d6bdef300, vbid=...) at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/ep_bucket.cc:620
rlh = {lock_ = 0x7f2d3fea62a0, token_ = {type_ = folly::SharedMutexToken::Type::INLINE_SHARED, slot_ = 0}}
options = VBStatePersist::VBSTATE_CACHE_UPDATE_ONLY
prev = <optimized out>
maxSeqno = 36
maxVbStateOpCas = <optimized out>
minSeqno = 35
mustCheckpointVBState = <optimized out>
collectionFlush = ...
flush_end = <optimized out>
persistedVbState = <optimized out>
hcs = <optimized out>
hps = <optimized out>
maxVisibleSeqno = <optimized out>
trans_time = <optimized out>
toFlush = {items = ...,
ranges = ...,
moreAvailable = false,
maxDeletedRevSeqno = {<boost::optional_detail::tc_optional_base<unsigned long>> = {<boost::optional_detail::optional_tag> = {<No data fields>},
m_initialized = false, m_storage = 0}, <No data fields>},
checkpointType = CheckpointType::Disk}
range = {<boost::optional_detail::optional_base<snapshot_range_t>> = {<boost::optional_detail::optional_tag> = {<No data fields>}, m_initialized = true, m_storage = {dummy_ = {data = "$\000\000\000\000\000\000\000$\000\000\000\000\000\000",
aligner_ = {<No data fields>}}}}, <No data fields>}
rwUnderlying = 0x7f2d6ef46800
vbstate = {static CurrentVersion = 3,
state = vbucket_state_dead,
maxDeletedSeqno = {counter = {_M_elems = "\000\000\000\000\000"}},
highSeqno = 10,
purgeSeqno = 0,
lastSnapStart = 36,
lastSnapEnd = 36,
maxCas = 1575771481264816128,
hlcCasEpochSeqno = 1,
mightContainXattrs = false,
supportsNamespaces = true,
version = 3,
persistedCompletedSeqno = 9,
persistedPreparedSeqno = 9,
highPreparedSeqno = 36,
maxVisibleSeqno = 10,
onDiskPrepares = 1,
checkpointType = CheckpointType::Disk,
transition = {failovers = {...},
replicationTopology = {...},
state = vbucket_state_replica}
}
vbstateRollback = {static CurrentVersion = 3,
state = vbucket_state_dead,
maxDeletedSeqno = {counter = {_M_elems = "\000\000\000\000\000"}},
highSeqno = 10,
lastSnapStart = 10,
lastSnapEnd = 10,
maxCas = 1575771443439534080,
hlcCasEpochSeqno = 1,
mightContainXattrs = false,
supportsNamespaces = true,
version = 3,
persistedCompletedSeqno = 9,
persistedPreparedSeqno = 9,
highPreparedSeqno = 9,
maxVisibleSeqno = 10,
onDiskPrepares = 1,
checkpointType = CheckpointType::Memory,
transition = {failovers = ...,
replicationTopology = {...},
state = vbucket_state_replica}}
items = ...
items_flushed = <optimized out>
moreAvailable = false
flush_start = {__d = {__r = 28865831823}}
vb = ...
From the logs, the history of vb:134 is as follows:
2019-12-08T02:18:03.505480+00:00 INFO (default) VBucket: created vb:134 with state:replica initialState:replica lastSeqno:10 persistedRange:{10,10} max_cas:1575771443439534080 uuid:27948380356516 topology:null
2019-12-08T02:18:03.790250+00:00 INFO (default) EPVBucket::loadOutstandingPrepares: (vb:134) created DM with PCS:9, PPS:9, HPS:9, number of prepares loaded:0
2019-12-08T02:18:03.810857+00:00 DEBUG (default) Mapped new vb:134 in state replica
2019-12-08T02:18:03.899025+00:00 DEBUG (default) CouchKVStore::snapshotVBucket: Snapshotted vb:134 state:{"checkpoint_type":"Memory","completed_seqno":"9","failover_table":[{"id":27948380356516,"seq":0}],"high_prepared_seqno":"9","high_seqno":"10","hlc_epoch":"1","max_cas":"1575771443439534080","max_deleted_seqno":"0","max_visible_seqno":"10","might_contain_xattrs":false,"namespaces_supported":true,"on_disk_prepares":"1","prepared_seqno":"9","purge_seqno":"0","snap_end":"10","snap_start":"10","state":"replica","version":3}
2019-12-08T02:18:04.775912+00:00 INFO 64: (default) DCP (Consumer) eq_dcpq:replication:ns_1@172.28.128.15->ns_1@172.28.128.174:default - (vb:134) Attempting to add stream: opaque_:52, start_seqno_:10, end_seqno_:18446744073709551615, vb_uuid:27948380356516, snap_start_seqno_:10, snap_end_seqno_:10, last_seqno:10, stream_req_value:{"uid":"0"}
2019-12-08T02:18:04.963678+00:00 DEBUG 64: (default) DCP (Consumer) eq_dcpq:replication:ns_1@172.28.128.15->ns_1@172.28.128.174:default - (vb:134) Add stream for opaque 52 with error code succeeded
2019-12-08T02:18:04.963706+00:00 DEBUG 64: (default) DCP (Consumer) eq_dcpq:replication:ns_1@172.28.128.15->ns_1@172.28.128.174:default - PassiveStream::transitionState: (vb:134) Transitioning from pending to awaiting-first-snapshot-marker
2019-12-08T02:18:04.963944+00:00 DEBUG 64: (default) DCP (Consumer) eq_dcpq:replication:ns_1@172.28.128.15->ns_1@172.28.128.174:default - PassiveStream::transitionState: (vb:134) Transitioning from awaiting-first-snapshot-marker to reading
# Expects() in flushVBucket() fails 2s later:
2019-12-08T02:18:05.071273+00:00 CRITICAL Caught unhandled std::exception-derived exception. what(): GSL: Precondition failure at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/ep_bucket.cc: 620