Details
-
Bug
-
Status: Closed
-
Critical
-
Resolution: Duplicate
-
6.5.0
-
None
-
None
-
Triaged
-
Yes
-
KV Sprint 2019-12
Description
During kv-engine-jepsen-nightly-352 we observed a and crash during the following test:
lein trampoline run test --nodes-file ./nodes --username vagrant --ssh-private-key ./resources/vagrantkey --workload=kill --scenario=hard-reboot --replicas=2 --no-autofailover --durability=0:0:0:100 --eviction-policy=full --cas
Crash Message:
2019-12-08T01:36:51.777389+00:00 CRITICAL Caught unhandled std::exception-derived exception. what(): GSL: Precondition failure at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/ep_bucket.cc: 620
|
ep_bucket.cc:620 |
if (hcs) { |
Expects(hcs > vbstate.persistedCompletedSeqno);
|
vbstate.persistedCompletedSeqno = *hcs;
|
}
|
Couchbase version 6.5.1-6001
Attachments
Issue Links
- is caused by
-
MB-37161 [Upgrade] failover upgrade failed during rebalance last upgrade node in cluster.
-
- Closed
-
- links to
Activity
Field | Original Value | New Value |
---|---|---|
Summary | [Jepsen] Crash | [Jepsen] Crash during hard-reboot scenario |
Remote Link | This issue links to "20191207T181623.000-0800.zip (Web Link)" [ 18944 ] |
Summary | [Jepsen] Crash during hard-reboot scenario | [Jepsen] Crashes during hard-reboot scenario |
Description |
During [kv-engine-jepsen-nightly-352|http://cv.jenkins.couchbase.com/view/Commit%20Validation/view/Jepsen/job/kv-engine-jepsen-nightly/352/] we observed a and crash during the following test:
{{lein trampoline run test --nodes-file ./nodes --username vagrant --ssh-private-key ./resources/vagrantkey --workload=kill --scenario=hard-reboot --replicas=2 --no-autofailover --durability=0:0:0:100 --eviction-policy=full --cas}} Crash Message: {noformat} 2019-12-08T01:36:51.777389+00:00 CRITICAL Caught unhandled std::exception-derived exception. what(): GSL: Precondition failure at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/ep_bucket.cc: 620 {noformat} {code:cpp|title=ep_bucket.cc:620} if (hcs) { Expects(hcs > vbstate.persistedCompletedSeqno); vbstate.persistedCompletedSeqno = *hcs; } {code} |
During [kv-engine-jepsen-nightly-352|http://cv.jenkins.couchbase.com/view/Commit%20Validation/view/Jepsen/job/kv-engine-jepsen-nightly/352/] we observed a and crash during the following test:
{{lein trampoline run test --nodes-file ./nodes --username vagrant --ssh-private-key ./resources/vagrantkey --workload=kill --scenario=hard-reboot --replicas=2 --no-autofailover --durability=0:0:0:100 --eviction-policy=full --cas}} Crash Message: {noformat} 2019-12-08T01:36:51.777389+00:00 CRITICAL Caught unhandled std::exception-derived exception. what(): GSL: Precondition failure at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/ep_bucket.cc: 620 {noformat} {code:cpp|title=ep_bucket.cc:620} if (hcs) { Expects(hcs > vbstate.persistedCompletedSeqno); vbstate.persistedCompletedSeqno = *hcs; } {code} Couchbase version 6.5.1-6001 |
Summary | [Jepsen] Crashes during hard-reboot scenario | [Jepsen] Couchbase crash during hard-reboot scenario due to HCS > |
Summary | [Jepsen] Couchbase crash during hard-reboot scenario due to HCS > | [Jepsen] Couchbase crash during hard-reboot scenario due to HCS > persistedCompletedSeqno |
Assignee | Daniel Owen [ owend ] | Dave Rigby [ drigby ] |
Priority | Major [ 3 ] | Critical [ 2 ] |
Fix Version/s | Mad-Hatter [ 15037 ] | |
Fix Version/s | 6.5.0 [ 16624 ] |
Affects Version/s | 6.5.1 [ 16622 ] |
Fix Version/s | 6.5.1 [ 16622 ] |
Affects Version/s | 6.5.1 [ 16622 ] |
Fix Version/s | 6.5.1 [ 16622 ] | |
Fix Version/s | 6.5.0 [ 16624 ] |
Sprint | KV Sprint 2019-12 [ 939 ] |
Rank | Ranked higher |
Due Date | 10/Dec/19 |
Is this a Regression? | Unknown [ 10452 ] | Yes [ 10450 ] |
Triage | Untriaged [ 10351 ] | Triaged [ 10350 ] |
Fix Version/s | Mad-Hatter [ 15037 ] | |
Resolution | Duplicate [ 3 ] | |
Status | Open [ 1 ] | Resolved [ 5 ] |
Status | Resolved [ 5 ] | Closed [ 6 ] |
The earilist minidump in the logs is 29e7f66c-7f2c-8814-34d3a1da-481d982d.dmp on node .174, from:
2019-12-08T02:18:05.289332+00:00 CRITICAL Breakpad caught a crash (Couchbase version 6.5.1-6001). Writing crash dump to /opt/couchbase/var/lib/couchbase/crash/29e7f66c-7f2c-8814-34d3a1da-481d982d.dmp before terminating.
Backtrace of crashing thread:
(gdb) bt
#0 0x00007f2d6f4cf428 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
#1 0x00007f2d6f4d102a in __GI_abort () at abort.c:89
#2 0x00007f2d70030ced in __gnu_cxx::__verbose_terminate_handler () at /tmp/deploy/objdir/../gcc-7.3.0/libstdc++-v3/libsupc++/vterminate.cc:50
#3 0x0000000000545f12 in backtrace_terminate_handler () at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/utilities/terminate_handler.cc:86
#4 0x00007f2d7002eb16 in __cxxabiv1::__terminate (handler=<optimized out>) at /tmp/deploy/objdir/../gcc-7.3.0/libstdc++-v3/libsupc++/eh_terminate.cc:47
#5 0x00007f2d7002eb61 in std::terminate () at /tmp/deploy/objdir/../gcc-7.3.0/libstdc++-v3/libsupc++/eh_terminate.cc:57
#6 0x00007f2d7002eda3 in __cxxabiv1::__cxa_throw (obj=obj@entry=0x7f2d1c000b40, tinfo=tinfo@entry=0x7f2d6af9d010 <typeinfo for gsl::fail_fast>, dest=dest@entry=0x7f2d6aadebf0 <gsl::fail_fast::~fail_fast()>)
at /tmp/deploy/objdir/../gcc-7.3.0/libstdc++-v3/libsupc++/eh_throw.cc:93
#7 0x00007f2d6ab57dd8 in gsl::fail_fast_assert (cond=<optimized out>, message=0x7f2d6acfa710 "GSL: Precondition failure at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/ep_bucket.cc: 620")
at /home/couchbase/jenkins/workspace/couchbase-server-unix/third_party/gsl-lite/include/gsl/gsl-lite.h:473
#8 0x00007f2d6ab6b39f in EPBucket::flushVBucket (this=0x7f2d6bdef300, vbid=...) at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/ep_bucket.cc:620
#9 0x00007f2d6abbc6e0 in Flusher::flushVB (this=this@entry=0x7f2d6bf11000) at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/flusher.cc:306
#10 0x00007f2d6abbd5b1 in Flusher::step (this=0x7f2d6bf11000, task=0x7f2d6ef6d970) at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/flusher.cc:215
#11 0x00007f2d6abb7154 in ExecutorThread::run (this=0x7f2d6be547a0) at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/executorthread.cc:190
#12 0x00007f2d71e4ce27 in CouchbaseThread::run (this=0x7f2d3fd123f0) at /home/couchbase/jenkins/workspace/couchbase-server-unix/platform/src/cb_pthreads.cc:58
#13 platform_thread_wrap (arg=0x7f2d3fd123f0) at /home/couchbase/jenkins/workspace/couchbase-server-unix/platform/src/cb_pthreads.cc:71
#14 0x00007f2d6f86b6ba in start_thread (arg=0x7f2d3cdfa700) at pthread_create.c:333
#15 0x00007f2d6f5a141d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
Frame #8 has most of the interesting state - we crashed when attempting to flush vb:134 (replica). hcs is not available (optimized out), but vbstate.persistedCompletedSeqno is available and is equal to 9; so hcs must have been set and had value 9 or lower:
#8 0x00007f2d6ab6b39f in EPBucket::flushVBucket (this=0x7f2d6bdef300, vbid=...) at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/ep_bucket.cc:620
rlh = {lock_ = 0x7f2d3fea62a0, token_ = {type_ = folly::SharedMutexToken::Type::INLINE_SHARED, slot_ = 0}}
options = VBStatePersist::VBSTATE_CACHE_UPDATE_ONLY
prev = <optimized out>
maxSeqno = 36
maxVbStateOpCas = <optimized out>
minSeqno = 35
mustCheckpointVBState = <optimized out>
collectionFlush = ...
flush_end = <optimized out>
persistedVbState = <optimized out>
hcs = <optimized out>
hps = <optimized out>
maxVisibleSeqno = <optimized out>
trans_time = <optimized out>
toFlush = {items = ...,
ranges = ...,
moreAvailable = false,
maxDeletedRevSeqno = {<boost::optional_detail::tc_optional_base<unsigned long>> = {<boost::optional_detail::optional_tag> = {<No data fields>},
m_initialized = false, m_storage = 0}, <No data fields>},
checkpointType = CheckpointType::Disk}
range = {<boost::optional_detail::optional_base<snapshot_range_t>> = {<boost::optional_detail::optional_tag> = {<No data fields>}, m_initialized = true, m_storage = {dummy_ = {data = "$\000\000\000\000\000\000\000$\000\000\000\000\000\000",
aligner_ = {<No data fields>}}}}, <No data fields>}
rwUnderlying = 0x7f2d6ef46800
vbstate = {static CurrentVersion = 3,
state = vbucket_state_dead,
maxDeletedSeqno = {counter = {_M_elems = "\000\000\000\000\000"}},
highSeqno = 10,
purgeSeqno = 0,
lastSnapStart = 36,
lastSnapEnd = 36,
maxCas = 1575771481264816128,
hlcCasEpochSeqno = 1,
mightContainXattrs = false,
supportsNamespaces = true,
version = 3,
persistedCompletedSeqno = 9,
persistedPreparedSeqno = 9,
highPreparedSeqno = 36,
maxVisibleSeqno = 10,
onDiskPrepares = 1,
checkpointType = CheckpointType::Disk,
transition = {failovers = {...},
replicationTopology = {...},
state = vbucket_state_replica}
}
vbstateRollback = {static CurrentVersion = 3,
state = vbucket_state_dead,
maxDeletedSeqno = {counter = {_M_elems = "\000\000\000\000\000"}},
highSeqno = 10,
lastSnapStart = 10,
lastSnapEnd = 10,
maxCas = 1575771443439534080,
hlcCasEpochSeqno = 1,
mightContainXattrs = false,
supportsNamespaces = true,
version = 3,
persistedCompletedSeqno = 9,
persistedPreparedSeqno = 9,
highPreparedSeqno = 9,
maxVisibleSeqno = 10,
onDiskPrepares = 1,
checkpointType = CheckpointType::Memory,
transition = {failovers = ...,
replicationTopology = {...},
state = vbucket_state_replica}}
items = ...
items_flushed = <optimized out>
moreAvailable = false
flush_start = {__d = {__r = 28865831823}}
vb = ...
From the logs, the history of vb:134 is as follows:
2019-12-08T02:18:03.505480+00:00 INFO (default) VBucket: created vb:134 with state:replica initialState:replica lastSeqno:10 persistedRange:{10,10} max_cas:1575771443439534080 uuid:27948380356516 topology:null
2019-12-08T02:18:03.790250+00:00 INFO (default) EPVBucket::loadOutstandingPrepares: (vb:134) created DM with PCS:9, PPS:9, HPS:9, number of prepares loaded:0
2019-12-08T02:18:03.810857+00:00 DEBUG (default) Mapped new vb:134 in state replica
2019-12-08T02:18:03.899025+00:00 DEBUG (default) CouchKVStore::snapshotVBucket: Snapshotted vb:134 state:{"checkpoint_type":"Memory","completed_seqno":"9","failover_table":[{"id":27948380356516,"seq":0}],"high_prepared_seqno":"9","high_seqno":"10","hlc_epoch":"1","max_cas":"1575771443439534080","max_deleted_seqno":"0","max_visible_seqno":"10","might_contain_xattrs":false,"namespaces_supported":true,"on_disk_prepares":"1","prepared_seqno":"9","purge_seqno":"0","snap_end":"10","snap_start":"10","state":"replica","version":3}
2019-12-08T02:18:04.775912+00:00 INFO 64: (default) DCP (Consumer) eq_dcpq:replication:ns_1@172.28.128.15->ns_1@172.28.128.174:default - (vb:134) Attempting to add stream: opaque_:52, start_seqno_:10, end_seqno_:18446744073709551615, vb_uuid:27948380356516, snap_start_seqno_:10, snap_end_seqno_:10, last_seqno:10, stream_req_value:{"uid":"0"}
2019-12-08T02:18:04.963678+00:00 DEBUG 64: (default) DCP (Consumer) eq_dcpq:replication:ns_1@172.28.128.15->ns_1@172.28.128.174:default - (vb:134) Add stream for opaque 52 with error code succeeded
2019-12-08T02:18:04.963706+00:00 DEBUG 64: (default) DCP (Consumer) eq_dcpq:replication:ns_1@172.28.128.15->ns_1@172.28.128.174:default - PassiveStream::transitionState: (vb:134) Transitioning from pending to awaiting-first-snapshot-marker
2019-12-08T02:18:04.963944+00:00 DEBUG 64: (default) DCP (Consumer) eq_dcpq:replication:ns_1@172.28.128.15->ns_1@172.28.128.174:default - PassiveStream::transitionState: (vb:134) Transitioning from awaiting-first-snapshot-marker to reading
# Expects() in flushVBucket() fails 2s later:
2019-12-08T02:18:05.071273+00:00 CRITICAL Caught unhandled std::exception-derived exception. what(): GSL: Precondition failure at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/ep_bucket.cc: 620