Arunkumar Senthilnathan (Inactive)
added a comment - Thuan Nguyen in that case, please see if there are any other open bugs in MH with the same issue - if not, and this is consistent, please reopen and mark it as critical for dev triage
Daniel Owen
added a comment (edited) - Seeing the following error with build 6.5.0-4922 on 172.23.121.226 (see 172.23.121.226_20191205-022439):
GSL: Precondition failure at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/dcp/active_stream.cc: 1184
The Expects that is failing is as follows:
if (sendHCS) {
    Expects(hcsToSend.is_initialized()); // <<<<< the Expects that is failing
    log(spdlog::level::level_enum::info,
        "{} ActiveStream::snapshot: Sending disk snapshot with start "
        "seqno {}, end seqno {}, and"
        " high completed seqno {}",
        logPrefix,
        snapStart,
        snapEnd,
        hcsToSend);
}
This looks like the fix in MB-37103 is not sufficient.
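For context on the failure mechanism: Expects is gsl-lite's precondition macro, and this build configures gsl-lite to throw gsl::fail_fast on a violated contract; when that exception escapes a background task thread it reaches std::terminate() and Breakpad writes a crash dump, which is the chain visible in the backtrace in the next comment. A minimal, self-contained sketch of just that mechanism (snapshotSketch and its parameters are stand-ins for the locals seen in the backtrace, not the real code path):

#include <gsl/gsl-lite.h> // header path as it appears in the backtrace
#include <boost/optional.hpp>
#include <cstdint>

// Stand-in for the failing branch of ActiveStream::snapshot.
void snapshotSketch(bool sendHCS, boost::optional<uint64_t> hcsToSend) {
    if (sendHCS) {
        // With gsl-lite configured to throw on contract violation, this
        // raises gsl::fail_fast when hcsToSend is boost::none; uncaught on
        // a task thread, it ends in std::terminate() and a Breakpad dump.
        Expects(hcsToSend.is_initialized());
    }
}

int main() {
    snapshotSketch(/*sendHCS=*/true, /*hcsToSend=*/boost::none);
}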
Dave Rigby
added a comment (edited) - Node .226 (which experienced the crash) hit an ERROR earlier, when processing a DCP_SEQNO_ACKNOWLEDGED before the topology was set up, resulting in the replication stream to .225 being closed:
2019-12-04T17:57:52.233463-08:00 ERROR 56: exception occurred in runloop during packet execution. Cookie info: [{"aiostat":"success","connection":"[ 172.23.121.225:47004 - 172.23.121.226:11209 (<ud>@ns_server</ud>) ]","engine_storage":"0x00007fca9c06a010","ewouldblock":false,"packet":{"bodylen":8,"cas":0,"datatype":"raw","extlen":8,"key":"<ud></ud>","keylen":0,"magic":"ClientRequest","opaque":1,"opcode":"DCP_SEQNO_ACKNOWLEDGED","vbucket":854},"refcount":1}] - closing connection ([ 172.23.121.225:47004 - 172.23.121.226:11209 (<ud>@ns_server</ud>) ]): ActiveDurabilityMonitor::State::processSeqnoAck vb:854 FirstChain not set
2019-12-04T17:57:52.233542-08:00 INFO 56: (No Engine) DCP (Producer) eq_dcpq:replication:ns_1@172.23.121.226->ns_1@172.23.121.225:default - Removing connection [ 172.23.121.225:47004 - 172.23.121.226:11209 (<ud>@ns_server</ud>) ]
2019-12-04T17:57:52.233568-08:00 WARNING 56: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.23.121.226->ns_1@172.23.121.225:default - (vb:854) Stream closing, sent until seqno 2 remaining items 0, reason: The stream closed early because the conn was disconnected
... all other streams on this connection also close ...
...
# ns_server sets topology on all vBuckets, including vb:854 which encountered the error above:
2019-12-04T17:57:56.845170-08:00 INFO (default) VBucket::setState: transitioning vb:854 with high seqno:2 from:active to:active meta:{"topology":[["ns_1@172.23.121.226","ns_1@172.23.121.225"]]}
I think this is relatively benign, but we should investigate it - what I believe should happen is that the SEQNO_ACK message is queued until the topology is set.
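A minimal sketch of that deferral, assuming a simple buffer that is replayed once the topology arrives; the names here (PendingAcks, onSeqnoAck, onTopologySet, processAck) are hypothetical, not the actual ActiveDurabilityMonitor interface:

#include <cstdint>
#include <string>
#include <vector>

// Hypothetical: buffer SEQNO_ACKs that arrive before ns_server has set the
// replication topology, and replay them once it is set, instead of throwing
// "FirstChain not set" and tearing down the connection.
class PendingAcks {
public:
    // Called for each DCP_SEQNO_ACKNOWLEDGED; returns true if processed now.
    bool onSeqnoAck(const std::string& replica, uint64_t seqno) {
        if (!topologySet) {
            queued.push_back({replica, seqno}); // defer instead of erroring
            return false;
        }
        processAck(replica, seqno);
        return true;
    }

    // Called when ns_server sets the topology; drain the deferred acks.
    void onTopologySet() {
        topologySet = true;
        for (const auto& ack : queued) {
            processAck(ack.replica, ack.seqno);
        }
        queued.clear();
    }

private:
    struct Ack {
        std::string replica;
        uint64_t seqno;
    };
    void processAck(const std::string&, uint64_t) { /* forward to the ADM */ }

    bool topologySet = false;
    std::vector<Ack> queued;
};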
—
Back to the crash, it occurs when trying to send a DCP snapshot marker from node .226 to node .224 for vb:334:
2019-12-04T17:58:45.836499-08:00 INFO 62: Client 172.23.121.224:45433 authenticated as <ud>@ns_server</ud>
2019-12-04T17:58:45.842166-08:00 INFO 62: HELO [proxy] XATTR, Snappy, JSON [ 172.23.121.224:45433 - 172.23.121.226:11209 (<ud>@ns_server</ud>) ]
2019-12-04T17:58:45.845365-08:00 INFO 62: DCP connection opened successfully. PRODUCER, INCLUDE_XATTRS, DELETE_TIMES [ 172.23.121.224:45433 - 172.23.121.226:11209 (<ud>@ns_server</ud>) ]
2019-12-04T17:58:45.881224-08:00 INFO 62: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.23.121.226->ns_1@172.23.121.224:default - (vb:334) Creating stream with start seqno 0 and end seqno 18446744073709551615; requested end seqno was 18446744073709551615, collections-manifest uid:none, sid:none
2019-12-04T17:58:45.881333-08:00 INFO 62: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.23.121.226->ns_1@172.23.121.224:default - (vb:334) ActiveStream::scheduleBackfill_UNLOCKED register cursor with name "eq_dcpq:replication:ns_1@172.23.121.226->ns_1@172.23.121.224:default" backfill:true, seqno:1
2019-12-04T17:58:45.881356-08:00 INFO 62: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.23.121.226->ns_1@172.23.121.224:default - (vb:334) ActiveStream::transitionState: Transitioning from backfilling to in-memory
2019-12-04T17:58:45.881753-08:00 CRITICAL *** Fatal error encountered during exception handling ***
2019-12-04T17:58:45.881817-08:00 CRITICAL Caught unhandled std::exception-derived exception. what(): GSL: Precondition failure at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/dcp/active_stream.cc: 1184
2019-12-04T17:58:46.007109-08:00 CRITICAL Breakpad caught a crash (Couchbase version 6.5.0-4922). Writing crash dump to /opt/couchbase/var/lib/couchbase/crash/579b9573-ae2a-e952-58b0a7a6-726024ee.dmp before terminating.
2019-12-04T17:58:46.007138-08:00 CRITICAL Stack backtrace of crashed thread:
2019-12-04T17:58:46.007421-08:00 CRITICAL /opt/couchbase/bin/memcached() [0x400000+0x13243d]
2019-12-04T17:58:46.007452-08:00 CRITICAL /opt/couchbase/bin/memcached(_ZN15google_breakpad16ExceptionHandler12GenerateDumpEPNS0_12CrashContextE+0x3ce) [0x400000+0x14a47e]
2019-12-04T17:58:46.007473-08:00 CRITICAL /opt/couchbase/bin/memcached(_ZN15google_breakpad16ExceptionHandler13SignalHandlerEiP9siginfo_tPv+0x94) [0x400000+0x14a794]
2019-12-04T17:58:46.007489-08:00 CRITICAL /lib64/libpthread.so.0() [0x7fca9d7e6000+0xf5e0]
2019-12-04T17:58:46.007538-08:00 CRITICAL /lib64/libc.so.6(gsignal+0x37) [0x7fca9d423000+0x351f7]
2019-12-04T17:58:46.007586-08:00 CRITICAL /lib64/libc.so.6(abort+0x148) [0x7fca9d423000+0x368e8]
2019-12-04T17:58:46.007662-08:00 CRITICAL /opt/couchbase/bin/../lib/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x125) [0x7fca9df1b000+0x91195]
2019-12-04T17:58:46.007686-08:00 CRITICAL /opt/couchbase/bin/memcached() [0x400000+0x145d72]
2019-12-04T17:58:46.007757-08:00 CRITICAL /opt/couchbase/bin/../lib/libstdc++.so.6() [0x7fca9df1b000+0x8ef86]
2019-12-04T17:58:46.007828-08:00 CRITICAL /opt/couchbase/bin/../lib/libstdc++.so.6() [0x7fca9df1b000+0x8efd1]
2019-12-04T17:58:46.007892-08:00 CRITICAL /opt/couchbase/bin/../lib/libstdc++.so.6() [0x7fca9df1b000+0x8f213]
2019-12-04T17:58:46.007918-08:00 CRITICAL /opt/couchbase/bin/../lib/../lib/ep.so() [0x7fca98a88000+0x74098]
2019-12-04T17:58:46.007937-08:00 CRITICAL /opt/couchbase/bin/../lib/../lib/ep.so() [0x7fca98a88000+0x77434]
2019-12-04T17:58:46.007956-08:00 CRITICAL /opt/couchbase/bin/../lib/../lib/ep.so() [0x7fca98a88000+0x77843]
2019-12-04T17:58:46.007975-08:00 CRITICAL /opt/couchbase/bin/../lib/../lib/ep.so() [0x7fca98a88000+0x77924]
2019-12-04T17:58:46.007994-08:00 CRITICAL /opt/couchbase/bin/../lib/../lib/ep.so() [0x7fca98a88000+0x809f9]
2019-12-04T17:58:46.008015-08:00 CRITICAL /opt/couchbase/bin/../lib/../lib/ep.so() [0x7fca98a88000+0x12f964]
2019-12-04T17:58:46.008025-08:00 CRITICAL /opt/couchbase/bin/../lib/libplatform_so.so.0.1.0() [0x7fca9fdc4000+0x8ee7]
2019-12-04T17:58:46.008038-08:00 CRITICAL /lib64/libpthread.so.0() [0x7fca9d7e6000+0x7e25]
2019-12-04T17:58:46.008100-08:00 CRITICAL /lib64/libc.so.6(clone+0x6d) [0x7fca9d423000+0xf834d]
Full backtrace from minidump via gdb:
(gdb) bt full
#0 0x00007fca9d4581f7 in raise () from /usr/lib64/libc-2.17.so
No symbol table info available.
#1 0x00007fca9d4598e8 in abort () from /usr/lib64/libc-2.17.so
No symbol table info available.
#2 0x00007fca9dfac195 in __gnu_cxx::__verbose_terminate_handler () at /tmp/deploy/gcc-7.3.0/libstdc++-v3/libsupc++/vterminate.cc:95
terminating = false
t = <optimized out>
#3 0x0000000000545d72 in backtrace_terminate_handler () at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/utilities/terminate_handler.cc:86
No locals.
#4 0x00007fca9dfa9f86 in __cxxabiv1::__terminate (handler=<optimized out>) at /tmp/deploy/gcc-7.3.0/libstdc++-v3/libsupc++/eh_terminate.cc:47
No locals.
#5 0x00007fca9dfa9fd1 in std::terminate () at /tmp/deploy/gcc-7.3.0/libstdc++-v3/libsupc++/eh_terminate.cc:57
No locals.
#6 0x00007fca9dfaa213 in __cxxabiv1::__cxa_throw (obj=obj@entry=0x7fca400009f0, tinfo=tinfo@entry=0x7fca98f9d110 <typeinfo for gsl::fail_fast>, dest=dest@entry=0x7fca98adfc90 <gsl::fail_fast::~fail_fast()>) at /tmp/deploy/gcc-7.3.0/libstdc++-v3/libsupc++/eh_throw.cc:93
globals = <optimized out>
header = 0x7fca40000970
#7 0x00007fca98afc098 in fail_fast_assert (message=0x7fca98cf0a48 "GSL: Precondition failure at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/dcp/active_stream.cc: 1184", cond=false)
at /home/couchbase/jenkins/workspace/couchbase-server-unix/third_party/gsl-lite/include/gsl/gsl-lite.h:473
No locals.
#8 ActiveStream::snapshot (this=this@entry=0x7fca68345d10, checkpointType=<optimized out>, items=..., highCompletedSeqno=...) at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/dcp/active_stream.cc:1184
flags = <optimized out>
snapEnd = 6
hcsToSend = {<boost::optional_detail::tc_optional_base<unsigned long>> = {<boost::optional_detail::optional_tag> = {<No data fields>}, m_initialized = false, m_storage = 140508178662144}, <No data fields>}
isCkptTypeDisk = <optimized out>
snapStart = 0
sendHCS = true
#9 0x00007fca98aff434 in ActiveStream::processItems (this=this@entry=0x7fca68345d10, outstandingItemsResult=..., streamMutex=...) at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/dcp/active_stream.cc:1100
mutations = {<std::_Deque_base<std::unique_ptr<DcpResponse, std::default_delete<DcpResponse> >, std::allocator<std::unique_ptr<DcpResponse, std::default_delete<DcpResponse> > > >> = {
_M_impl = {<std::allocator<std::unique_ptr<DcpResponse, std::default_delete<DcpResponse> > >> = {<__gnu_cxx::new_allocator<std::unique_ptr<DcpResponse, std::default_delete<DcpResponse> > >> = {<No data fields>}, <No data fields>}, _M_map = 0x7fca687e8540, _M_map_size = 8, _M_start = {
_M_cur = 0x7fca62ccb200, _M_first = 0x7fca62ccb200, _M_last = 0x7fca62ccb400, _M_node = 0x7fca687e8558}, _M_finish = {_M_cur = 0x7fca62ccb220, _M_first = 0x7fca62ccb200, _M_last = 0x7fca62ccb400, _M_node = 0x7fca687e8558}}}, <No data fields>}
#10 0x00007fca98aff843 in ActiveStream::nextCheckpointItemTask (this=this@entry=0x7fca68345d10, streamMutex=...) at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/dcp/active_stream.cc:867
res = {checkpointType = Disk, items = {<std::_Vector_base<SingleThreadedRCPtr<Item, Item*, std::default_delete<Item> >, std::allocator<SingleThreadedRCPtr<Item, Item*, std::default_delete<Item> > > >> = {
_M_impl = {<std::allocator<SingleThreadedRCPtr<Item, Item*, std::default_delete<Item> > >> = {<__gnu_cxx::new_allocator<SingleThreadedRCPtr<Item, Item*, std::default_delete<Item> > >> = {<No data fields>}, <No data fields>}, _M_start = 0x7fca9c01cf80, _M_finish = 0x7fca9c01cfd0,
_M_end_of_storage = 0x7fca9c01d000}}, <No data fields>}, highCompletedSeqno = {<boost::optional_detail::tc_optional_base<unsigned long>> = {<boost::optional_detail::optional_tag> = {<No data fields>}, m_initialized = false, m_storage = 140508178662144}, <No data fields>}}
vbucket = {<std::__shared_ptr<VBucket, (__gnu_cxx::_Lock_policy)2>> = {<std::__shared_ptr_access<VBucket, (__gnu_cxx::_Lock_policy)2, false, false>> = {<No data fields>}, _M_ptr = 0x7fca683d6200, _M_refcount = {_M_pi = 0x7fca921219c0}}, <No data fields>}
#11 0x00007fca98aff924 in ActiveStream::nextCheckpointItemTask (this=0x7fca68345d10) at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/dcp/active_stream.cc:857
lh = {_M_device = @0x7fca68345d78}
#12 0x00007fca98b089f9 in ActiveStreamCheckpointProcessorTask::run (this=0x7fca90365890) at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/dcp/active_stream_checkpoint_processor_task.cc:56
rh = {<StreamContainer<std::shared_ptr<Stream> >::Iterable<std::_Fwd_list_const_iterator<std::shared_ptr<Stream> > >> = {itr = {_M_node = 0x7fca62aac2e0}, endItr = {_M_node = 0x0}, before = {_M_node = <optimized out>}}, readLock = {_M_pm = 0x7fca9c06c478, _M_owns = <optimized out>},
container = <optimized out>}
streams = {<std::__shared_ptr<StreamContainer<std::shared_ptr<Stream> >, (__gnu_cxx::_Lock_policy)2>> = {<std::__shared_ptr_access<StreamContainer<std::shared_ptr<Stream> >, (__gnu_cxx::_Lock_policy)2, false, false>> = {<No data fields>}, _M_ptr = 0x7fca9c06c460, _M_refcount = {
_M_pi = 0x7fca9c06c450}}, <No data fields>}
iterations = 0
expected = 96
#13 0x00007fca98bb7964 in ExecutorThread::run (this=0x7fca9bf40a40) at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/executorthread.cc:187
curTaskDescr = {static npos = 18446744073709551615, _M_dataplus = {<std::allocator<char>> = {<__gnu_cxx::new_allocator<char>> = {<No data fields>}, <No data fields>}, _M_p = 0x7fca92183e80 <Address 0x7fca92183e80 out of bounds>}, _M_string_length = 107, {
_M_local_buf = "k\000\000\000\000\000\000\000۔\003\236\312\177\000" , _M_allocated_capacity = 107}}
woketime = <optimized out>
scheduleOverhead = <optimized out>
again = <optimized out>
runtime = <optimized out>
q = <optimized out>
tick = 64 '@'
Here we can see in frame #8 that, when preparing DCP items from the CheckpointManager (in-memory phase), we are attempting to send the high_completed_seqno (sendHCS = true) but hcsToSend is not initialised (missing):
#8 ActiveStream::snapshot (this=this@entry=0x7fca68345d10, checkpointType=<optimized out>, items=..., highCompletedSeqno=...) at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/dcp/active_stream.cc:1184
flags = <optimized out>
snapEnd = 6
hcsToSend = {<boost::optional_detail::tc_optional_base<unsigned long>> = {<boost::optional_detail::optional_tag> = {<No data fields>}, m_initialized = false, m_storage = 140508178662144}, <No data fields>}
isCkptTypeDisk = <optimized out>
snapStart = 0
sendHCS = true
Where did this vbucket (and specifically this Checkpoint, up to seqno:6) come from? vb:334 was previously sent from node .224 to node .226. From the logs of node .226 (which is running 6.5.0-4922 at this point in time):
2019-12-04T17:49:09.353854-08:00 INFO (default) VBucket: created vb:334 with state:replica initialState:dead lastSeqno:0 persistedRange:{0,0} max_cas:0 uuid:144175708859207 topology:null
2019-12-04T17:49:09.364325-08:00 INFO (default) VBucket::setState: transitioning vb:334 with high seqno:0 from:replica to:replica
2019-12-04T17:49:09.371449-08:00 INFO 73: (default) DCP (Consumer) eq_dcpq:replication:ns_1@172.23.121.224->ns_1@172.23.121.226:default - (vb:334) Attempting to add stream: opaque_:18, start_seqno_:0, end_seqno_:18446744073709551615, vb_uuid:144175708859207, snap_start_seqno_:0, snap_end_seqno_:0, last_seqno:0, stream_req_value:{"uid":"0"}
2019-12-04T17:54:15.377380-08:00 INFO (default) VBucket::setState: transitioning vb:334 with high seqno:6 from:replica to:pending
2019-12-04T17:54:15.759335-08:00 INFO 73: (default) DCP (Consumer) eq_dcpq:replication:ns_1@172.23.121.224->ns_1@172.23.121.226:default - (vb:334) Setting stream to dead state, last_seqno is 6, unAckedBytes is 0, status is The stream closed due to a close stream message
2019-12-04T17:54:15.903546-08:00 INFO 62: (default) DCP (Consumer) eq_dcpq:replication:takeover:883bb157c17a8653edc697d8a806f4f0:ns_1@172.23.121.224->ns_1@172.23.121.226:default:334 - (vb:334) Attempting to add takeover stream: opaque_:1, start_seqno_:6, end_seqno_:18446744073709551615, vb_uuid:118049523138190, snap_start_seqno_:6, snap_end_seqno_:6, last_seqno:6, stream_req_value:{"uid":"0"}
2019-12-04T17:54:15.998054-08:00 INFO (default) VBucket::setState: transitioning vb:334 with high seqno:6 from:pending to:active
Looking at the producer node's logs (.224) for the same time, we can see the producer's view of this - note the producer is running build 5.0.0-3519 at this time:
2019-12-04T17:49:09.375865-08:00 NOTICE 54: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.23.121.224->ns_1@172.23.121.226:default - (vb 334) Creating stream with start seqno 0 and end seqno 18446744073709551615; requested end seqno was 18446744073709551615
2019-12-04T17:49:09.375991-08:00 NOTICE 54: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.23.121.224->ns_1@172.23.121.226:default - (vb 334) Scheduling backfill from 1 to 6, reschedule flag : False
2019-12-04T17:49:09.376405-08:00 NOTICE 54: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.23.121.224->ns_1@172.23.121.226:default - (vb 334) Sending disk snapshot with start seqno 0 and end seqno 6
2019-12-04T17:49:09.376737-08:00 NOTICE 54: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.23.121.224->ns_1@172.23.121.226:default - (vb 334) Backfill complete, 0 items read from disk, 4 from memory, last seqno read: 6, pendingBackfill : False
2019-12-04T17:49:09.376978-08:00 NOTICE 54: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.23.121.224->ns_1@172.23.121.226:default - ActiveStream::transitionState: (vb 334) Transitioning from backfilling to in-memory
To summarise: we crash when node .226 tries to send a Disk Checkpoint without an HCS. This Checkpoint originally came from node .224 while that node was running CB 5.0.0; as such, no HCS would have been sent to .226.
The fix is to ensure that Disk Checkpoints received from non-SyncReplication-capable Producers automatically set the HCS to zero (given that before SyncReplication nothing could be Committed).
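A minimal sketch of that defaulting, assuming the consumer knows whether the producer negotiated SyncReplication; resolveHCS and its parameters are hypothetical names for illustration, not the kv_engine API:

#include <boost/optional.hpp>
#include <cstdint>

// Hypothetical: when a disk snapshot marker arrives from a producer that
// never negotiated SyncReplication (e.g. a 5.0.0 node), default the high
// completed seqno to 0 - nothing could have been Committed before
// SyncReplication existed - so the later Expects() in
// ActiveStream::snapshot holds when the checkpoint is re-streamed.
boost::optional<uint64_t> resolveHCS(bool producerSupportsSyncRepl,
                                     boost::optional<uint64_t> receivedHCS) {
    if (!producerSupportsSyncRepl) {
        return uint64_t{0};
    }
    // A SyncRepl-capable producer is expected to send a real HCS with every
    // disk snapshot, so pass through whatever was received.
    return receivedHCS;
}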
Couchbase Build Team
added a comment - Build couchbase-server-6.5.0-4927 contains kv_engine commit 9029b44 with commit message: MB-37161: Set HCS for Disk Snapshots received from non-MH Producers
Couchbase Build Team
added a comment - Build couchbase-server-6.5.1-6001 contains kv_engine commit 9029b44 with commit message: MB-37161: Set HCS for Disk Snapshots received from non-MH Producers
Dave Rigby
added a comment - Fix is incomplete - the initial fix incorrectly assumed that DCP features (including SyncReplication) are negotiated before PassiveStream objects are created.
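The follow-up fix direction is therefore to read the SyncReplication capability only after DCP control negotiation has completed, rather than capturing it when the PassiveStream is constructed. A rough sketch under that assumption, with hypothetical names (ConsumerSketch, onControlNegotiated, isSyncReplEnabled):

#include <atomic>

enum class SyncReplNegotiation { Pending, Enabled, Disabled };

// Hypothetical: a PassiveStream can be created before the consumer finishes
// negotiating features via DCP control messages, so the capability must be
// queried lazily rather than cached at stream construction.
class ConsumerSketch {
public:
    // Called once the producer has answered the sync-writes control key
    // (the exact key name is an assumption here).
    void onControlNegotiated(bool enabled) {
        syncRepl = enabled ? SyncReplNegotiation::Enabled
                           : SyncReplNegotiation::Disabled;
    }

    // Streams ask at snapshot-marker time, not at construction, so a stream
    // created mid-negotiation still sees the final answer.
    bool isSyncReplEnabled() const {
        return syncRepl.load() == SyncReplNegotiation::Enabled;
    }

private:
    std::atomic<SyncReplNegotiation> syncRepl{SyncReplNegotiation::Pending};
};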
Couchbase Build Team
added a comment - Build couchbase-server-6.5.0-4942 contains kv_engine commit cbdd619 with commit message: MB-37161: Check if SyncRepl enabled after DCP Control negoatiation complete
Couchbase Build Team
added a comment - Build couchbase-server-7.0.0-1123 contains kv_engine commit 9029b44 with commit message: MB-37161: Set HCS for Disk Snapshots received from non-MH Producers
Couchbase Build Team
added a comment - Build couchbase-server-6.5.1-6006 contains kv_engine commit cbdd619 with commit message: MB-37161: Check if SyncRepl enabled after DCP Control negoatiation complete
Couchbase Build Team
added a comment - Build couchbase-server-7.0.0-1133 contains kv_engine commit cbdd619 with commit message: MB-37161: Check if SyncRepl enabled after DCP Control negoatiation complete
Couchbase Build Team
added a comment - Build couchbase-server-6.5.1-6015 contains kv_engine commit 39d4604 with commit message: MB-37187: Add cluster_test test for MB-37161
Couchbase Build Team
added a comment - Build couchbase-server-7.0.0-1144 contains kv_engine commit 39d4604 with commit message: MB-37187: Add cluster_test test for MB-37161
Couchbase Build Team
added a comment - Build couchbase-server-6.6.0-7519 contains kv_engine commit 39d4604 with commit message: MB-37187: Add cluster_test test for MB-37161
Dupe of MB-37103