Couchbase Server / MB-37109

Rebalance fails and memcached crashes seen in Ephemeral rebalance-in/out tests

Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Duplicate
    • Affects Version/s: 6.5.0
    • Fix Version/s: None
    • Component/s: couchbase-bucket
    • Build: 6.5.0-4908

    Description

      Script to Repro

      ./testrunner -i /tmp/testexec.6198.ini -p get-cbcollect-info=False,bucket_type=ephemeral,GROUP=P1_Set2,get-cbcollect-info=True -t rebalance.rebalanceinout.RebalanceInOutTests.test_incremental_rebalance_in_out_with_mutation_and_expiration,items=100000,value_size=512,max_verify=100000,zone=2,GROUP=IN_OUT;P1;P1_Set2
      

      Test to Repro

      Rebalances nodes into and out of the cluster while performing mutations
      and expirations. Use the 'zone' param (zone > 1) to divide the nodes
      into server groups.

      This test begins by loading a given number of items into the cluster.
      It then adds one node, rebalances that node into the cluster, and then
      rebalances it back out. During the rebalancing we update half of the
      items in the cluster and expire the other half. Once the node has been
      removed and added back, we recreate the expired items, wait for the
      disk queues to drain, and then verify that there has been no data loss
      and that sum(curr_items) matches curr_items_total. We then remove and
      add back two nodes at a time, and so on, until we have reached the
      point where we are adding back and removing at least half of the nodes.
      

      Rebalance failure

      2019-12-01 22:08:23 | ERROR | MainProcess | Cluster_Thread | [rest_client._rebalance_status_and_progress] {u'status': u'none', u'errorMessage': u'Rebalance failed. See logs for detailed reason. You can try again.'} - rebalance failed
      2019-12-01 22:08:23 | INFO | MainProcess | Cluster_Thread | [rest_client.print_UI_logs] Latest logs from UI on 172.23.104.211:
      2019-12-01 22:08:23 | ERROR | MainProcess | Cluster_Thread | [rest_client.print_UI_logs] {u'node': u'ns_1@172.23.104.216', u'code': 0, u'text': u'Bucket "default" loaded on node \'ns_1@172.23.104.216\' in 0 seconds.', u'shortText': u'message', u'serverTime': u'2019-12-01T22:08:20.343Z', u'module': u'ns_memcached', u'tstamp': 1575266900343, u'type': u'info'}
      2019-12-01 22:08:23 | ERROR | MainProcess | Cluster_Thread | [rest_client.print_UI_logs] {u'node': u'ns_1@172.23.104.216', u'code': 0, u'text': u"Control connection to memcached on 'ns_1@172.23.104.216' disconnected. Check logs for details.", u'shortText': u'message', u'serverTime': u'2019-12-01T22:08:19.303Z', u'module': u'ns_memcached', u'tstamp': 1575266899303, u'type': u'info'}
      2019-12-01 22:08:23 | ERROR | MainProcess | Cluster_Thread | [rest_client.print_UI_logs] {u'node': u'ns_1@172.23.104.216', u'code': 0, u'text': u"Service 'memcached' exited with status 134. Restarting. Messages:\n2019-12-01T22:08:19.277234-08:00 CRITICAL     /opt/couchbase/bin/../lib/libstdc++.so.6() [0x7f6273aac000+0x8f213]\n2019-12-01T22:08:19.277277-08:00 CRITICAL     /opt/couchbase/bin/../lib/../lib/ep.so() [0x7f626e488000+0x74098]\n2019-12-01T22:08:19.277296-08:00 CRITICAL     /opt/couchbase/bin/../lib/../lib/ep.so() [0x7f626e488000+0x77434]\n2019-12-01T22:08:19.277314-08:00 CRITICAL     /opt/couchbase/bin/../lib/../lib/ep.so() [0x7f626e488000+0x77843]\n2019-12-01T22:08:19.277334-08:00 CRITICAL     /opt/couchbase/bin/../lib/../lib/ep.so() [0x7f626e488000+0x77924]\n2019-12-01T22:08:19.277352-08:00 CRITICAL     /opt/couchbase/bin/../lib/../lib/ep.so() [0x7f626e488000+0x809f9]\n2019-12-01T22:08:19.277373-08:00 CRITICAL     /opt/couchbase/bin/../lib/../lib/ep.so() [0x7f626e488000+0x12f964]\n2019-12-01T22:08:19.277385-08:00 CRITICAL     /opt/couchbase/bin/../lib/libplatform_so.so.0.1.0() [0x7f6275955000+0x8ee7]\n2019-12-01T22:08:19.277401-08:00 CRITICAL     /lib64/libpthread.so.0() [0x7f6273377000+0x7dd5]\n2019-12-01T22:08:19.277475-08:00 CRITICAL     /lib64/libc.so.6(clone+0x6d) [0x7f6272faa000+0xfdead]", u'shortText': u'message', u'serverTime': u'2019-12-01T22:08:19.297Z', u'module': u'ns_log', u'tstamp': 1575266899297, u'type': u'info'}
      2019-12-01 22:08:23 | ERROR | MainProcess | Cluster_Thread | [rest_client.print_UI_logs] {u'node': u'ns_1@172.23.104.211', u'code': 0, u'text': u'auto-reprovision is disabled as maximum number of nodes (1) that can be auto-reprovisioned has been reached.', u'shortText': u'message', u'serverTime': u'2019-12-01T22:08:18.669Z', u'module': u'auto_reprovision', u'tstamp': 1575266898669, u'type': u'info'}
      2019-12-01 22:08:23 | ERROR | MainProcess | Cluster_Thread | [rest_client.print_UI_logs] {u'node': u'ns_1@172.23.104.211', u'code': 0, u'text': u'Bucket "default" has been reprovisioned on following nodes: [\'ns_1@172.23.104.220\']. Nodes on which the data service restarted: [\'ns_1@172.23.104.220\',\n                                                                                                                                 \'ns_1@172.23.104.243\'].', u'shortText': u'message', u'serverTime': u'2019-12-01T22:08:18.668Z', u'module': u'auto_reprovision', u'tstamp': 1575266898668, u'type': u'info'}
      

      Backtrace from gdb

      (gdb) bt
      #0  0x00007f6272fe0207 in __gconv_transform_internal_ucs2reverse () from /usr/lib64/libc-2.17.so
      #1  0x0000000000000006 in ?? ()
      #2  0x00007f6273025dc3 in wprintf () from /usr/lib64/libc-2.17.so
      #3  0x0000000000000001 in ?? ()
      #4  0x0000000a3affb1f0 in ?? ()
      #5  0x000000020000000e in ?? ()
      #6  0x00007f623affd600 in ?? ()
      #7  0x00007f623affb190 in ?? ()
      #8  0x00007f6271b5f400 in ?? ()
      #9  0x0000000000000068 in ?? ()
      #10 0x000000003affd600 in ?? ()
      #11 0x00007f623affb230 in ?? ()
      #12 0x00007f623affbe20 in ?? ()
      #13 0x0000000000000068 in ?? ()
      #14 0x00007f6272a00980 in ?? ()
      #15 0x00007f6274e5fd58 in tcache_alloc_small (slow_path=false, zero=false, binind=10, size=0, tcache=0x7f62730258ce <putwc_unlocked+30>, arena=<optimized out>, tsd=<optimized out>) at include/jemalloc/internal/tcache_inlines.h:60
      #16 arena_malloc (slow_path=false, tcache=0x7f62730258ce <putwc_unlocked+30>, zero=false, ind=10, size=0, arena=0x0, tsdn=<optimized out>) at include/jemalloc/internal/arena_inlines_b.h:165
      #17 iallocztm (slow_path=false, arena=0x0, is_internal=false, tcache=0x7f62730258ce <putwc_unlocked+30>, zero=false, ind=10, size=0, tsdn=<optimized out>) at include/jemalloc/internal/jemalloc_internal_inlines_c.h:53
      #18 imalloc_no_sample (ind=10, usize=0, size=0, tsd=0x7f627336d3a0 <_IO_obstack_jumps+128>, dopts=<synthetic pointer>, sopts=<synthetic pointer>) at src/jemalloc.c:1949
      #19 imalloc_body (tsd=0x7f627336d3a0 <_IO_obstack_jumps+128>, dopts=<synthetic pointer>, sopts=<synthetic pointer>) at src/jemalloc.c:2123
      #20 imalloc (dopts=<synthetic pointer>, sopts=<synthetic pointer>) at src/jemalloc.c:2258
      #21 je_malloc_default (size=<optimized out>) at src/jemalloc.c:2289
      #22 0x00007f627596043c in cb_malloc (size=0) at /home/couchbase/jenkins/workspace/couchbase-server-unix/platform/src/cb_malloc.cc:51
      #23 0x00007f6276a000b9 in operator new (count=<optimized out>) at /home/couchbase/jenkins/workspace/couchbase-server-unix/platform/src/global_new_replacement.cc:71
      #24 0x00007f626e4faf71 in MutationResponse (sid=..., enableExpiryOut=Yes, includeCollectionID=(unknown: 32), includeDeleteTime=(unknown: 162), includeXattrs=Yes, includeVal=Yes, opaque=2, item=..., this=0x7f6238814c10)
          at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/dcp/response.h:429
      #25 make_unique<MutationResponse, SingleThreadedRCPtr<Item, Item*, std::default_delete<Item> > const&, unsigned int const&, IncludeValue const&, IncludeXattrs const&, IncludeDeleteTime const&, DocKeyEncodesCollectionId const&, EnableExpiryOutput const&, cb::mcbp::DcpStreamId const&> () at /usr/local/include/c++/7.3.0/bits/unique_ptr.h:825
      #26 ActiveStream::makeResponseFromItem (this=<optimized out>, item=..., sendCommitSyncWriteAs=<optimized out>) at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/dcp/active_stream.cc:1029
      #27 0x00007f626e4ff434 in ActiveStream::processItems (this=0x7f623affb3b0, this@entry=0x7f6238814c10, outstandingItemsResult=..., streamMutex=...)
          at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/dcp/active_stream.cc:1101
      #28 0x00007f626e4ff843 in ActiveStream::nextCheckpointItemTask (this=this@entry=0x7f6238814c10, streamMutex=...) at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/dcp/active_stream.cc:868
      #29 0x00007f626e4ff924 in ActiveStream::nextCheckpointItemTask (this=0x7f6238814c10) at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/dcp/active_stream.cc:858
      #30 0x00007f626e5089f9 in ActiveStreamCheckpointProcessorTask::run (this=0x7f6238819110) at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/dcp/active_stream_checkpoint_processor_task.cc:56
      #31 0x00007f626e5b7964 in ExecutorThread::run (this=0x7f6271b97960) at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/executorthread.cc:187
      #32 0x00007f627595dee7 in run (this=0x7f6271a6e670) at /home/couchbase/jenkins/workspace/couchbase-server-unix/platform/src/cb_pthreads.cc:58
      #33 platform_thread_wrap (arg=0x7f6271a6e670) at /home/couchbase/jenkins/workspace/couchbase-server-unix/platform/src/cb_pthreads.cc:71
      #34 0x00007f627337edd5 in start_thread () from /usr/lib64/libpthread-2.17.so
      #35 0x00007f62730a7ead in tdestroy_recurse () from /usr/lib64/libc-2.17.so
      #36 0x0000000000000000 in ?? ()
      (gdb) 
      

      cbcollect_info attached.
      Last successful run was on 6.5.0-4897.

          Activity

            drigby Dave Rigby added a comment -

            First Critical error message seen on .243:

            2019-12-01T22:08:14.137299-08:00 CRITICAL *** Fatal error encountered during exception handling ***
            2019-12-01T22:08:14.137371-08:00 CRITICAL Caught unhandled std::exception-derived exception. what(): GSL: Precondition failure at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/dcp/active_stream.cc: 1185
            

            Which results in the process terminating:

            2019-12-01T22:08:14.203342-08:00 CRITICAL Breakpad caught a crash (Couchbase version 6.5.0-4908). Writing crash dump to /opt/couchbase/var/lib/couchbase/crash/6a82f3ec-1a47-1e49-28df0b3e-4ea8477a.dmp before terminating.
            2019-12-01T22:08:14.203385-08:00 CRITICAL Stack backtrace of crashed thread:
            2019-12-01T22:08:14.203636-08:00 CRITICAL     /opt/couchbase/bin/memcached() [0x400000+0x13243d]
            2019-12-01T22:08:14.203664-08:00 CRITICAL     /opt/couchbase/bin/memcached(_ZN15google_breakpad16ExceptionHandler12GenerateDumpEPNS0_12CrashContextE+0x3ce) [0x400000+0x14a47e]
            2019-12-01T22:08:14.203675-08:00 CRITICAL     /opt/couchbase/bin/memcached(_ZN15google_breakpad16ExceptionHandler13SignalHandlerEiP9siginfo_tPv+0x94) [0x400000+0x14a794]
            2019-12-01T22:08:14.203696-08:00 CRITICAL     /lib64/libpthread.so.0() [0x7f573267a000+0xf5d0]
            2019-12-01T22:08:14.203725-08:00 CRITICAL     /lib64/libc.so.6(gsignal+0x37) [0x7f57322ad000+0x36207]
            2019-12-01T22:08:14.203750-08:00 CRITICAL     /lib64/libc.so.6(abort+0x148) [0x7f57322ad000+0x378f8]
            2019-12-01T22:08:14.203799-08:00 CRITICAL     /opt/couchbase/bin/../lib/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x125) [0x7f5732daf000+0x91195]
            2019-12-01T22:08:14.203821-08:00 CRITICAL     /opt/couchbase/bin/memcached() [0x400000+0x145d72]
            2019-12-01T22:08:14.203848-08:00 CRITICAL     /opt/couchbase/bin/../lib/libstdc++.so.6() [0x7f5732daf000+0x8ef86]
            2019-12-01T22:08:14.203885-08:00 CRITICAL     /opt/couchbase/bin/../lib/libstdc++.so.6() [0x7f5732daf000+0x8efd1]
            2019-12-01T22:08:14.203911-08:00 CRITICAL     /opt/couchbase/bin/../lib/libstdc++.so.6() [0x7f5732daf000+0x8f213]
            2019-12-01T22:08:14.203937-08:00 CRITICAL     /opt/couchbase/bin/../lib/../lib/ep.so() [0x7f572d888000+0x74098]
            2019-12-01T22:08:14.203946-08:00 CRITICAL     /opt/couchbase/bin/../lib/../lib/ep.so() [0x7f572d888000+0x77434]
            2019-12-01T22:08:14.203964-08:00 CRITICAL     /opt/couchbase/bin/../lib/../lib/ep.so() [0x7f572d888000+0x77843]
            2019-12-01T22:08:14.203972-08:00 CRITICAL     /opt/couchbase/bin/../lib/../lib/ep.so() [0x7f572d888000+0x77924]
            2019-12-01T22:08:14.203978-08:00 CRITICAL     /opt/couchbase/bin/../lib/../lib/ep.so() [0x7f572d888000+0x809f9]
            2019-12-01T22:08:14.203988-08:00 CRITICAL     /opt/couchbase/bin/../lib/../lib/ep.so() [0x7f572d888000+0x12f964]
            2019-12-01T22:08:14.203996-08:00 CRITICAL     /opt/couchbase/bin/../lib/libplatform_so.so.0.1.0() [0x7f5734c58000+0x8ee7]
            2019-12-01T22:08:14.204006-08:00 CRITICAL     /lib64/libpthread.so.0() [0x7f573267a000+0x7dd5]
            2019-12-01T22:08:14.204181-08:00 CRITICAL     /lib64/libc.so.6(clone+0x6d) [0x7f57322ad000+0xfdead]
            

            The aforementioned precondition failure is in ActiveStream::snapshot:

                    // If the stream supports SyncRep then send the HCS for CktpType::disk
                    const auto sendHCS = supportSyncReplication() && isCkptTypeDisk;
                    const auto hcsToSend = sendHCS ? highCompletedSeqno : boost::none;
                    if (sendHCS) {
                        Expects(hcsToSend.is_initialized());
                        log(spdlog::level::level_enum::info,
                            "{} ActiveStream::snapshot: Sending disk snapshot with start "
                            "seqno {}, end seqno {}, and"
                            " high completed seqno {}",
                            logPrefix,
                            snapStart,
                            snapEnd,
                            hcsToSend);
                    }
            
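            To make the failure mode concrete, the following is a minimal,
            self-contained sketch (not the actual kv_engine code; Expects and
            fail_fast here are simplified stand-ins for the gsl-lite versions
            the server uses) showing how this precondition fires when a
            SyncRep-capable stream handles a disk checkpoint whose HCS was
            never initialised:

            #include <boost/optional.hpp>
            #include <cstdint>
            #include <stdexcept>

            // Simplified stand-ins for gsl-lite's fail_fast / Expects.
            struct fail_fast : std::logic_error {
                using std::logic_error::logic_error;
            };

            static void Expects(bool cond) {
                if (!cond) {
                    throw fail_fast("GSL: Precondition failure");
                }
            }

            // Mirrors the quoted logic: the decision to send the HCS ignores
            // whether the optional actually holds a value.
            void snapshot(bool supportSyncReplication,
                          bool isCkptTypeDisk,
                          boost::optional<uint64_t> highCompletedSeqno) {
                const auto sendHCS = supportSyncReplication && isCkptTypeDisk;
                const auto hcsToSend = sendHCS ? highCompletedSeqno : boost::none;
                if (sendHCS) {
                    Expects(hcsToSend.is_initialized()); // throws if HCS empty
                }
            }

            int main() {
                // SyncRep stream + Disk checkpoint, but no HCS supplied -
                // matches frame #8 below (m_initialized = false, sendHCS = true).
                // The uncaught fail_fast terminates the process, as memcached did.
                snapshot(true, true, boost::none);
            }
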

            The previously reported GDB backtrace looks incorrect, probably due to a wrong debug symbols package. Investigating...

            drigby Dave Rigby added a comment - edited

            Duplicate of MB-37103

            Edit: This MB is with an Ephemeral bucket; the other is with a persistent bucket, so they might actually have different root causes...

            drigby Dave Rigby added a comment -

            (Continuing analysis on this instance of the bug, given I already have an environment set up for those logs...)

            Backtrace with symbols is:

            (gdb) bt full
            ...
            #6  0x00007f5732e3e213 in __cxxabiv1::__cxa_throw (obj=obj@entry=0x7f56ec0009f0, tinfo=tinfo@entry=0x7f572dd9d110 <typeinfo for gsl::fail_fast>, dest=dest@entry=0x7f572d8dfc90 <gsl::fail_fast::~fail_fast()>)
                at /tmp/deploy/gcc-7.3.0/libstdc++-v3/libsupc++/eh_throw.cc:93
                    globals = <optimized out>
                    header = 0x7f56ec000970
            #7  0x00007f572d8fc098 in fail_fast_assert (message=0x7f572daf0a48 "GSL: Precondition failure at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/dcp/active_stream.cc: 1185", cond=false)
                at /home/couchbase/jenkins/workspace/couchbase-server-unix/third_party/gsl-lite/include/gsl/gsl-lite.h:473
            No locals.
            #8  ActiveStream::snapshot (this=this@entry=0x7f56f311b910, checkpointType=<optimized out>, items=..., highCompletedSeqno=...) at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/dcp/active_stream.cc:1185
                    flags = <optimized out>
                    snapEnd = 99
                    hcsToSend = {<boost::optional_detail::tc_optional_base<unsigned long>> = {<boost::optional_detail::optional_tag> = {<No data fields>}, m_initialized = false, m_storage = 140012460016000}, <No data fields>}
                    isCkptTypeDisk = <optimized out>
                    snapStart = 0
                    sendHCS = true
            #9  0x00007f572d8ff434 in ActiveStream::processItems (this=this@entry=0x7f56f311b910, outstandingItemsResult=..., streamMutex=...) at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/dcp/active_stream.cc:1101
                    mutations = {<std::_Deque_base<std::unique_ptr<DcpResponse, std::default_delete<DcpResponse> >, std::allocator<std::unique_ptr<DcpResponse, std::default_delete<DcpResponse> > > >> = {
                        _M_impl = {<std::allocator<std::unique_ptr<DcpResponse, std::default_delete<DcpResponse> > >> = {<__gnu_cxx::new_allocator<std::unique_ptr<DcpResponse, std::default_delete<DcpResponse> > >> = {<No data fields>}, <No data fields>}, 
                          _M_map = 0x7f57176d1400, _M_map_size = 8, _M_start = {_M_cur = 0x7f56f2b35000, _M_first = 0x7f56f2b35000, _M_last = 0x7f56f2b35200, _M_node = 0x7f57176d1418}, _M_finish = {_M_cur = 0x7f56f2b35718, _M_first = 0x7f56f2b35600, _M_last = 0x7f56f2b35800, 
                            _M_node = 0x7f57176d1420}}}, <No data fields>}
            #10 0x00007f572d8ff843 in ActiveStream::nextCheckpointItemTask (this=this@entry=0x7f56f311b910, streamMutex=...) at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/dcp/active_stream.cc:868
                    res = {checkpointType = Disk, items = {<std::_Vector_base<SingleThreadedRCPtr<Item, Item*, std::default_delete<Item> >, std::allocator<SingleThreadedRCPtr<Item, Item*, std::default_delete<Item> > > >> = {
                          _M_impl = {<std::allocator<SingleThreadedRCPtr<Item, Item*, std::default_delete<Item> > >> = {<__gnu_cxx::new_allocator<SingleThreadedRCPtr<Item, Item*, std::default_delete<Item> > >> = {<No data fields>}, <No data fields>}, _M_start = 0x7f571770a800, 
                            _M_finish = 0x7f571770ab40, _M_end_of_storage = 0x7f571770ac00}}, <No data fields>}, highCompletedSeqno = {<boost::optional_detail::tc_optional_base<unsigned long>> = {<boost::optional_detail::optional_tag> = {<No data fields>}, 
                          m_initialized = false, m_storage = 140012460016000}, <No data fields>}}
                    vbucket = {<std::__shared_ptr<VBucket, (__gnu_cxx::_Lock_policy)2>> = {<std::__shared_ptr_access<VBucket, (__gnu_cxx::_Lock_policy)2, false, false>> = {<No data fields>}, _M_ptr = 0x7f56fcddda00, _M_refcount = {_M_pi = 0x7f56fcddc440}}, <No data fields>}
            #11 0x00007f572d8ff924 in ActiveStream::nextCheckpointItemTask (this=0x7f56f311b910) at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/dcp/active_stream.cc:858
                    lh = {_M_device = @0x7f56f311b978}
            #12 0x00007f572d9089f9 in ActiveStreamCheckpointProcessorTask::run (this=0x7f56f31ffa10) at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/dcp/active_stream_checkpoint_processor_task.cc:56
                    rh = {<StreamContainer<std::shared_ptr<Stream> >::Iterable<std::_Fwd_list_const_iterator<std::shared_ptr<Stream> > >> = {itr = {_M_node = 0x7f5730e39a20}, endItr = {_M_node = 0x0}, before = {_M_node = <optimized out>}}, readLock = {_M_pm = 0x7f571770dd28, 
                        _M_owns = <optimized out>}, container = <optimized out>}
                    streams = {<std::__shared_ptr<StreamContainer<std::shared_ptr<Stream> >, (__gnu_cxx::_Lock_policy)2>> = {<std::__shared_ptr_access<StreamContainer<std::shared_ptr<Stream> >, (__gnu_cxx::_Lock_policy)2, false, false>> = {<No data fields>}, 
                        _M_ptr = 0x7f571770dd10, _M_refcount = {_M_pi = 0x7f571770dd00}}, <No data fields>}
                    iterations = 0
                    expected = 16
            #13 0x00007f572d9b7964 in ExecutorThread::run (this=0x7f5730f9d960) at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/executorthread.cc:187
                    curTaskDescr = {static npos = 18446744073709551615, _M_dataplus = {<std::allocator<char>> = {<__gnu_cxx::new_allocator<char>> = {<No data fields>}, <No data fields>}, _M_p = 0x7f56f31a8c80 <Address 0x7f56f31a8c80 out of bounds>}, _M_string_length = 107, {
                        _M_local_buf = "k\000\000\000\000\000\000\000\333\324\354\062W\177\000", _M_allocated_capacity = 107}}
                    woketime = <optimized out>
                    scheduleOverhead = <optimized out>
                    again = <optimized out>
                    runtime = <optimized out>
                    q = <optimized out>
                    tick = 173 '\255'
            #14 0x00007f5734c60ee7 in run (this=0x7f5730e82fd0) at /home/couchbase/jenkins/workspace/couchbase-server-unix/platform/src/cb_pthreads.cc:58
            No locals.
            #15 platform_thread_wrap (arg=0x7f5730e82fd0) at /home/couchbase/jenkins/workspace/couchbase-server-unix/platform/src/cb_pthreads.cc:71
                    context = {_M_t = {
                        _M_t = {<std::_Tuple_impl<0, CouchbaseThread*, std::default_delete<CouchbaseThread> >> = {<std::_Tuple_impl<1, std::default_delete<CouchbaseThread> >> = {<std::_Head_base<1, std::default_delete<CouchbaseThread>, true>> = {<std::default_delete<CouchbaseThread>> = {<No data fields>}, <No data fields>}, <No data fields>}, <std::_Head_base<0, CouchbaseThread*, false>> = {_M_head_impl = 0x7f5730e82fd0}, <No data fields>}, <No data fields>}}}
            #16 0x00007f5732681dd5 in start_thread (arg=0x7f56fe7fc700) at pthread_create.c:307
                    __res = <optimized out>
                    pd = 0x7f56fe7fc700
                    now = <optimized out>
                    unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140011613701888, 4224282076292804119, 0, 8392704, 0, 140011613701888, -4309286418460742121, -4309435791131447785}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, 
                          canceltype = 0}}}
                    not_first_call = <optimized out>
                    pagesize_m1 = <optimized out>
                    sp = <optimized out>
                    freesize = <optimized out>
            #17 0x00007f57323aaead in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
            

            Many of the variables have been optimised out; however, we can see that snapEnd=99.

            The logs immediately before the crash were for vb:850 (consumer at .242), so this is likely to be the vBucket affected:

            2019-12-01T22:08:14.136425-08:00 INFO 50: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.23.104.243->ns_1@172.23.104.242:default - (vb:850) Creating stream with start seqno 0 and end seqno 18446744073709551615; requested end seqno was 18446744073709551615, collections-manifest uid:none, sid:none
            2019-12-01T22:08:14.136487-08:00 INFO 50: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.23.104.243->ns_1@172.23.104.242:default - (vb:850) ActiveStream::scheduleBackfill_UNLOCKED register cursor with name "eq_dcpq:replication:ns_1@172.23.104.243->ns_1@172.23.104.242:default" backfill:true, seqno:1
            2019-12-01T22:08:14.136509-08:00 INFO 50: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.23.104.243->ns_1@172.23.104.242:default - (vb:850) ActiveStream::transitionState: Transitioning from backfilling to in-memory
            2019-12-01T22:08:14.137299-08:00 CRITICAL *** Fatal error encountered during exception handling ***
            

            Looking at the consumer node (.242) at the same time, we do see the connection to .243 (producer) get closed, with the streams for vbs 849, 850 and 852 getting closed:

            2019-12-01T22:08:14.212202-08:00 INFO 55: (No Engine) DCP (Consumer) eq_dcpq:replication:ns_1@172.23.104.243->ns_1@172.23.104.242:default - Removing connection [ 127.0.0.1:42686 - 127.0.0.1:11209 (<ud>@ns_server</ud>) ]
            2019-12-01T22:08:14.212235-08:00 WARNING 55: (default) DCP (Consumer) eq_dcpq:replication:ns_1@172.23.104.243->ns_1@172.23.104.242:default - (vb:852) Setting stream to dead state, last_seqno is 0, unAckedBytes is 0, status is The stream closed early because the conn was disconnected
            2019-12-01T22:08:14.212242-08:00 WARNING 55: (default) DCP (Consumer) eq_dcpq:replication:ns_1@172.23.104.243->ns_1@172.23.104.242:default - (vb:850) Setting stream to dead state, last_seqno is 0, unAckedBytes is 0, status is The stream closed early because the conn was disconnected
            2019-12-01T22:08:14.212248-08:00 WARNING 55: (default) DCP (Consumer) eq_dcpq:replication:ns_1@172.23.104.243->ns_1@172.23.104.242:default - (vb:849) Setting stream to dead state, last_seqno is 0, unAckedBytes is 0, status is The stream closed early because the conn was disconnected
            

            Assuming it was vb:850 which hit this problem, the history of it on node .243 (producer) is:

            # Created as Pending:
            2019-12-01T22:07:05.874926-08:00 INFO (default) VBucket: created vb:850 with state:pending initialState:dead lastSeqno:0 persistedRange:{0,0} max_cas:0 uuid:122965773455347 topology:null
            # Populated from node .211, up to seqno 99:
            2019-12-01T22:07:05.877268-08:00 INFO 50: (default) DCP (Consumer) eq_dcpq:replication:ns_1@172.23.104.211->ns_1@172.23.104.243:default - (vb:850) Attempting to add stream: opaque_:42, start_seqno_:0, end_seqno_:18446744073709551615, vb_uuid:122965773455347, snap_start_seqno_:0, snap_end_seqno_:0, last_seqno:0, stream_req_value:{"uid":"0"}
            ...
            2019-12-01T22:07:06.132700-08:00 INFO 50: (default) DCP (Consumer) eq_dcpq:replication:ns_1@172.23.104.211->ns_1@172.23.104.243:default - (vb:850) Setting stream to dead state, last_seqno is 99, unAckedBytes is 0, status is The stream closed due to a close stream message
            ...
            # Changed to Active:
            2019-12-01T22:07:06.142155-08:00 INFO (default) VBucket::setState: transitioning vb:850 with high seqno:99 from:pending to:active
            # Topology added:
            2019-12-01T22:07:06.258430-08:00 INFO (default) VBucket::setState: transitioning vb:850 with high seqno:99 from:active to:active meta:{"topology":[["ns_1@172.23.104.243","ns_1@172.23.104.245"]]}
            ...
            # Crash occurs when setting up stream to .242:
            2019-12-01T22:08:14.136509-08:00 INFO 50: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.23.104.243->ns_1@172.23.104.242:default - (vb:850) ActiveStream::transitionState: Transitioning from backfilling to in-memory
            2019-12-01T22:08:14.137299-08:00 CRITICAL *** Fatal error encountered during exception handling ***
            

            Following the sequence back, node .243 obtained vb:850 from node .211 at:

            2019-12-01T22:07:05.877268-08:00 INFO 50: (default) DCP (Consumer) eq_dcpq:replication:ns_1@172.23.104.211->ns_1@172.23.104.243:default - (vb:850) Attempting to add stream: opaque_:42, start_seqno_:0, end_seqno_:18446744073709551615, vb_uuid:122965773455347, snap_start_seqno_:0, snap_end_seqno_:0, last_seqno:0, stream_req_value:{"uid":"0"}
            

            Looking at node .211 we see the following snapshot range (note this is Ephemeral):

            2019-12-01T22:07:05.879986-08:00 INFO 62: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.23.104.211->ns_1@172.23.104.243:default - (vb:850) Creating stream with start seqno 0 and end seqno 18446744073709551615; requested end seqno was 18446744073709551615, collections-manifest uid:none, sid:none
            2019-12-01T22:07:05.880014-08:00 INFO 62: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.23.104.211->ns_1@172.23.104.243:default - (vb:850) ActiveStream::scheduleBackfill_UNLOCKED register cursor with name "eq_dcpq:replication:ns_1@172.23.104.211->ns_1@172.23.104.243:default" backfill:true, seqno:100
            2019-12-01T22:07:05.880074-08:00 INFO 62: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.23.104.211->ns_1@172.23.104.243:default - (vb:850) Scheduling backfill from 1 to 99, reschedule flag : False
            2019-12-01T22:07:05.880576-08:00 INFO (default) vb:850 Created range iterator from 1 to 100
            2019-12-01T22:07:05.880601-08:00 INFO 62: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.23.104.211->ns_1@172.23.104.243:default - (vb:850) ActiveStream::markDiskSnapshot: Sending disk snapshot with start seqno 0, end seqno 99, and high completed seqno --
            2019-12-01T22:07:05.880655-08:00 INFO 62: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.23.104.211->ns_1@172.23.104.243:default - (vb:850) ActiveStream::registerCursor name "eq_dcpq:replication:ns_1@172.23.104.211->ns_1@172.23.104.243:default", backfill:true, seqno:100
            2019-12-01T22:07:05.881167-08:00 INFO (default) vb:850 Releasing the range iterator
            2019-12-01T22:07:05.881313-08:00 INFO 62: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.23.104.211->ns_1@172.23.104.243:default - (vb:850) Backfill complete, 0 items read from disk, 99 from memory, last seqno read: 99, pendingBackfill : False
            

            Note this log message, which highlights the problem - the high completed seqno is null (--):

            2019-12-01T22:07:05.880601-08:00 INFO 62: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.23.104.211->ns_1@172.23.104.243:default - (vb:850) ActiveStream::markDiskSnapshot: Sending disk snapshot with start seqno 0, end seqno 99, and high completed seqno --
            

            drigby Dave Rigby added a comment -

            I suspect the following code is the cause of the issue - note the *highCompletedSeqno != 0 clause:

            // If the stream supports SyncRep then send the HCS in the
            // SnapshotMarker if it is not 0
            auto sendHCS = supportSyncReplication() && highCompletedSeqno &&
                           *highCompletedSeqno != 0;
            auto hcsToSend = sendHCS ? highCompletedSeqno : boost::none;
            log(spdlog::level::level_enum::info,
                "{} ActiveStream::markDiskSnapshot: Sending disk snapshot with "
                "start seqno {}, end seqno {}, and"
                " high completed seqno {}",
                logPrefix,
                startSeqno,
                endSeqno,
                hcsToSend);
            

            On a bucket which has never performed any SyncWrites, the HCS will indeed be zero. This causes sendHCS to be false, and hence hcsToSend will be boost::none. This results in the Checkpoint on the recipient being marked as a Disk Snapshot but without an HCS, which ultimately causes the crash seen when this node later becomes Active and attempts to stream out a Snapshot marker.
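
            As an illustration of the suspected mismatch (a sketch, not the
            actual kv_engine code; the function names here are made up), the
            producer-side suppression and the later snapshot() precondition
            disagree precisely when the HCS is 0:

            #include <boost/optional.hpp>
            #include <cassert>
            #include <cstdint>

            using Hcs = boost::optional<uint64_t>;

            // Producer side (as quoted above): an HCS of 0 is sent as boost::none.
            Hcs hcsForDiskSnapshot(bool supportSyncReplication, Hcs highCompletedSeqno) {
                const bool sendHCS = supportSyncReplication && highCompletedSeqno &&
                                     *highCompletedSeqno != 0;
                return sendHCS ? highCompletedSeqno : boost::none;
            }

            // Recipient side, once it becomes Active: a disk checkpoint on a
            // SyncRep stream is expected to carry an HCS (active_stream.cc:1185).
            bool snapshotPreconditionHolds(bool supportSyncReplication,
                                           bool isCkptTypeDisk,
                                           Hcs hcs) {
                const bool sendHCS = supportSyncReplication && isCkptTypeDisk;
                return !sendHCS || hcs.is_initialized();
            }

            int main() {
                // Bucket that has never seen a SyncWrite: HCS == 0.
                Hcs received = hcsForDiskSnapshot(true, Hcs(0)); // -> boost::none
                // The recipient marks the checkpoint as Disk but with no HCS,
                // so the precondition is violated when it streams it onwards.
                assert(!snapshotPreconditionHolds(true, true, received));
            }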

            Note that http://review.couchbase.org/#/c/113492/ added this clause, to avoid an assertion in the consumer's flusher when it attempts to flush an HCS not greater than the current persisted completed seqno - which fails for the initial snapshot (0 > 0):

                            if (hcs) {
                                Expects(hcs > vbstate.persistedCompletedSeqno);
            

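            For completeness, a tiny sketch (with assumed types; not the
            actual flusher code) of why that clause was needed: for the
            initial snapshot both values are 0, so the strict comparison
            fails:

            #include <boost/optional.hpp>
            #include <cstdint>

            int main() {
                boost::optional<uint64_t> hcs = 0;    // HCS carried by the snapshot
                uint64_t persistedCompletedSeqno = 0; // fresh vbucket state
                if (hcs) {
                    // Equivalent of Expects(hcs > vbstate.persistedCompletedSeqno):
                    // 0 > 0 is false - the assertion the clause was added to avoid.
                    bool holds = hcs > persistedCompletedSeqno; // false
                    (void)holds;
                }
            }
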
            Balakumaran.Gopal Balakumaran Gopal added a comment -

            Validated this on 6.5.0-4917. Logs: http://qa.sc.couchbase.com/job/temp_rebalance_even/1770/consoleText

            People

              Assignee: Balakumaran.Gopal Balakumaran Gopal
              Reporter: Balakumaran.Gopal Balakumaran Gopal