Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-40286

VBucket::commit (vb:924) failed as no HashTable item found with key

    XMLWordPrintable

Details

    Description

      Script to Repro

      ./testrunner -i /tmp/win10-bucket-ops.ini rerun=False,crash_warning=True,quota_percent=95 -t 
      failover.AutoFailoverTests.AutoFailoverTests. test_autofailover_during_rebalance,timeout=5,num_node_failures=1,nodes_in=0,nodes_out=1,auto_reprovision=True,failover_action=restart_server,nodes_init=4,can_abort_rebalance=False,override_spec_params=durability,durability=MAJORITY,bucket_spec=multi_bucket.buckets_all_ephemeral,GROUP=auto_reprovision_durability
      

      ./testrunner -i /tmp/win10-bucket-ops.ini rerun=False,crash_warning=True,quota_percent=95 -t failover.AutoFailoverTests.AutoFailoverTests.test_autofailover_during_rebalance,timeout=5,num_node_failures=1,nodes_in=0,nodes_out=1,failover_action=restart_server,nodes_init=4,can_abort_rebalance=False,override_spec_params=durability,durability=MAJORITY,bucket_spec=single_bucket.def_scope_fifty_collections,GROUP=auto_failover_durability
      

      Memcached Log on 172.23.104.186:

      memcached.log.000207.txt:2020-07-04T16:54:30.772315-07:00 CRITICAL *** Fatal error encountered during exception handling ***
      memcached.log.000207.txt:2020-07-04T16:54:30.772390-07:00 CRITICAL Caught unhandled std::exception-derived exception. what(): ActiveDurabilityMonitor::commit vb:924 failed with status:1
      memcached.log.000207.txt:2020-07-04T16:54:30.975240-07:00 CRITICAL Breakpad caught a crash (Couchbase version 7.0.0-2524). Writing crash dump to /opt/couchbase/var/lib/couchbase/crash/ada15f48-bf3d-4625-91f97780-782054e8.dmp before terminating.
      memcached.log.000207.txt:2020-07-04T16:54:30.975269-07:00 CRITICAL Stack backtrace of crashed thread:
      memcached.log.000207.txt:2020-07-04T16:54:30.975521-07:00 CRITICAL     /opt/couchbase/bin/memcached() [0x400000+0x13cd8d]
      memcached.log.000207.txt:2020-07-04T16:54:30.975544-07:00 CRITICAL     /opt/couchbase/bin/memcached(_ZN15google_breakpad16ExceptionHandler12GenerateDumpEPNS0_12CrashContextE+0x3ea) [0x400000+0x15206a]
      memcached.log.000207.txt:2020-07-04T16:54:30.975560-07:00 CRITICAL     /opt/couchbase/bin/memcached(_ZN15google_breakpad16ExceptionHandler13SignalHandlerEiP9siginfo_tPv+0xb8) [0x400000+0x1523a8]
      memcached.log.000207.txt:2020-07-04T16:54:30.975574-07:00 CRITICAL     /lib64/libpthread.so.0() [0x7f90ab565000+0xf630]
      memcached.log.000207.txt:2020-07-04T16:54:30.975619-07:00 CRITICAL     /lib64/libc.so.6(gsignal+0x37) [0x7f90ab197000+0x36387]
      memcached.log.000207.txt:2020-07-04T16:54:30.975658-07:00 CRITICAL     /lib64/libc.so.6(abort+0x148) [0x7f90ab197000+0x37a78]
      memcached.log.000207.txt:2020-07-04T16:54:30.975704-07:00 CRITICAL     /opt/couchbase/bin/../lib/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x125) [0x7f90abc9a000+0x91195]
      memcached.log.000207.txt:2020-07-04T16:54:30.975721-07:00 CRITICAL     /opt/couchbase/bin/memcached() [0x400000+0x14d562]
      memcached.log.000207.txt:2020-07-04T16:54:30.975757-07:00 CRITICAL     /opt/couchbase/bin/../lib/libstdc++.so.6() [0x7f90abc9a000+0x8ef86]
      memcached.log.000207.txt:2020-07-04T16:54:30.975794-07:00 CRITICAL     /opt/couchbase/bin/../lib/libstdc++.so.6() [0x7f90abc9a000+0x8efd1]
      memcached.log.000207.txt:2020-07-04T16:54:30.975825-07:00 CRITICAL     /opt/couchbase/bin/../lib/libstdc++.so.6() [0x7f90abc9a000+0x8f213]
      memcached.log.000207.txt:2020-07-04T16:54:30.975841-07:00 CRITICAL     /opt/couchbase/bin/../lib/libep.so() [0x7f90af3fd000+0xdd260]
      memcached.log.000207.txt:2020-07-04T16:54:30.975854-07:00 CRITICAL     /opt/couchbase/bin/../lib/libep.so() [0x7f90af3fd000+0xd79b2]
      memcached.log.000207.txt:2020-07-04T16:54:30.975863-07:00 CRITICAL     /opt/couchbase/bin/../lib/libep.so() [0x7f90af3fd000+0xd7b13]
      memcached.log.000207.txt:2020-07-04T16:54:30.975877-07:00 CRITICAL     /opt/couchbase/bin/../lib/libep.so() [0x7f90af3fd000+0x1a3a60]
      memcached.log.000207.txt:2020-07-04T16:54:30.975887-07:00 CRITICAL     /opt/couchbase/bin/../lib/libep.so() [0x7f90af3fd000+0xdeefd]
      memcached.log.000207.txt:2020-07-04T16:54:30.975899-07:00 CRITICAL     /opt/couchbase/bin/../lib/libep.so() [0x7f90af3fd000+0x148d83]
      memcached.log.000207.txt:2020-07-04T16:54:30.975909-07:00 CRITICAL     /opt/couchbase/bin/../lib/libep.so() [0x7f90af3fd000+0x13f5af]
      memcached.log.000207.txt:2020-07-04T16:54:30.975921-07:00 CRITICAL     /opt/couchbase/bin/../lib/libplatform_so.so.0.1.0() [0x7f90adf1f000+0x10777]
      memcached.log.000207.txt:2020-07-04T16:54:30.975933-07:00 CRITICAL     /lib64/libpthread.so.0() [0x7f90ab565000+0x7ea5]
      memcached.log.000207.txt:2020-07-04T16:54:30.976031-07:00 CRITICAL     /lib64/libc.so.6(clone+0x6d) [0x7f90ab197000+0xfe8dd]
      

      bt full

      (gdb) bt full
      #0  0x00007f22205f2387 in raise () from /lib64/libc.so.6
      No symbol table info available.
      #1  0x00007f22205f3a78 in abort () from /lib64/libc.so.6
      No symbol table info available.
      #2  0x00007f2221150195 in __gnu_cxx::__verbose_terminate_handler () at /tmp/deploy/gcc-7.3.0/libstdc++-v3/libsupc++/vterminate.cc:95
              terminating = false
              t = <optimized out>
      #3  0x000000000054d562 in backtrace_terminate_handler () at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/utilities/terminate_handler.cc:86
      No locals.
      #4  0x00007f222114df86 in __cxxabiv1::__terminate (handler=<optimized out>) at /tmp/deploy/gcc-7.3.0/libstdc++-v3/libsupc++/eh_terminate.cc:47
      No locals.
      #5  0x00007f222114dfd1 in std::terminate () at /tmp/deploy/gcc-7.3.0/libstdc++-v3/libsupc++/eh_terminate.cc:57
      No locals.
      #6  0x00007f222114e213 in __cxxabiv1::__cxa_throw (obj=obj@entry=0x7f21740dc910, tinfo=0x7ff910 <typeinfo for std::logic_error>, dest=0x418e50 <_ZNSt11logic_errorD1Ev@plt>) at /tmp/deploy/gcc-7.3.0/libstdc++-v3/libsupc++/eh_throw.cc:93
              globals = <optimized out>
              header = 0x7f21740dc890
      #7  0x00007f2224879fd6 in ThrowExceptionPolicy<long>::nonMonotonic (this=<optimized out>, curValue=<optimized out>, newValue=<optimized out>)
          at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/monotonic.h:53
      No locals.
      #8  0x00007f22248f9943 in operator= (other=..., this=<optimized out>) at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/monotonic.h:136
      No locals.
      #9  updateHighCompletedSeqno (this=<optimized out>) at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/durability/active_durability_monitor.cc:1518
      No locals.
      #10 ActiveDurabilityMonitor::commit (this=this@entry=0x7f21cb44f460, sw=...) at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/durability/active_durability_monitor.cc:863
              s = {
                holder_ = {<folly::LockedPtrBase<folly::Synchronized<std::unique_ptr<ActiveDurabilityMonitor::State, std::default_delete<ActiveDurabilityMonitor::State> >, folly::SharedMutexImpl<false, void, std::atomic, false, false> >, folly::SharedMutexImpl<false, void, std::atomic, false, false>, folly::LockPolicyExclusive>> = {parent_ = 0x7f21cb44f498}, static AllowsConcurrentAccess = false}}
              key = <optimized out>
              cHandle = {<Collections::VB::Manifest::ReadHandle> = {readLock = {lock_ = 0x7f21cb304c64, token_ = {type_ = folly::SharedMutexToken::DEFERRED_SHARED, slot_ = 16}}, manifest = 0x7f21cb304be0}, 
                itr = {<folly::f14::detail::BaseIter<std::pair<CollectionID const, Collections::VB::ManifestEntry> const*, unsigned int>> = {<std::iterator<std::forward_iterator_tag, std::pair<CollectionID const, Collections::VB::ManifestEntry>, long, std::pair<CollectionID const, Collections::VB::ManifestEntry> const*, std::pair<CollectionID const, Collections::VB::ManifestEntry> const&>> = {<No data fields>}, <No data fields>}, current_ = 0x7f21c1602da0, 
                  lowest_ = 0x7f21c1602200}, key = {<DocKeyInterface<DocKey>> = {<No data fields>}, buffer = {static npos = <optimized out>, buf = 0x7f21d9601320 <Address 0x7f21d9601320 out of bounds>, len = 22}, encoding = Yes}}
              prepareEnd = {__d = {__r = 144042419775393}}
              cookie = <optimized out>
              result = <optimized out>
              __func__ = "commit"
              index = <optimized out>
              commitDuration = <optimized out>
      #11 0x00007f22248f9b13 in ActiveDurabilityMonitor::processCompletedSyncWriteQueue (this=0x7f21cb44f460) at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/durability/active_durability_monitor.cc:448
              sw = {storage_ = {{emptyState = -48 '\320', value = {<DurabilityMonitor::SyncWrite> = {item = {value = 0x7f21be84a6d0}, startTime = {__d = {__r = 140602082705333}}, status = ToCommit}, cookie = 0x0, firstChain = {
                        ackCount = {<ThrowExceptionPolicy<unsigned char>> = {label = {static npos = 18446744073709551615, _M_dataplus = {<std::allocator<char>> = {<__gnu_cxx::new_allocator<char>> = {<No data fields>}, <No data fields>}, 
                                _M_p = 0x7f21d9fd8ed0 "unlabelled"}, _M_string_length = 10, {_M_local_buf = "unlabelled\000\313!\177\000", _M_allocated_capacity = 7812730926796664437}}}, val = 2 '\002'}, chainPtr = 0x7f21b18e9440}, 
                      secondChain = {ackCount = {<ThrowExceptionPolicy<unsigned char>> = {label = {static npos = 18446744073709551615, 
                              _M_dataplus = {<std::allocator<char>> = {<__gnu_cxx::new_allocator<char>> = {<No data fields>}, <No data fields>}, _M_p = 0x7f21d9fd8f00 "unlabelled"}, _M_string_length = 10, {
                                _M_local_buf = "unlabelled\000!\"\177\000", _M_allocated_capacity = 7812730926796664437}}}, val = 0 '\000'}, chainPtr = 0x0}, 
                      expiryTime = {<std::_Optional_base<std::chrono::time_point<std::chrono::_V2::steady_clock, std::chrono::duration<long, std::ratio<1, 1000000000> > > >> = {_M_payload = {{_M_empty = {<No data fields>}, _M_payload = {
                                __d = {__r = 139783120942176}}}, 
                            _M_engaged = false}}, <std::_Enable_copy_move<true, true, true, true, std::optional<std::chrono::time_point<std::chrono::_V2::steady_clock, std::chrono::duration<long, std::ratio<1, 1000000000> > > > >> = {<No data fields>}, <No data fields>}}}, hasValue = true}}
              lock = {_M_device = @0x7f21fc003028}
      #12 0x00007f22249c5a60 in VBucket::processResolvedSyncWrites (this=0x7f21cb441200) at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/vbucket.cc:700
              rlh = {lock_ = 0x7f21cb4413d8, token_ = {type_ = folly::SharedMutexToken::INLINE_SHARED, slot_ = 8805}}
      #13 0x00007f2224900efd in DurabilityCompletionTask::run (this=0x7f2215736510) at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/durability/durability_completion_task.cc:53
              vb = {<std::__shared_ptr<VBucket, (__gnu_cxx::_Lock_policy)2>> = {<std::__shared_ptr_access<VBucket, (__gnu_cxx::_Lock_policy)2, false, false>> = {<No data fields>}, _M_ptr = 0x7f21cb441200, _M_refcount = {
                    _M_pi = 0x7f21cb458ea0}}, <No data fields>}
              runtime = <optimized out>
              startTime = {__d = {__r = 144042419764058}}
              pendingVb = {vbid = 557}
      #14 0x00007f222496ad83 in GlobalTask::execute (this=0x7f2215736510) at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/globaltask.cc:73
              guard = {previous = 0x0}
      #15 0x00007f22249615af in ExecutorThread::run (this=0x7f221f1f6880) at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/executorthread.cc:188
              curTaskDescr = {static npos = 18446744073709551615, _M_dataplus = {<std::allocator<char>> = {<__gnu_cxx::new_allocator<char>> = {<No data fields>}, <No data fields>}, 
                  _M_p = 0x7f221ef0ba40 <Address 0x7f221ef0ba40 out of bounds>}, _M_string_length = 24, {_M_local_buf = "\030\000\000\000\000\000\000\000pressor", _M_allocated_capacity = 24}}
              woketime = <optimized out>
      ---Type <return> to continue, or q <return> to quit---
              scheduleOverhead = <optimized out>
              again = <optimized out>
              runtime = <optimized out>
              q = <optimized out>
              tick = 128 '\200'
              guard = {engine = 0x0}
      #16 0x00007f2223354777 in run (this=0x7f221fe4a0a0) at /home/couchbase/jenkins/workspace/couchbase-server-unix/platform/src/cb_pthreads.cc:58
      No locals.
      #17 platform_thread_wrap (arg=0x7f221fe4a0a0) at /home/couchbase/jenkins/workspace/couchbase-server-unix/platform/src/cb_pthreads.cc:71
              context = {_M_t = {
                  _M_t = {<std::_Tuple_impl<0, CouchbaseThread*, std::default_delete<CouchbaseThread> >> = {<std::_Tuple_impl<1, std::default_delete<CouchbaseThread> >> = {<std::_Head_base<1, std::default_delete<CouchbaseThread>, true>> = {<std::default_delete<CouchbaseThread>> = {<No data fields>}, <No data fields>}, <No data fields>}, <std::_Head_base<0, CouchbaseThread*, false>> = {_M_head_impl = 0x7f221fe4a0a0}, <No data fields>}, <No data fields>}}}
      #18 0x00007f2220991ea5 in start_thread () from /lib64/libpthread.so.0
      No symbol table info available.
      #19 0x00007f22206ba8dd in clone () from /lib64/libc.so.6
      No symbol table info available.
      (gdb) 
      

      Attachments

        1. Automation_logs.zip
          11 kB
        2. consoleText.txt.zip
          2.65 MB
        3. pcaps.zip
          14.04 MB

        Issue Links

          No reviews matched the request. Check your Options in the drop-down menu of this sections header.

          Activity

            drigby Dave Rigby added a comment -

            From cbcollect_info_ns_1@172.23.98.195_20200716-052816:

            2020-07-15T22:19:24.911469-07:00 INFO (default) VBucket::setState: transitioning vb:940 with high seqno:1908 from:pending to:active
            2020-07-15T22:19:25.507969-07:00 INFO (default) VBucket::setState: transitioning vb:940 with high seqno:1908 from:active to:active meta:{"topology":[["ns_1@172.23.98.195","ns_1@172.23.120.206","ns_1@172.23.104.186"]]}
            2020-07-15T22:19:25.531828-07:00 INFO 1568: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.23.98.195->ns_1@172.23.120.206:default - (vb:940) Creating stream with start seqno 1908 and end seqno 18446744073709551615; requested 
            end seqno was 18446744073709551615, collections-manifest uid:none, filter:0 sid:none
            2020-07-15T22:19:25.532133-07:00 INFO 1568: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.23.98.195->ns_1@172.23.120.206:default - (vb:940) ActiveStream::scheduleBackfill_UNLOCKED register cursor with name "eq_dcpq:replicati
            on:ns_1@172.23.98.195->ns_1@172.23.120.206:default" backfill:false, seqno:1909
            2020-07-15T22:19:25.532187-07:00 INFO 1568: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.23.98.195->ns_1@172.23.120.206:default - (vb:940) ActiveStream::transitionState: Transitioning from backfilling to in-memory
            2020-07-15T22:19:25.538050-07:00 ERROR (default) VBucket::commit (vb:940) failed as no HashTable item found with key:<ud>cid:0x36:test_collections-273</ud> prepare_seqno:1122, commit_seqno:none
            2020-07-15T22:19:25.538292-07:00 CRITICAL *** Fatal error encountered during exception handling ***
            2020-07-15T22:19:25.538353-07:00 CRITICAL Caught unhandled std::exception-derived exception. what(): ActiveDurabilityMonitor::commit vb:940 failed with status:1
            2020-07-15T22:19:25.837466-07:00 CRITICAL Breakpad caught a crash (Couchbase version 7.0.0-2620). Writing crash dump to /opt/couchbase/var/lib/couchbase/crash/9ab1a5a8-9e1b-434e-c616d990-e35e02d0.dmp before terminating.
            2020-07-15T22:19:25.837484-07:00 CRITICAL Stack backtrace of crashed thread:
            2020-07-15T22:19:25.837685-07:00 CRITICAL     /opt/couchbase/bin/memcached() [0x400000+0x13ce4d]
            2020-07-15T22:19:25.837705-07:00 CRITICAL     /opt/couchbase/bin/memcached(_ZN15google_breakpad16ExceptionHandler12GenerateDumpEPNS0_12CrashContextE+0x3ea) [0x400000+0x1523ea]
            2020-07-15T22:19:25.837721-07:00 CRITICAL     /opt/couchbase/bin/memcached(_ZN15google_breakpad16ExceptionHandler13SignalHandlerEiP9siginfo_tPv+0xb8) [0x400000+0x152728]
            2020-07-15T22:19:25.837733-07:00 CRITICAL     /lib64/libpthread.so.0() [0x7fd0350c3000+0xf630]
            2020-07-15T22:19:25.837775-07:00 CRITICAL     /lib64/libc.so.6(gsignal+0x37) [0x7fd034cf5000+0x36387]
            2020-07-15T22:19:25.837814-07:00 CRITICAL     /lib64/libc.so.6(abort+0x148) [0x7fd034cf5000+0x37a78]
            2020-07-15T22:19:25.837872-07:00 CRITICAL     /opt/couchbase/bin/../lib/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x125) [0x7fd0357f8000+0x91195]
            2020-07-15T22:19:25.837891-07:00 CRITICAL     /opt/couchbase/bin/memcached() [0x400000+0x14d8e2]
            2020-07-15T22:19:25.837961-07:00 CRITICAL     /opt/couchbase/bin/../lib/libstdc++.so.6() [0x7fd0357f8000+0x8ef86]
            2020-07-15T22:19:25.838026-07:00 CRITICAL     /opt/couchbase/bin/../lib/libstdc++.so.6() [0x7fd0357f8000+0x8efd1]
            2020-07-15T22:19:25.838087-07:00 CRITICAL     /opt/couchbase/bin/../lib/libstdc++.so.6() [0x7fd0357f8000+0x8f213]
            2020-07-15T22:19:25.838104-07:00 CRITICAL     /opt/couchbase/bin/../lib/libep.so() [0x7fd038f5c000+0xdd710]
            2020-07-15T22:19:25.838115-07:00 CRITICAL     /opt/couchbase/bin/../lib/libep.so() [0x7fd038f5c000+0xd7f92]
            2020-07-15T22:19:25.838125-07:00 CRITICAL     /opt/couchbase/bin/../lib/libep.so() [0x7fd038f5c000+0xd80f3]
            2020-07-15T22:19:25.838138-07:00 CRITICAL     /opt/couchbase/bin/../lib/libep.so() [0x7fd038f5c000+0x1b4040]
            2020-07-15T22:19:25.838149-07:00 CRITICAL     /opt/couchbase/bin/../lib/libep.so() [0x7fd038f5c000+0xdf12d]
            2020-07-15T22:19:25.838160-07:00 CRITICAL     /opt/couchbase/bin/../lib/libep.so() [0x7fd038f5c000+0x14e2e3]
            2020-07-15T22:19:25.838170-07:00 CRITICAL     /opt/couchbase/bin/../lib/libep.so() [0x7fd038f5c000+0x144a8f]
            2020-07-15T22:19:25.838181-07:00 CRITICAL     /opt/couchbase/bin/../lib/libplatform_so.so.0.1.0() [0x7fd037a7d000+0x10777]
            2020-07-15T22:19:25.838192-07:00 CRITICAL     /lib64/libpthread.so.0() [0x7fd0350c3000+0x7ea5]
            2020-07-15T22:19:25.838243-07:00 CRITICAL     /lib64/libc.so.6(clone+0x6d) [0x7fd034cf5000+0xfe8dd]
            

            drigby Dave Rigby added a comment - From cbcollect_info_ns_1@172.23.98.195_20200716-052816: 2020-07-15T22:19:24.911469-07:00 INFO (default) VBucket::setState: transitioning vb:940 with high seqno:1908 from:pending to:active 2020-07-15T22:19:25.507969-07:00 INFO (default) VBucket::setState: transitioning vb:940 with high seqno:1908 from:active to:active meta:{"topology":[["ns_1@172.23.98.195","ns_1@172.23.120.206","ns_1@172.23.104.186"]]} 2020-07-15T22:19:25.531828-07:00 INFO 1568: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.23.98.195->ns_1@172.23.120.206:default - (vb:940) Creating stream with start seqno 1908 and end seqno 18446744073709551615; requested end seqno was 18446744073709551615, collections-manifest uid:none, filter:0 sid:none 2020-07-15T22:19:25.532133-07:00 INFO 1568: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.23.98.195->ns_1@172.23.120.206:default - (vb:940) ActiveStream::scheduleBackfill_UNLOCKED register cursor with name "eq_dcpq:replicati on:ns_1@172.23.98.195->ns_1@172.23.120.206:default" backfill:false, seqno:1909 2020-07-15T22:19:25.532187-07:00 INFO 1568: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.23.98.195->ns_1@172.23.120.206:default - (vb:940) ActiveStream::transitionState: Transitioning from backfilling to in-memory 2020-07-15T22:19:25.538050-07:00 ERROR (default) VBucket::commit (vb:940) failed as no HashTable item found with key:<ud>cid:0x36:test_collections-273</ud> prepare_seqno:1122, commit_seqno:none 2020-07-15T22:19:25.538292-07:00 CRITICAL *** Fatal error encountered during exception handling *** 2020-07-15T22:19:25.538353-07:00 CRITICAL Caught unhandled std::exception-derived exception. what(): ActiveDurabilityMonitor::commit vb:940 failed with status:1 2020-07-15T22:19:25.837466-07:00 CRITICAL Breakpad caught a crash (Couchbase version 7.0.0-2620). Writing crash dump to /opt/couchbase/var/lib/couchbase/crash/9ab1a5a8-9e1b-434e-c616d990-e35e02d0.dmp before terminating. 2020-07-15T22:19:25.837484-07:00 CRITICAL Stack backtrace of crashed thread: 2020-07-15T22:19:25.837685-07:00 CRITICAL /opt/couchbase/bin/memcached() [0x400000+0x13ce4d] 2020-07-15T22:19:25.837705-07:00 CRITICAL /opt/couchbase/bin/memcached(_ZN15google_breakpad16ExceptionHandler12GenerateDumpEPNS0_12CrashContextE+0x3ea) [0x400000+0x1523ea] 2020-07-15T22:19:25.837721-07:00 CRITICAL /opt/couchbase/bin/memcached(_ZN15google_breakpad16ExceptionHandler13SignalHandlerEiP9siginfo_tPv+0xb8) [0x400000+0x152728] 2020-07-15T22:19:25.837733-07:00 CRITICAL /lib64/libpthread.so.0() [0x7fd0350c3000+0xf630] 2020-07-15T22:19:25.837775-07:00 CRITICAL /lib64/libc.so.6(gsignal+0x37) [0x7fd034cf5000+0x36387] 2020-07-15T22:19:25.837814-07:00 CRITICAL /lib64/libc.so.6(abort+0x148) [0x7fd034cf5000+0x37a78] 2020-07-15T22:19:25.837872-07:00 CRITICAL /opt/couchbase/bin/../lib/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x125) [0x7fd0357f8000+0x91195] 2020-07-15T22:19:25.837891-07:00 CRITICAL /opt/couchbase/bin/memcached() [0x400000+0x14d8e2] 2020-07-15T22:19:25.837961-07:00 CRITICAL /opt/couchbase/bin/../lib/libstdc++.so.6() [0x7fd0357f8000+0x8ef86] 2020-07-15T22:19:25.838026-07:00 CRITICAL /opt/couchbase/bin/../lib/libstdc++.so.6() [0x7fd0357f8000+0x8efd1] 2020-07-15T22:19:25.838087-07:00 CRITICAL /opt/couchbase/bin/../lib/libstdc++.so.6() [0x7fd0357f8000+0x8f213] 2020-07-15T22:19:25.838104-07:00 CRITICAL /opt/couchbase/bin/../lib/libep.so() [0x7fd038f5c000+0xdd710] 2020-07-15T22:19:25.838115-07:00 CRITICAL /opt/couchbase/bin/../lib/libep.so() [0x7fd038f5c000+0xd7f92] 2020-07-15T22:19:25.838125-07:00 CRITICAL /opt/couchbase/bin/../lib/libep.so() [0x7fd038f5c000+0xd80f3] 2020-07-15T22:19:25.838138-07:00 CRITICAL /opt/couchbase/bin/../lib/libep.so() [0x7fd038f5c000+0x1b4040] 2020-07-15T22:19:25.838149-07:00 CRITICAL /opt/couchbase/bin/../lib/libep.so() [0x7fd038f5c000+0xdf12d] 2020-07-15T22:19:25.838160-07:00 CRITICAL /opt/couchbase/bin/../lib/libep.so() [0x7fd038f5c000+0x14e2e3] 2020-07-15T22:19:25.838170-07:00 CRITICAL /opt/couchbase/bin/../lib/libep.so() [0x7fd038f5c000+0x144a8f] 2020-07-15T22:19:25.838181-07:00 CRITICAL /opt/couchbase/bin/../lib/libplatform_so.so.0.1.0() [0x7fd037a7d000+0x10777] 2020-07-15T22:19:25.838192-07:00 CRITICAL /lib64/libpthread.so.0() [0x7fd0350c3000+0x7ea5] 2020-07-15T22:19:25.838243-07:00 CRITICAL /lib64/libc.so.6(clone+0x6d) [0x7fd034cf5000+0xfe8dd]
            ritesh.agarwal Ritesh Agarwal added a comment - - edited

            Noticed the below logs on 4092 build with couchstore on node 172.23.107.78:

            Found 'ERROR' logs - ['2020-12-21T07:38:19.860497-08:00 ERROR (default) VBucket::commit (vb:825) failed as no HashTable item found with key:<ud>cid:0x0:test_docs-8105845</ud> prepare_seqno:21308, commit_seqno:21861\n']
            

            QE test:

            bucket_storage=couchstore,rerun=false,GROUP=P0;durability,randomize_value=true,doc_size=256,bucket_eviction_policy=fullEviction,nodes_init=4,graceful=False,upgrade_version=7.0.0-4072 -t magma.magma_crash_recovery.MagmaCrashTests.test_crash_during_recovery,num_items=5000000,doc_size=1024,sdk_timeout=60,doc_ops=update:delete:expiry,replicas=1,durability=MAJORITY_AND_PERSIST_TO_ACTIVE,GROUP=P0;durability
            

            Logs:
            http://qa.sc.couchbase.com/job/test_suite_executor-TAF/80094/artifact/job_logs/testrunner-20-Dec-21_03-41-47/test_6/

            ritesh.agarwal Ritesh Agarwal added a comment - - edited Noticed the below logs on 4092 build with couchstore on node 172.23.107.78 : Found 'ERROR' logs - ['2020-12-21T07:38:19.860497-08:00 ERROR (default) VBucket::commit (vb:825) failed as no HashTable item found with key:<ud>cid:0x0:test_docs-8105845</ud> prepare_seqno:21308, commit_seqno:21861\n'] QE test: bucket_storage=couchstore,rerun=false,GROUP=P0;durability,randomize_value=true,doc_size=256,bucket_eviction_policy=fullEviction,nodes_init=4,graceful=False,upgrade_version=7.0.0-4072 -t magma.magma_crash_recovery.MagmaCrashTests.test_crash_during_recovery,num_items=5000000,doc_size=1024,sdk_timeout=60,doc_ops=update:delete:expiry,replicas=1,durability=MAJORITY_AND_PERSIST_TO_ACTIVE,GROUP=P0;durability Logs: http://qa.sc.couchbase.com/job/test_suite_executor-TAF/80094/artifact/job_logs/testrunner-20-Dec-21_03-41-47/test_6/
            james.harrison James Harrison added a comment - - edited

            The above failure was preceded by a reconnection of the Consumer stream

            172.23.107.78

            2020-12-21T07:37:44.374274-08:00 WARNING 2604: (default) DCP (Consumer) eq_dcpq:replication:ns_1@172.23.107.95->ns_1@172.23.107.78:default - (vb:825) Setting stream to dead state, last_seqno is 21313, unAckedBytes is 0, status is The stream closed early because the conn was disconnected
            ...
            2020-12-21T07:38:19.796640-08:00 INFO 2601: (default) DCP (Consumer) eq_dcpq:replication:ns_1@172.23.107.95->ns_1@172.23.107.78:default - (vb:825) Attempting to add stream: opaque_:60, start_seqno_:21313, end_seqno_:18446744073709551615, vb_uuid:95167882074391, snap_start_seqno_:21313, snap_end_seqno_:21313, last_seqno:21313, stream_req_value:{"uid":"0"}
            ...
            2020-12-21T07:38:19.860497-08:00 ERROR (default) VBucket::commit (vb:825) failed as no HashTable item found with key:<ud>cid:0x0:test_docs-8105845</ud> prepare_seqno:21308, commit_seqno:21861
            2020-12-21T07:38:19.860540-08:00 WARNING 2601: (default) DCP (Consumer) eq_dcpq:replication:ns_1@172.23.107.95->ns_1@172.23.107.78:default - PassiveStream::processCommit: vb:825 Got error 'no such key' while trying to process commit
            

            We can see that the producer, 172.23.107.95, lost some data in other vbuckets

            172.23.107.78

            2020-12-21T07:38:19.810950-08:00 WARNING (default) (vb:249) Node ns_1@172.23.107.95 acked seqno:22173 lower than previous ack seqno:22196 (first chain)
            2020-12-21T07:38:19.810961-08:00 WARNING (default) (vb:250) Node ns_1@172.23.107.95 acked seqno:21905 lower than previous ack seqno:21921 (first chain)
            2020-12-21T07:38:19.810980-08:00 WARNING (default) (vb:252) Node ns_1@172.23.107.95 acked seqno:21947 lower than previous ack seqno:21967 (first chain)
            2020-12-21T07:38:19.810991-08:00 WARNING (default) (vb:253) Node ns_1@172.23.107.95 acked seqno:21539 lower than previous ack seqno:21562 (first chain)
            2020-12-21T07:38:19.811004-08:00 WARNING (default) (vb:254) Node ns_1@172.23.107.95 acked seqno:21894 lower than previous ack seqno:21916 (first chain)
            2020-12-21T07:38:19.811015-08:00 WARNING (default) (vb:255) Node ns_1@172.23.107.95 acked seqno:22346 lower than previous ack seqno:22366 (first chain)
            2020-12-21T07:38:19.836577-08:00 INFO 2601: (default) DCP (Consumer) eq_dcpq:replication:ns_1@172.23.107.95->ns_1@172.23.107.78:default - (vb:768) Received rollback request. Rolling back to seqno:21717
            2020-12-21T07:38:19.836714-08:00 INFO 2601: (default) DCP (Consumer) eq_dcpq:replication:ns_1@172.23.107.95->ns_1@172.23.107.78:default - (vb:769) Received rollback request. Rolling back to seqno:21607
            2020-12-21T07:38:19.836813-08:00 INFO 2601: (default) DCP (Consumer) eq_dcpq:replication:ns_1@172.23.107.95->ns_1@172.23.107.78:default - (vb:770) Received rollback request. Rolling back to seqno:21812
            2020-12-21T07:38:19.836925-08:00 INFO 2601: (default) DCP (Consumer) eq_dcpq:replication:ns_1@172.23.107.95->ns_1@172.23.107.78:default - (vb:771) Received rollback request. Rolling back to seqno:21806
            2020-12-21T07:38:19.837029-08:00 INFO 2601: (default) DCP (Consumer) eq_dcpq:replication:ns_1@172.23.107.95->ns_1@172.23.107.78:default - (vb:772) Received rollback request. Rolling back to seqno:21812
            2020-12-21T07:38:19.837122-08:00 INFO 2601: (default) DCP (Consumer) eq_dcpq:replication:ns_1@172.23.107.95->ns_1@172.23.107.78:default - (vb:773) Received rollback request. Rolling back to seqno:21980
            2020-12-21T07:38:19.837219-08:00 INFO 2601: (default) DCP (Consumer) eq_dcpq:replication:ns_1@172.23.107.95->ns_1@172.23.107.78:default - (vb:774) Received rollback request. Rolling back to seqno:21636
            2020-12-21T07:38:19.837317-08:00 INFO 2601: (default) DCP (Consumer) eq_dcpq:replication:ns_1@172.23.107.95->ns_1@172.23.107.78:default - (vb:775) Received rollback request. Rolling back to seqno:21533
            

            Over on the producer, we see the stream being recreated, and a backfill completing.

            172.23.107.95

            2020-12-21T07:38:19.828043-08:00 INFO 3214: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.23.107.95->ns_1@172.23.107.78:default - (vb:825) Creating stream with start seqno 21313 and end seqno 18446744073709551615; requested end seqno was 18446744073709551615, collections-manifest uid:none, filter:0 sid:none
            2020-12-21T07:38:19.828115-08:00 INFO 3214: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.23.107.95->ns_1@172.23.107.78:default - (vb:825) ActiveStream::scheduleBackfill_UNLOCKED register cursor with name "eq_dcpq:replication:ns_1@172.23.107.95->ns_1@172.23.107.78:default" backfill:true, seqno:21861
            2020-12-21T07:38:19.828180-08:00 INFO 3214: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.23.107.95->ns_1@172.23.107.78:default - (vb:825) Scheduling backfill from 21314 to 21860, reschedule flag : False
            ...
            2020-12-21T07:38:19.828353-08:00 INFO 3214: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.23.107.95->ns_1@172.23.107.78:default - (vb:825) ActiveStream::markDiskSnapshot: Sending disk snapshot with start 21313, end 21860, and high completed 21301, max visible 21857
            ...
            2020-12-21T07:38:19.828369-08:00 INFO 3214: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.23.107.95->ns_1@172.23.107.78:default - (vb:825) ActiveStream::registerCursor name "eq_dcpq:replication:ns_1@172.23.107.95->ns_1@172.23.107.78:default", backfill:true, seqno:21861
            

            And the connection is closed after the not found error is received

            172.23.107.95

            2020-12-21T07:38:19.835653-08:00 INFO 3214: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.23.107.95->ns_1@172.23.107.78:default - (vb:825) Backfill complete, 547 items read from disk, 0 from memory, last seqno read: 21860, pendingBackfill : False
            2020-12-21T07:38:19.835680-08:00 INFO 3214: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.23.107.95->ns_1@172.23.107.78:default - (vb:825) ActiveStream::transitionState: Transitioning from backfilling to in-memory
             
            2020-12-21T07:38:19.861513-08:00 ERROR 3214: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.23.107.95->ns_1@172.23.107.78:default - DcpProducer::handleResponse disconnecting, received unexpected response:{"bodylen":0,"cas":0,"datatype":"raw","extlen":0,"keylen":0,"magic":"ClientResponse","opaque":60,"opcode":"DCP_COMMIT","status":"Not found"} for stream:stream name:eq_dcpq:replication:ns_1@172.23.107.95->ns_1@172.23.107.78:default, vb:825, state:in-memory
            2020-12-21T07:38:19.861535-08:00 WARNING 3214: process_bin_dcp_response - response_handler returned ENGINE_DISCONNECT - closing connection [ {"ip":"172.23.107.78","port":59242} - {"ip":"172.23.107.95","port":11209} (<ud>@ns_server</ud>) ]
            2020-12-21T07:38:19.861561-08:00 INFO 3214: (No Engine) DCP (Producer) eq_dcpq:replication:ns_1@172.23.107.95->ns_1@172.23.107.78:default - Removing connection [ {"ip":"172.23.107.78","port":59242} - {"ip":"172.23.107.95","port":11209} (<ud>@ns_server</ud>) ]
            2020-12-21T07:38:19.861613-08:00 WARNING 3214: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.23.107.95->ns_1@172.23.107.78:default - (vb:825) Stream closing, sent until seqno 21861 remaining items 0, reason: The stream closed early because the conn was disconnected
            

            Based on the registerCursor logging, the failing commit at seqno 21861 is the first item sent from memory. The matching prepare at 21308 should have been received by the consumer before the reconnection from in-memory streaming, and was not rolled back. The backfill commences at 21314.


            Thanks to Ben Huddleston for a run down on MB-41942 regarding missing failover table entries.

            The active actually restarted more than once in quick succession:

            172.23.107.95

            2020-12-21T07:37:46.074671-08:00 INFO (default) VBucket: created vb:825 with state:active initialState:active lastSeqno:21308 persistedRange:{21308,21308} max_cas:1608565063637532672 uuid:95167882074391 topology:[["ns_1@172.23.107.95","ns_1@172.23.107.78"]]
            2020-12-21T07:37:46.074699-08:00 INFO (default) Warmup::createVBuckets: vb:825 created new failover entry with uuid:34147705071032 and seqno:21308 due to unclean shutdown
            2020-12-21T07:38:06.266106-08:00 INFO (default) VBucket: created vb:825 with state:active initialState:active lastSeqno:21857 persistedRange:{21857,21857} max_cas:1608565067267112961 uuid:95167882074391 topology:[["ns_1@172.23.107.95","ns_1@172.23.107.78"]]
            2020-12-21T07:38:06.266134-08:00 INFO (default) Warmup::createVBuckets: vb:825 created new failover entry with uuid:202953901787434 and seqno:21857 due to unclean shutdown
            

            A failover table entry is created at uuid:34147705071032, but by the time of the next restart, the uuid logged is once again as uuid:95167882074391.

            and a stream to the replica was opened at a seqno between the two failover table entries:

            2020-12-21T07:38:19.828043-08:00 INFO 3214: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.23.107.95->ns_1@172.23.107.78:default - (vb:825) Creating stream with start seqno 21313 and end seqno 18446744073709551615; requested end seqno was 18446744073709551615, collections-manifest uid:none, filter:0 sid:none
            

            21308 < 21313 < 21857
            

            The replica may at that point have a diverging history but still be allowed to continue streaming. The items the replica has at seqnos 21308-21313 may be items the active lost during the first failover. Normally, the replica would be required to rollback because the failover table indicates the replica has seen items at a given seqno, but has not provided the uuid the active has in the failover table for that range of seqnos. However, because the failover table entry was lost due to MB-41942, the replica was instead allowed to continue streaming - the active incorrectly determined that the replica was only behind, not diverging.

            For the sake of a bit of ascii art, the "true" view of history is:
            (seqnos increasing to the right, forks indicating a failover losing items and creating a new uuid at a given seqno)

            +---------------+------>   Timeline A
                            |
                            +------+---->       B
                                   |
                                   +--------->  C
            

            Instead, the active thinks it is as follows:

            +---------------+XXXXXX->  Timeline A
                                   |
                             ??????             B
                                   |
                                   +--------->  C
            
            

            Here, the active still has the items in the range of seqnos covered by ??????, whereas the replica has, at the same seqnos, the different items XXXXXX which the active actually lost in the first failover.

            Because of this inconsistency, the replica may have completed a prepare and removed it from the HT, while the active has since lost that completion to a failover event and has recommitted it after warming up. Upon streaming this new commit the replica fails as seen as the prepare is no longer in the HT.

            As such, this latest set of logs may be a dupe of MB-41942, and are likely unrelated to earlier logs in this ticket.

            james.harrison James Harrison added a comment - - edited The above failure was preceded by a reconnection of the Consumer stream 172.23.107.78 2020-12-21T07:37:44.374274-08:00 WARNING 2604: (default) DCP (Consumer) eq_dcpq:replication:ns_1@172.23.107.95->ns_1@172.23.107.78:default - (vb:825) Setting stream to dead state, last_seqno is 21313, unAckedBytes is 0, status is The stream closed early because the conn was disconnected ... 2020-12-21T07:38:19.796640-08:00 INFO 2601: (default) DCP (Consumer) eq_dcpq:replication:ns_1@172.23.107.95->ns_1@172.23.107.78:default - (vb:825) Attempting to add stream: opaque_:60, start_seqno_:21313, end_seqno_:18446744073709551615, vb_uuid:95167882074391, snap_start_seqno_:21313, snap_end_seqno_:21313, last_seqno:21313, stream_req_value:{"uid":"0"} ... 2020-12-21T07:38:19.860497-08:00 ERROR (default) VBucket::commit (vb:825) failed as no HashTable item found with key:<ud>cid:0x0:test_docs-8105845</ud> prepare_seqno:21308, commit_seqno:21861 2020-12-21T07:38:19.860540-08:00 WARNING 2601: (default) DCP (Consumer) eq_dcpq:replication:ns_1@172.23.107.95->ns_1@172.23.107.78:default - PassiveStream::processCommit: vb:825 Got error 'no such key' while trying to process commit We can see that the producer, 172.23.107.95, lost some data in other vbuckets 172.23.107.78 2020-12-21T07:38:19.810950-08:00 WARNING (default) (vb:249) Node ns_1@172.23.107.95 acked seqno:22173 lower than previous ack seqno:22196 (first chain) 2020-12-21T07:38:19.810961-08:00 WARNING (default) (vb:250) Node ns_1@172.23.107.95 acked seqno:21905 lower than previous ack seqno:21921 (first chain) 2020-12-21T07:38:19.810980-08:00 WARNING (default) (vb:252) Node ns_1@172.23.107.95 acked seqno:21947 lower than previous ack seqno:21967 (first chain) 2020-12-21T07:38:19.810991-08:00 WARNING (default) (vb:253) Node ns_1@172.23.107.95 acked seqno:21539 lower than previous ack seqno:21562 (first chain) 2020-12-21T07:38:19.811004-08:00 WARNING (default) (vb:254) Node ns_1@172.23.107.95 acked seqno:21894 lower than previous ack seqno:21916 (first chain) 2020-12-21T07:38:19.811015-08:00 WARNING (default) (vb:255) Node ns_1@172.23.107.95 acked seqno:22346 lower than previous ack seqno:22366 (first chain) 2020-12-21T07:38:19.836577-08:00 INFO 2601: (default) DCP (Consumer) eq_dcpq:replication:ns_1@172.23.107.95->ns_1@172.23.107.78:default - (vb:768) Received rollback request. Rolling back to seqno:21717 2020-12-21T07:38:19.836714-08:00 INFO 2601: (default) DCP (Consumer) eq_dcpq:replication:ns_1@172.23.107.95->ns_1@172.23.107.78:default - (vb:769) Received rollback request. Rolling back to seqno:21607 2020-12-21T07:38:19.836813-08:00 INFO 2601: (default) DCP (Consumer) eq_dcpq:replication:ns_1@172.23.107.95->ns_1@172.23.107.78:default - (vb:770) Received rollback request. Rolling back to seqno:21812 2020-12-21T07:38:19.836925-08:00 INFO 2601: (default) DCP (Consumer) eq_dcpq:replication:ns_1@172.23.107.95->ns_1@172.23.107.78:default - (vb:771) Received rollback request. Rolling back to seqno:21806 2020-12-21T07:38:19.837029-08:00 INFO 2601: (default) DCP (Consumer) eq_dcpq:replication:ns_1@172.23.107.95->ns_1@172.23.107.78:default - (vb:772) Received rollback request. Rolling back to seqno:21812 2020-12-21T07:38:19.837122-08:00 INFO 2601: (default) DCP (Consumer) eq_dcpq:replication:ns_1@172.23.107.95->ns_1@172.23.107.78:default - (vb:773) Received rollback request. Rolling back to seqno:21980 2020-12-21T07:38:19.837219-08:00 INFO 2601: (default) DCP (Consumer) eq_dcpq:replication:ns_1@172.23.107.95->ns_1@172.23.107.78:default - (vb:774) Received rollback request. Rolling back to seqno:21636 2020-12-21T07:38:19.837317-08:00 INFO 2601: (default) DCP (Consumer) eq_dcpq:replication:ns_1@172.23.107.95->ns_1@172.23.107.78:default - (vb:775) Received rollback request. Rolling back to seqno:21533 Over on the producer, we see the stream being recreated, and a backfill completing. 172.23.107.95 2020-12-21T07:38:19.828043-08:00 INFO 3214: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.23.107.95->ns_1@172.23.107.78:default - (vb:825) Creating stream with start seqno 21313 and end seqno 18446744073709551615; requested end seqno was 18446744073709551615, collections-manifest uid:none, filter:0 sid:none 2020-12-21T07:38:19.828115-08:00 INFO 3214: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.23.107.95->ns_1@172.23.107.78:default - (vb:825) ActiveStream::scheduleBackfill_UNLOCKED register cursor with name "eq_dcpq:replication:ns_1@172.23.107.95->ns_1@172.23.107.78:default" backfill:true, seqno:21861 2020-12-21T07:38:19.828180-08:00 INFO 3214: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.23.107.95->ns_1@172.23.107.78:default - (vb:825) Scheduling backfill from 21314 to 21860, reschedule flag : False ... 2020-12-21T07:38:19.828353-08:00 INFO 3214: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.23.107.95->ns_1@172.23.107.78:default - (vb:825) ActiveStream::markDiskSnapshot: Sending disk snapshot with start 21313, end 21860, and high completed 21301, max visible 21857 ... 2020-12-21T07:38:19.828369-08:00 INFO 3214: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.23.107.95->ns_1@172.23.107.78:default - (vb:825) ActiveStream::registerCursor name "eq_dcpq:replication:ns_1@172.23.107.95->ns_1@172.23.107.78:default", backfill:true, seqno:21861 And the connection is closed after the not found error is received 172.23.107.95 2020-12-21T07:38:19.835653-08:00 INFO 3214: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.23.107.95->ns_1@172.23.107.78:default - (vb:825) Backfill complete, 547 items read from disk, 0 from memory, last seqno read: 21860, pendingBackfill : False 2020-12-21T07:38:19.835680-08:00 INFO 3214: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.23.107.95->ns_1@172.23.107.78:default - (vb:825) ActiveStream::transitionState: Transitioning from backfilling to in-memory   2020-12-21T07:38:19.861513-08:00 ERROR 3214: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.23.107.95->ns_1@172.23.107.78:default - DcpProducer::handleResponse disconnecting, received unexpected response:{"bodylen":0,"cas":0,"datatype":"raw","extlen":0,"keylen":0,"magic":"ClientResponse","opaque":60,"opcode":"DCP_COMMIT","status":"Not found"} for stream:stream name:eq_dcpq:replication:ns_1@172.23.107.95->ns_1@172.23.107.78:default, vb:825, state:in-memory 2020-12-21T07:38:19.861535-08:00 WARNING 3214: process_bin_dcp_response - response_handler returned ENGINE_DISCONNECT - closing connection [ {"ip":"172.23.107.78","port":59242} - {"ip":"172.23.107.95","port":11209} (<ud>@ns_server</ud>) ] 2020-12-21T07:38:19.861561-08:00 INFO 3214: (No Engine) DCP (Producer) eq_dcpq:replication:ns_1@172.23.107.95->ns_1@172.23.107.78:default - Removing connection [ {"ip":"172.23.107.78","port":59242} - {"ip":"172.23.107.95","port":11209} (<ud>@ns_server</ud>) ] 2020-12-21T07:38:19.861613-08:00 WARNING 3214: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.23.107.95->ns_1@172.23.107.78:default - (vb:825) Stream closing, sent until seqno 21861 remaining items 0, reason: The stream closed early because the conn was disconnected Based on the registerCursor logging, the failing commit at seqno 21861 is the first item sent from memory. The matching prepare at 21308 should have been received by the consumer before the reconnection from in-memory streaming, and was not rolled back. The backfill commences at 21314. Thanks to Ben Huddleston for a run down on MB-41942 regarding missing failover table entries. The active actually restarted more than once in quick succession: 172.23.107.95 2020-12-21T07:37:46.074671-08:00 INFO (default) VBucket: created vb:825 with state:active initialState:active lastSeqno:21308 persistedRange:{21308,21308} max_cas:1608565063637532672 uuid:95167882074391 topology:[["ns_1@172.23.107.95","ns_1@172.23.107.78"]] 2020-12-21T07:37:46.074699-08:00 INFO (default) Warmup::createVBuckets: vb:825 created new failover entry with uuid:34147705071032 and seqno:21308 due to unclean shutdown 2020-12-21T07:38:06.266106-08:00 INFO (default) VBucket: created vb:825 with state:active initialState:active lastSeqno:21857 persistedRange:{21857,21857} max_cas:1608565067267112961 uuid:95167882074391 topology:[["ns_1@172.23.107.95","ns_1@172.23.107.78"]] 2020-12-21T07:38:06.266134-08:00 INFO (default) Warmup::createVBuckets: vb:825 created new failover entry with uuid:202953901787434 and seqno:21857 due to unclean shutdown A failover table entry is created at uuid:34147705071032 , but by the time of the next restart, the uuid logged is once again as uuid:95167882074391 . and a stream to the replica was opened at a seqno between the two failover table entries: 2020-12-21T07:38:19.828043-08:00 INFO 3214: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.23.107.95->ns_1@172.23.107.78:default - (vb:825) Creating stream with start seqno 21313 and end seqno 18446744073709551615; requested end seqno was 18446744073709551615, collections-manifest uid:none, filter:0 sid:none 21308 < 21313 < 21857 The replica may at that point have a diverging history but still be allowed to continue streaming. The items the replica has at seqnos 21308-21313 may be items the active lost during the first failover. Normally, the replica would be required to rollback because the failover table indicates the replica has seen items at a given seqno, but has not provided the uuid the active has in the failover table for that range of seqnos. However, because the failover table entry was lost due to MB-41942, the replica was instead allowed to continue streaming - the active incorrectly determined that the replica was only behind , not diverging. For the sake of a bit of ascii art, the "true" view of history is: (seqnos increasing to the right, forks indicating a failover losing items and creating a new uuid at a given seqno) +---------------+------> Timeline A | +------+----> B | +---------> C Instead, the active thinks it is as follows: +---------------+XXXXXX-> Timeline A | ?????? B | +---------> C Here, the active still has the items in the range of seqnos covered by ??????, whereas the replica has, at the same seqnos, the different items XXXXXX which the active actually lost in the first failover. Because of this inconsistency, the replica may have completed a prepare and removed it from the HT, while the active has since lost that completion to a failover event and has recommitted it after warming up. Upon streaming this new commit the replica fails as seen as the prepare is no longer in the HT. As such, this latest set of logs may be a dupe of MB-41942, and are likely unrelated to earlier logs in this ticket.

            Resolving as duplicate. Balakumaran Gopal If this issue is still present after MB-41942 is resolved, feel free to reopen.

            james.harrison James Harrison added a comment - Resolving as duplicate. Balakumaran Gopal If this issue is still present after MB-41942 is resolved, feel free to reopen.

            Mass closing all defects that has Resolution != Fixed

            mihir.kamdar Mihir Kamdar (Inactive) added a comment - Mass closing all defects that has Resolution != Fixed

            People

              james.harrison James Harrison
              Balakumaran.Gopal Balakumaran Gopal
              Votes:
              0 Vote for this issue
              Watchers:
              10 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Gerrit Reviews

                  There are no open Gerrit changes

                  PagerDuty