Couchbase Server / MB-37204

[Jepsen] Couchbase crash during hard-reboot scenario due to HCS > persistedCompletedSeqno


Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Duplicate
    • Affects Version/s: 6.5.0
    • Fix Version/s: None
    • Component/s: couchbase-bucket
    • Labels: None
    • Triage: Triaged
    • Is this a Regression?: Yes
    • Sprint: KV Sprint 2019-12

    Description

      During kv-engine-jepsen-nightly-352 we observed a crash during the following test:
      lein trampoline run test --nodes-file ./nodes --username vagrant --ssh-private-key ./resources/vagrantkey --workload=kill --scenario=hard-reboot --replicas=2 --no-autofailover --durability=0:0:0:100 --eviction-policy=full --cas
      Crash Message:

      2019-12-08T01:36:51.777389+00:00 CRITICAL Caught unhandled std::exception-derived exception. what(): GSL: Precondition failure at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/ep_bucket.cc: 620
      

      ep_bucket.cc:620

                      if (hcs) {
                          Expects(hcs > vbstate.persistedCompletedSeqno);
                          vbstate.persistedCompletedSeqno = *hcs;
                      }
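
      To illustrate the failing precondition, here is a minimal standalone sketch (not the engine code itself; std::optional and the Expects macro below are stand-ins for the engine's optional type and gsl-lite): when the flusher sees an hcs that does not strictly exceed the persisted completed seqno - as happened here, with both equal to 9 - the Expects() fires and the unhandled gsl::fail_fast takes the process down.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <stdexcept>

// Minimal stand-in for gsl-lite's Expects(): throws on precondition
// failure, analogous to the gsl::fail_fast that terminates ep-engine.
#define Expects(cond) \
    do { if (!(cond)) throw std::logic_error("GSL: Precondition failure"); } while (0)

struct VBState {
    uint64_t persistedCompletedSeqno = 0;
};

// Sketch of the update at ep_bucket.cc:620 (simplified): a flushed high
// completed seqno (hcs) must strictly exceed the persisted one before it
// can be stored.
void applyHCS(VBState& vbstate, std::optional<uint64_t> hcs) {
    if (hcs) {
        Expects(*hcs > vbstate.persistedCompletedSeqno);
        vbstate.persistedCompletedSeqno = *hcs;
    }
}
```

      With persistedCompletedSeqno already at 9, any flush carrying hcs <= 9 throws instead of advancing the state, which is exactly the crash signature in this report.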
      

      Couchbase version 6.5.1-6001

      Attachments

        Issue Links


          Activity

            drigby Dave Rigby added a comment - edited

            The earliest minidump in the logs is 29e7f66c-7f2c-8814-34d3a1da-481d982d.dmp on node .174, from:

            2019-12-08T02:18:05.289332+00:00 CRITICAL Breakpad caught a crash (Couchbase version 6.5.1-6001). Writing crash dump to /opt/couchbase/var/lib/couchbase/crash/29e7f66c-7f2c-8814-34d3a1da-481d982d.dmp before terminating.
            

            Backtrace of crashing thread:

            (gdb) bt   
            #0  0x00007f2d6f4cf428 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
            #1  0x00007f2d6f4d102a in __GI_abort () at abort.c:89
            #2  0x00007f2d70030ced in __gnu_cxx::__verbose_terminate_handler () at /tmp/deploy/objdir/../gcc-7.3.0/libstdc++-v3/libsupc++/vterminate.cc:50
            #3  0x0000000000545f12 in backtrace_terminate_handler () at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/utilities/terminate_handler.cc:86
            #4  0x00007f2d7002eb16 in __cxxabiv1::__terminate (handler=<optimized out>) at /tmp/deploy/objdir/../gcc-7.3.0/libstdc++-v3/libsupc++/eh_terminate.cc:47
            #5  0x00007f2d7002eb61 in std::terminate () at /tmp/deploy/objdir/../gcc-7.3.0/libstdc++-v3/libsupc++/eh_terminate.cc:57
            #6  0x00007f2d7002eda3 in __cxxabiv1::__cxa_throw (obj=obj@entry=0x7f2d1c000b40, tinfo=tinfo@entry=0x7f2d6af9d010 <typeinfo for gsl::fail_fast>, dest=dest@entry=0x7f2d6aadebf0 <gsl::fail_fast::~fail_fast()>)
                at /tmp/deploy/objdir/../gcc-7.3.0/libstdc++-v3/libsupc++/eh_throw.cc:93
            #7  0x00007f2d6ab57dd8 in gsl::fail_fast_assert (cond=<optimized out>, message=0x7f2d6acfa710 "GSL: Precondition failure at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/ep_bucket.cc: 620")
                at /home/couchbase/jenkins/workspace/couchbase-server-unix/third_party/gsl-lite/include/gsl/gsl-lite.h:473
            #8  0x00007f2d6ab6b39f in EPBucket::flushVBucket (this=0x7f2d6bdef300, vbid=...) at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/ep_bucket.cc:620
            #9  0x00007f2d6abbc6e0 in Flusher::flushVB (this=this@entry=0x7f2d6bf11000) at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/flusher.cc:306
            #10 0x00007f2d6abbd5b1 in Flusher::step (this=0x7f2d6bf11000, task=0x7f2d6ef6d970) at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/flusher.cc:215
            #11 0x00007f2d6abb7154 in ExecutorThread::run (this=0x7f2d6be547a0) at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/executorthread.cc:190
            #12 0x00007f2d71e4ce27 in CouchbaseThread::run (this=0x7f2d3fd123f0) at /home/couchbase/jenkins/workspace/couchbase-server-unix/platform/src/cb_pthreads.cc:58
            #13 platform_thread_wrap (arg=0x7f2d3fd123f0) at /home/couchbase/jenkins/workspace/couchbase-server-unix/platform/src/cb_pthreads.cc:71
            #14 0x00007f2d6f86b6ba in start_thread (arg=0x7f2d3cdfa700) at pthread_create.c:333
            #15 0x00007f2d6f5a141d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
            

            Frame #8 has most of the interesting state - we crashed when attempting to flush vb:134 (replica). hcs is not available (optimized out), but vbstate.persistedCompletedSeqno is available and equals 9; for the precondition to fail, hcs must have been set to a value of 9 or lower:

            #8  0x00007f2d6ab6b39f in EPBucket::flushVBucket (this=0x7f2d6bdef300, vbid=...) at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/ep_bucket.cc:620
                    rlh = {lock_ = 0x7f2d3fea62a0, token_ = {type_ = folly::SharedMutexToken::Type::INLINE_SHARED, slot_ = 0}}
                    options = VBStatePersist::VBSTATE_CACHE_UPDATE_ONLY
                    prev = <optimized out>
                    maxSeqno = 36
                    maxVbStateOpCas = <optimized out>
                    minSeqno = 35
                    mustCheckpointVBState = <optimized out>
                    collectionFlush = ...
                    flush_end = <optimized out>
                    persistedVbState = <optimized out>
                    hcs = <optimized out>
                    hps = <optimized out>
                    maxVisibleSeqno = <optimized out>
                    trans_time = <optimized out>
                    toFlush = {items = ...,
                               ranges = ...,
                               moreAvailable = false,
                               maxDeletedRevSeqno = {<boost::optional_detail::tc_optional_base<unsigned long>> = {<boost::optional_detail::optional_tag> = {<No data fields>}, 
                          m_initialized = false, m_storage = 0}, <No data fields>},
                               checkpointType = CheckpointType::Disk}
                    range = {<boost::optional_detail::optional_base<snapshot_range_t>> = {<boost::optional_detail::optional_tag> = {<No data fields>}, m_initialized = true, m_storage = {dummy_ = {data = "$\000\000\000\000\000\000\000$\000\000\000\000\000\000", 
                            aligner_ = {<No data fields>}}}}, <No data fields>}
                    rwUnderlying = 0x7f2d6ef46800
                    vbstate = {static CurrentVersion = 3,
                               state = vbucket_state_dead,
                               maxDeletedSeqno = {counter = {_M_elems = "\000\000\000\000\000"}},
                               highSeqno = 10,
                               purgeSeqno = 0,
                               lastSnapStart = 36,
                               lastSnapEnd = 36,
                               maxCas = 1575771481264816128,
                               hlcCasEpochSeqno = 1,
                               mightContainXattrs = false,
                               supportsNamespaces = true,
                               version = 3,
                               persistedCompletedSeqno = 9,
                               persistedPreparedSeqno = 9,
                               highPreparedSeqno = 36,
                               maxVisibleSeqno = 10,
                               onDiskPrepares = 1,
                               checkpointType = CheckpointType::Disk,
                               transition = {failovers = {...},
                                             replicationTopology = {...},
                                             state = vbucket_state_replica}
                               }
                    vbstateRollback = {static CurrentVersion = 3,
                                       state = vbucket_state_dead,
                                       maxDeletedSeqno = {counter = {_M_elems = "\000\000\000\000\000"}},
                                       highSeqno = 10,
                                       lastSnapStart = 10,
                                       lastSnapEnd = 10,
                                       maxCas = 1575771443439534080, 
                                       hlcCasEpochSeqno = 1,
                                       mightContainXattrs = false,
                                       supportsNamespaces = true,
                                       version = 3,
                                       persistedCompletedSeqno = 9,
                                       persistedPreparedSeqno = 9,
                                       highPreparedSeqno = 9,
                                       maxVisibleSeqno = 10,
                                       onDiskPrepares = 1, 
                                       checkpointType = CheckpointType::Memory,
                                       transition = {failovers = ...,
                                                     replicationTopology = {...},
                                                     state = vbucket_state_replica}}
                    items = ...
                    items_flushed = <optimized out>
                    moreAvailable = false
                    flush_start = {__d = {__r = 28865831823}}
                    vb = ...
            

            From the logs, the history of vb:134 is as follows:

            1. Clean shutdown of bucket (from ns_server).
            2. On restart, it was created as a replica during warmup with no prepares loaded from disk; it was then populated from node .15 and threw an exception during flushing:

            2019-12-08T02:18:03.505480+00:00 INFO (default) VBucket: created vb:134 with state:replica initialState:replica lastSeqno:10 persistedRange:{10,10} max_cas:1575771443439534080 uuid:27948380356516 topology:null
            2019-12-08T02:18:03.790250+00:00 INFO (default) EPVBucket::loadOutstandingPrepares: (vb:134) created DM with PCS:9, PPS:9, HPS:9, number of prepares loaded:0
            2019-12-08T02:18:03.810857+00:00 DEBUG (default) Mapped new vb:134 in state replica
            2019-12-08T02:18:03.899025+00:00 DEBUG (default) CouchKVStore::snapshotVBucket: Snapshotted vb:134 state:{"checkpoint_type":"Memory","completed_seqno":"9","failover_table":[{"id":27948380356516,"seq":0}],"high_prepared_seqno":"9","high_seqno":"10","hlc_epoch":"1","max_cas":"1575771443439534080","max_deleted_seqno":"0","max_visible_seqno":"10","might_contain_xattrs":false,"namespaces_supported":true,"on_disk_prepares":"1","prepared_seqno":"9","purge_seqno":"0","snap_end":"10","snap_start":"10","state":"replica","version":3}
            2019-12-08T02:18:04.775912+00:00 INFO 64: (default) DCP (Consumer) eq_dcpq:replication:ns_1@172.28.128.15->ns_1@172.28.128.174:default - (vb:134) Attempting to add stream: opaque_:52, start_seqno_:10, end_seqno_:18446744073709551615, vb_uuid:27948380356516, snap_start_seqno_:10, snap_end_seqno_:10, last_seqno:10, stream_req_value:{"uid":"0"}
            2019-12-08T02:18:04.963678+00:00 DEBUG 64: (default) DCP (Consumer) eq_dcpq:replication:ns_1@172.28.128.15->ns_1@172.28.128.174:default - (vb:134) Add stream for opaque 52 with error code succeeded
            2019-12-08T02:18:04.963706+00:00 DEBUG 64: (default) DCP (Consumer) eq_dcpq:replication:ns_1@172.28.128.15->ns_1@172.28.128.174:default - PassiveStream::transitionState: (vb:134) Transitioning from pending to awaiting-first-snapshot-marker
            2019-12-08T02:18:04.963944+00:00 DEBUG 64: (default) DCP (Consumer) eq_dcpq:replication:ns_1@172.28.128.15->ns_1@172.28.128.174:default - PassiveStream::transitionState: (vb:134) Transitioning from awaiting-first-snapshot-marker to reading
             
            # Expects() in flushVBucket() fails 2s later:
            2019-12-08T02:18:05.071273+00:00 CRITICAL Caught unhandled std::exception-derived exception. what(): GSL: Precondition failure at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/ep_bucket.cc: 620
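
             The log sequence above can be condensed into a small sketch of the state that makes the flush fatal (illustrative names, not kv_engine APIs): warmup leaves persistedCompletedSeqno at 9, and the first Disk checkpoint flushed after the stream from node .15 carries an hcs that has not moved past 9, so the strict greater-than precondition cannot hold.

```cpp
#include <cstdint>
#include <optional>

// Illustrative reconstruction of the vb:134 timeline; names here are
// hypothetical, not kv_engine interfaces.
struct ReplicaVB {
    uint64_t persistedCompletedSeqno;
};

// Warmup: loadOutstandingPrepares created the DM with PCS:9, and
// snapshotVBucket wrote completed_seqno=9 back to disk.
ReplicaVB warmedUpReplica() {
    return ReplicaVB{9};
}

// The Expects() at ep_bucket.cc:620 requires hcs > persistedCompletedSeqno;
// a flush whose hcs is still <= 9 therefore crashes the flusher thread.
bool flushWouldCrash(const ReplicaVB& vb, std::optional<uint64_t> hcs) {
    return hcs.has_value() && !(*hcs > vb.persistedCompletedSeqno);
}
```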
            

            drigby Dave Rigby added a comment -

            Problem identified - the issue is with the fix for MB-37161 (http://review.couchbase.org/#/c/119024/). The assumption that the DCP Consumer would have fully negotiated SyncReplication by the time the PassiveStream object is created is invalid. We must instead wait for DCP Control negotiation to complete.

            The fix should be relatively straightforward - it should be ready by end of tomorrow (Tuesday 10th).
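
            The corrected approach can be sketched as follows (illustrative class and method names, not the actual kv_engine interfaces; the real change lives under MB-37161): the consumer records SyncReplication support only once the DCP Control negotiation has actually completed, and stream creation consults that recorded state instead of assuming it.

```cpp
// Illustrative sketch of the described fix direction; names are
// hypothetical, not the actual kv_engine interfaces.
class DcpConsumerSketch {
public:
    // Invoked only when the DCP Control negotiation for SyncReplication
    // has actually completed (e.g. the producer acknowledged the control
    // message), rather than assumed at PassiveStream creation time.
    void onSyncReplicationNegotiated() {
        syncReplicationNegotiated = true;
    }

    // PassiveStream creation consults the negotiated state, so a stream
    // set up before negotiation finishes does not claim SyncReplication.
    bool streamSupportsSyncReplication() const {
        return syncReplicationNegotiated;
    }

private:
    bool syncReplicationNegotiated = false;
};
```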

            drigby Dave Rigby added a comment -

            This issue is caused by an incomplete fix for MB-37161 - the initial fix for MB-37161 incorrectly assumed DCP features were negotiated before PassiveStream objects would be created.

            Re-opening MB-37161 and amending the fix there.

            drigby Dave Rigby added a comment -

            Duplicate of MB-37161.


            raju Raju Suravarjjala added a comment -

            Bulk closing invalid, duplicate and won't fix bugs

            People

              Assignee: drigby Dave Rigby
              Reporter: richard.demellow Richard deMellow