Couchbase Server / MB-39668

Memcached coredumps found: data load + collection maxTTL + durability + graceful failover + full recovery


Details

    Description

      Script to repro

      ./testrunner -i /tmp/testexec.20770.ini GROUP=collections_maxt_ttl_durability_majority_and_persist_active,rerun=False -t bucket_collections.collections_rebalance.CollectionsRebalance.test_data_load_collections_with_graceful_failover_recovery,data_load_stage=during,nodes_failover=1,replicas=Bucket.ReplicaNum.TWO,durability=MAJORITY_AND_PERSIST_TO_ACTIVE,recovery_type=full,rerun=False,nodes_init=5,GROUP=collections_maxt_ttl_durability_majority_and_persist_active,bucket_spec=multi_bucket.buckets_all_membase_for_rebalance_tests_with_ttl,data_load_spec=ttl_load,override_spec_params=durability;replicas
      

      Steps to Repro
      1) Create a 5-node cluster
      2020-05-31 22:17:01,617 | test | INFO | pool-8-thread-6 | [table_view:display:72] Rebalance Overview
      +----------------+----------+--------------+
      | Nodes          | Services | Status       |
      +----------------+----------+--------------+
      | 172.23.123.99  | kv       | Cluster node |
      | 172.23.123.97  | None     | <--- IN ---  |
      | 172.23.123.98  | None     | <--- IN ---  |
      | 172.23.123.96  | None     | <--- IN ---  |
      | 172.23.121.215 | None     | <--- IN ---  |
      +----------------+----------+--------------+

      2) Create scopes + collections with maxTTL enabled + data load (see the REST sketch after the table below)
      2020-05-31 22:21:35,349 | test | INFO | MainThread | [table_view:display:72] Bucket statistics
      +---------+---------+----------+-----+--------+------------+-----------+-----------+
      | Bucket  | Type    | Replicas | TTL | Items  | RAM Quota  | RAM Used  | Disk Used |
      +---------+---------+----------+-----+--------+------------+-----------+-----------+
      | bucket1 | membase | 2        | 0   | 20000  | 524288000  | 115836928 | 227019523 |
      | bucket2 | membase | 2        | 0   | 30000  | 524288000  | 111503680 | 344862284 |
      | default | membase | 2        | 0   | 110000 | 7864320000 | 164804832 | 197975843 |
      +---------+---------+----------+-----+--------+------------+-----------+-----------+
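
      For reference, scopes and maxTTL collections like the ones in step 2 can also be created directly through ns_server's collections REST API (the test drives this through the bucket_spec instead). A minimal libcurl sketch, assuming the documented scopes/collections endpoints; host, credentials, and the scope/collection names are placeholders:

      #include <curl/curl.h>
      #include <string>

      // Hedged sketch: create a scope, then a collection with a maxTTL, via
      // the collections REST API. All names and credentials are placeholders.
      static CURLcode post(CURL* curl, const std::string& url, const std::string& body) {
          curl_easy_setopt(curl, CURLOPT_URL, url.c_str());
          curl_easy_setopt(curl, CURLOPT_USERPWD, "Administrator:password");
          curl_easy_setopt(curl, CURLOPT_POSTFIELDS, body.c_str());
          return curl_easy_perform(curl);
      }

      int main() {
          curl_global_init(CURL_GLOBAL_DEFAULT);
          CURL* curl = curl_easy_init();
          if (!curl) {
              return 1;
          }
          const std::string bucket = "http://172.23.123.99:8091/pools/default/buckets/default";
          post(curl, bucket + "/scopes", "name=scope1");
          // maxTTL is in seconds; documents in the collection expire after it.
          post(curl, bucket + "/scopes/scope1/collections", "name=collection1&maxTTL=100");
          curl_easy_cleanup(curl);
          curl_global_cleanup();
          return 0;
      }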

      3) Start data load again
      2020-05-31 22:21:38,750 | test | INFO | MainThread | [collections_rebalance:load_collections_with_rebalance:528] Doing collection data load during graceful_failover_recovery

      4) Start graceful failover
      2020-05-31 22:21:38,750 | test | INFO | MainThread | [collections_rebalance:rebalance_operation:110] Starting rebalance operation of type : graceful_failover_recovery
      2020-05-31 22:22:18,857 | test | INFO | MainThread | [collections_rebalance:wait_for_failover_or_assert:82] 1 nodes failed over as expected in 40.1039998531 seconds

      5) Start data load again
      2020-05-31 22:22:18,857 | test | INFO | MainThread | [collections_rebalance:data_load_after_failover:63] Starting a sync data load after failover

      6) Start full recovery + rebalance
      2020-05-31 22:23:24,517 | test | WARNING | MainThread | [rest_client:get_nodes:1671] 172.23.121.215 - Node not part of cluster inactiveFailed
      2020-05-31 22:23:31,980 | test | INFO | MainThread | [common_lib:sleep:22] Sleep 60 seconds. Reason: MB-38497
      2020-05-31 22:23:31,994 | test | INFO | pool-8-thread-6 | [table_view:display:72] Rebalance Overview
      +----------------+----------+--------------+
      | Nodes          | Services | Status       |
      +----------------+----------+--------------+
      | 172.23.123.98  | kv       | Cluster node |
      | 172.23.121.215 | kv       | Cluster node |
      | 172.23.123.97  | kv       | Cluster node |
      | 172.23.123.96  | kv       | Cluster node |
      | 172.23.123.99  | kv       | Cluster node |
      +----------------+----------+--------------+

      2020-05-31 22:23:32,006 | test | INFO | pool-8-thread-6 | [task:check:318] Rebalance - status: running, progress: 0
      2020-05-31 22:23:37,051 | test | INFO | pool-8-thread-6 | [task:check:318] Rebalance - status: running, progress: 3.57119544469
      2020-05-31 22:23:42,117 | test | INFO | pool-8-thread-6 | [task:check:318] Rebalance - status: running, progress: 17.8320907427
      2020-05-31 22:23:47,155 | test | INFO | pool-8-thread-6 | [task:check:318] Rebalance - status: running, progress: 33.3333333333
      2020-05-31 22:23:52,193 | test | INFO | pool-8-thread-6 | [task:check:318] Rebalance - status: running, progress: 41.7232494062
      2020-05-31 22:23:57,246 | test | INFO | pool-8-thread-6 | [task:check:318] Rebalance - status: running, progress: 58.2498936183
      2020-05-31 22:24:02,305 | test | INFO | pool-8-thread-6 | [task:check:318] Rebalance - status: running, progress: 68.3694998964
      2020-05-31 22:24:07,380 | test | INFO | pool-8-thread-6 | [task:check:318] Rebalance - status: running, progress: 80.6589720231
      2020-05-31 22:24:12,444 | test | INFO | pool-8-thread-6 | [task:check:318] Rebalance - status: running, progress: 97.7709339668
      2020-05-31 22:24:22,519 | test | INFO | pool-8-thread-6 | [task:check:318] Rebalance - status: none, progress: 100

      Rebalance completes successfully. We then wait for the data load started in step 5 to complete; once it does, we find a coredump on 172.23.123.97.
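
      For reference, the failover and recovery operations in steps 4 and 6 map onto ns_server REST endpoints, which the framework ultimately calls through its rest_client. A hedged libcurl sketch of the same sequence (otpNode values, credentials, and the node list are placeholders):

      #include <curl/curl.h>
      #include <string>

      // Hedged sketch of the endpoints behind steps 4 and 6. The '@' in
      // otpNode values must be URL-encoded as %40.
      static CURLcode post(CURL* curl, const std::string& url, const std::string& body) {
          curl_easy_setopt(curl, CURLOPT_URL, url.c_str());
          curl_easy_setopt(curl, CURLOPT_USERPWD, "Administrator:password");
          curl_easy_setopt(curl, CURLOPT_POSTFIELDS, body.c_str());
          return curl_easy_perform(curl);
      }

      int main() {
          curl_global_init(CURL_GLOBAL_DEFAULT);
          CURL* curl = curl_easy_init();
          if (!curl) {
              return 1;
          }
          const std::string base = "http://172.23.123.99:8091";
          // Step 4: graceful failover of one node.
          post(curl, base + "/controller/startGracefulFailover", "otpNode=ns_1%40172.23.121.215");
          // Step 6: mark the node for full recovery, then rebalance it back in.
          post(curl, base + "/controller/setRecoveryType", "otpNode=ns_1%40172.23.121.215&recoveryType=full");
          post(curl, base + "/controller/rebalance", "knownNodes=..."); // full otpNode list elided
          curl_easy_cleanup(curl);
          curl_global_cleanup();
          return 0;
      }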

      From the memcached log:

      cbcollect_info_ns_1@172.23.123.97_20200601-053045/memcached.log:2020-05-31T22:15:35.373958-07:00 CRITICAL *** Fatal error encountered during exception handling ***
      cbcollect_info_ns_1@172.23.123.97_20200601-053045/memcached.log:2020-05-31T22:15:35.374059-07:00 CRITICAL Caught unhandled std::exception-derived exception. what(): cb::mcbp::is_request(): Invalid magic: 0x00
      cbcollect_info_ns_1@172.23.123.97_20200601-053045/memcached.log:2020-05-31T22:15:35.399510-07:00 CRITICAL *** Fatal error encountered during exception handling ***
      cbcollect_info_ns_1@172.23.123.97_20200601-053045/memcached.log:2020-05-31T22:15:35.572997-07:00 CRITICAL Breakpad caught a crash (Couchbase version 7.0.0-2208). Writing crash dump to /opt/couchbase/var/lib/couchbase/crash/fb2663cc-b00b-47c6-7ff701a2-69d4adf6.dmp before terminating.
      cbcollect_info_ns_1@172.23.123.97_20200601-053045/memcached.log:2020-05-31T22:15:35.573035-07:00 CRITICAL Stack backtrace of crashed thread:
      cbcollect_info_ns_1@172.23.123.97_20200601-053045/memcached.log:2020-05-31T22:15:35.573351-07:00 CRITICAL     /opt/couchbase/bin/memcached() [0x400000+0x1397ad]
      cbcollect_info_ns_1@172.23.123.97_20200601-053045/memcached.log:2020-05-31T22:15:35.573368-07:00 CRITICAL     /opt/couchbase/bin/memcached(_ZN15google_breakpad16ExceptionHandler12GenerateDumpEPNS0_12CrashContextE+0x3ea) [0x400000+0x14f4fa]
      cbcollect_info_ns_1@172.23.123.97_20200601-053045/memcached.log:2020-05-31T22:15:35.573384-07:00 CRITICAL     /opt/couchbase/bin/memcached(_ZN15google_breakpad16ExceptionHandler13SignalHandlerEiP9siginfo_tPv+0xb8) [0x400000+0x14f838]
      cbcollect_info_ns_1@172.23.123.97_20200601-053045/memcached.log:2020-05-31T22:15:35.573400-07:00 CRITICAL     /lib64/libpthread.so.0() [0x7fe099b0f000+0xf5f0]
      cbcollect_info_ns_1@172.23.123.97_20200601-053045/memcached.log:2020-05-31T22:15:35.573440-07:00 CRITICAL     /lib64/libc.so.6(gsignal+0x37) [0x7fe099741000+0x36337]
      cbcollect_info_ns_1@172.23.123.97_20200601-053045/memcached.log:2020-05-31T22:15:35.573472-07:00 CRITICAL     /lib64/libc.so.6(abort+0x148) [0x7fe099741000+0x37a28]
      cbcollect_info_ns_1@172.23.123.97_20200601-053045/memcached.log:2020-05-31T22:15:35.573509-07:00 CRITICAL     /opt/couchbase/bin/../lib/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x125) [0x7fe09a244000+0x91195]
      cbcollect_info_ns_1@172.23.123.97_20200601-053045/memcached.log:2020-05-31T22:15:35.573525-07:00 CRITICAL     /opt/couchbase/bin/memcached() [0x400000+0x14a9f2]
      cbcollect_info_ns_1@172.23.123.97_20200601-053045/memcached.log:2020-05-31T22:15:35.573553-07:00 CRITICAL     /opt/couchbase/bin/../lib/libstdc++.so.6() [0x7fe09a244000+0x8ef86]
      cbcollect_info_ns_1@172.23.123.97_20200601-053045/memcached.log:2020-05-31T22:15:35.573580-07:00 CRITICAL     /opt/couchbase/bin/../lib/libstdc++.so.6() [0x7fe09a244000+0x8efd1]
      cbcollect_info_ns_1@172.23.123.97_20200601-053045/memcached.log:2020-05-31T22:15:35.573606-07:00 CRITICAL     /opt/couchbase/bin/../lib/libstdc++.so.6() [0x7fe09a244000+0x8f213]
      cbcollect_info_ns_1@172.23.123.97_20200601-053045/memcached.log:2020-05-31T22:15:35.573616-07:00 CRITICAL     /opt/couchbase/bin/memcached() [0x400000+0x10575a]
      cbcollect_info_ns_1@172.23.123.97_20200601-053045/memcached.log:2020-05-31T22:15:35.573629-07:00 CRITICAL     /opt/couchbase/bin/memcached() [0x400000+0x10cbe4]
      cbcollect_info_ns_1@172.23.123.97_20200601-053045/memcached.log:2020-05-31T22:15:35.573644-07:00 CRITICAL     /opt/couchbase/bin/memcached() [0x400000+0x9583f]
      cbcollect_info_ns_1@172.23.123.97_20200601-053045/memcached.log:2020-05-31T22:15:35.573658-07:00 CRITICAL     /opt/couchbase/bin/memcached() [0x400000+0x84869]
      cbcollect_info_ns_1@172.23.123.97_20200601-053045/memcached.log:2020-05-31T22:15:35.573671-07:00 CRITICAL     /opt/couchbase/bin/memcached() [0x400000+0x84be5]
      cbcollect_info_ns_1@172.23.123.97_20200601-053045/memcached.log:2020-05-31T22:15:35.573683-07:00 CRITICAL     /opt/couchbase/bin/../lib/libevent_core-2.1.so.7() [0x7fe09b592000+0xed8e]
      cbcollect_info_ns_1@172.23.123.97_20200601-053045/memcached.log:2020-05-31T22:15:35.573689-07:00 CRITICAL     /opt/couchbase/bin/../lib/libevent_core-2.1.so.7() [0x7fe09b592000+0x17d01]
      cbcollect_info_ns_1@172.23.123.97_20200601-053045/memcached.log:2020-05-31T22:15:35.573700-07:00 CRITICAL     /opt/couchbase/bin/../lib/libevent_core-2.1.so.7(event_base_loop+0x3bf) [0x7fe09b592000+0x1854f]
      cbcollect_info_ns_1@172.23.123.97_20200601-053045/memcached.log:2020-05-31T22:15:35.573717-07:00 CRITICAL     /opt/couchbase/bin/memcached() [0x400000+0x692a9]
      cbcollect_info_ns_1@172.23.123.97_20200601-053045/memcached.log:2020-05-31T22:15:35.573729-07:00 CRITICAL     /opt/couchbase/bin/../lib/libplatform_so.so.0.1.0() [0x7fe09c4c9000+0x10777]
      cbcollect_info_ns_1@172.23.123.97_20200601-053045/memcached.log:2020-05-31T22:15:35.573735-07:00 CRITICAL     /lib64/libpthread.so.0() [0x7fe099b0f000+0x7e65]
      cbcollect_info_ns_1@172.23.123.97_20200601-053045/memcached.log:2020-05-31T22:15:35.573782-07:00 CRITICAL     /lib64/libc.so.6(clone+0x6d) [0x7fe099741000+0xfe88d]
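
      The immediate trigger is the MCBP magic-byte check: cb::mcbp::is_request() (frame #7 in the backtrace below, magic.cc:72) throws std::invalid_argument when the header's magic byte is neither a known request nor response value, and here it is 0x00. A minimal sketch of that style of check, assuming the usual MCBP magic values (illustrative; the real enum and formatting live in kv_engine):

      #include <cstdint>
      #include <stdexcept>
      #include <string>

      // Illustrative MCBP magic values: request/response pairs for the
      // classic and flexible-framing encodings, plus server-initiated ones.
      enum class Magic : uint8_t {
          ClientRequest = 0x80,
          AltClientRequest = 0x08,
          ClientResponse = 0x81,
          AltClientResponse = 0x18,
          ServerRequest = 0x82,
          ServerResponse = 0x83,
      };

      bool is_request(Magic magic) {
          switch (magic) {
          case Magic::ClientRequest:
          case Magic::AltClientRequest:
          case Magic::ServerRequest:
              return true;
          case Magic::ClientResponse:
          case Magic::AltClientResponse:
          case Magic::ServerResponse:
              return false;
          }
          // A zeroed header (magic 0x00) falls through to here and throws;
          // the log above shows exactly this message with magic 0x00.
          throw std::invalid_argument("cb::mcbp::is_request(): Invalid magic: " +
                                      std::to_string(static_cast<int>(magic)));
      }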
      

      gdb stacktrace

      (gdb) bt full
      #0  0x00007fe099777337 in ssignal () from /lib64/libc.so.6
      No symbol table info available.
      #1  0x00007fe099778a28 in abort () from /lib64/libc.so.6
      No symbol table info available.
      #2  0x00007fe09a2d5195 in __gnu_cxx::__verbose_terminate_handler () at /tmp/deploy/gcc-7.3.0/libstdc++-v3/libsupc++/vterminate.cc:95
              terminating = false
              t = <optimized out>
      #3  0x000000000054a9f2 in backtrace_terminate_handler () at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/utilities/terminate_handler.cc:86
      No locals.
      #4  0x00007fe09a2d2f86 in __cxxabiv1::__terminate (handler=<optimized out>) at /tmp/deploy/gcc-7.3.0/libstdc++-v3/libsupc++/eh_terminate.cc:47
      No locals.
      #5  0x00007fe09a2d2fd1 in std::terminate () at /tmp/deploy/gcc-7.3.0/libstdc++-v3/libsupc++/eh_terminate.cc:57
      No locals.
      #6  0x00007fe09a2d3213 in __cxxabiv1::__cxa_throw (obj=obj@entry=0x7fe06c05d8f0, tinfo=0x7fb928 <typeinfo for std::invalid_argument>, dest=0x418ec0 <_ZNSt16invalid_argumentD1Ev@plt>)
          at /tmp/deploy/gcc-7.3.0/libstdc++-v3/libsupc++/eh_throw.cc:93
              globals = <optimized out>
              header = 0x7fe06c05d870
      #7  0x000000000050575a in cb::mcbp::is_request (magic=<optimized out>) at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/protocol/mcbp/magic.cc:72
      No locals.
      #8  0x000000000050cbe4 in isRequest (this=0x7fe042c4b620) at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/include/mcbp/protocol/header.h:107
      No locals.
      #9  cb::mcbp::Header::toJSON[abi:cxx11](bool) const (this=0x7fe042c4b620, validated=<optimized out>) at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/protocol/mcbp/header.cc:37
      No locals.
      #10 0x000000000049583f in Cookie::toJSON[abi:cxx11]() const (this=0x7fe0434efc00) at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/daemon/cookie.cc:53
              header = <optimized out>
              ret = {m_type = nlohmann::detail::null, m_value = {object = 0x0, array = 0x0, string = 0x0, boolean = false, number_integer = 0, number_unsigned = 0, number_float = 0}}
      #11 0x0000000000484869 in Connection::executeCommandsCallback (this=this@entry=0x7fe0434ba500) at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/daemon/connection.cc:646
              start = <optimized out>
              stop = <optimized out>
              ns = <optimized out>
      #12 0x0000000000484be5 in Connection::rw_callback (ctx=0x7fe0434ba500) at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/daemon/connection.cc:762
              instance = @0x7fe0434ba500: <error reading variable>
              phosphor_internal_category_enabled_741 = {_M_b = {_M_p = 0x0}, static is_always_lock_free = <error reading variable: No global symbol "std::atomic<std::atomic<phosphor::CategoryStatus> const*>::is_always_lock_free".>}
              phosphor_internal_category_enabled_temp_741 = <optimized out>
              phosphor_internal_tpi_wait_741 = {category = 0x587cf0 "mutex", name = 0x58b620 "Connection::rw_callback::threadLock.wait", type = phosphor::Complete, argument_names = {_M_elems = {0x587c77 "this", 0x59da13 ""}}, 
                argument_types = {_M_elems = {phosphor::is_pointer, phosphor::is_none}}}
              phosphor_internal_tpi_held_741 = {category = 0x587cf0 "mutex", name = 0x58b5f0 "Connection::rw_callback::threadLock.held", type = phosphor::Complete, argument_names = {_M_elems = {0x59da13 "", 0x59da13 ""}}, argument_types = {
                  _M_elems = {phosphor::is_pointer, phosphor::is_none}}}
              phosphor_internal_guard_741 = {tpiWait = 0x58b860 <Connection::rw_callback(bufferevent*, void*)::phosphor_internal_tpi_wait_741>, tpiHeld = 0x58b820 <Connection::rw_callback(bufferevent*, void*)::phosphor_internal_tpi_held_741>, 
                enabled = true, mutex = @0x7fe0982570d0, threshold = {__r = 10000000}, start = {__d = {__r = 279975507770418}}, lockedAt = {__d = {__r = 279975507771346}}, releasedAt = {__d = {__r = 0}}}
      #13 0x00007fe09b5a0d8e in bufferevent_run_deferred_callbacks_unlocked (cb=<optimized out>, arg=0x7fe06c0587e0)
          at /home/couchbase/jenkins/workspace/cbdeps-platform-build-old/deps/packages/build/libevent/libevent-prefix/src/libevent/bufferevent.c:208
              readcb = 0x484a20 <Connection::rw_callback(bufferevent*, void*)>
              cbarg = 0x7fe0434ba500
              bufev_private = 0x7fe06c0587e0
              bufev = 0x7fe06c0587e0
      #14 0x00007fe09b5a9d01 in event_process_active_single_queue (base=base@entry=0xdd9250, max_to_process=max_to_process@entry=2147483647, endtime=endtime@entry=0x0, activeq=<optimized out>)
          at /home/couchbase/jenkins/workspace/cbdeps-platform-build-old/deps/packages/build/libevent/libevent-prefix/src/libevent/event.c:1726
              evcb_cbfinalize = <optimized out>
              ev = 0x0
              evcb = 0x7fe06c058980
              count = 4
      #15 0x00007fe09b5aa54f in event_process_active (base=0xdd9250) at /home/couchbase/jenkins/workspace/cbdeps-platform-build-old/deps/packages/build/libevent/libevent-prefix/src/libevent/event.c:1789
              activeq = <optimized out>
              i = 0
              c = 0
              tv = {tv_sec = 0, tv_usec = 140602627979135}
              maxcb = 2147483647
              endtime = 0x0
              limit_after_prio = 2147483647
      #16 event_base_loop (base=0xdd9250, flags=flags@entry=0) at /home/couchbase/jenkins/workspace/cbdeps-platform-build-old/deps/packages/build/libevent/libevent-prefix/src/libevent/event.c:2012
              evsel = 0x7fe09b7bdc20 <epollops>
              tv = {tv_sec = 140602581140192, tv_usec = 140602669517482}
              tv_p = <optimized out>
              res = <optimized out>
              done = 0
              retval = 0
              __func__ = "event_base_loop"
      #17 0x00000000004692a9 in worker_libevent (arg=0x7fe098256ff8) at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/daemon/thread.cc:210
              me = @0x7fe098256ff8: <error reading variable>
      #18 0x00007fe09c4d9777 in run (this=0x7fe098286790) at /home/couchbase/jenkins/workspace/couchbase-server-unix/platform/src/cb_pthreads.cc:58
      No locals.
      #19 platform_thread_wrap (arg=0x7fe098286790) at /home/couchbase/jenkins/workspace/couchbase-server-unix/platform/src/cb_pthreads.cc:71
              context = {_M_t = {
                  _M_t = {<std::_Tuple_impl<0, CouchbaseThread*, std::default_delete<CouchbaseThread> >> = {<std::_Tuple_impl<1, std::default_delete<CouchbaseThread> >> = {<std::_Head_base<1, std::default_delete<CouchbaseThread>, true>> = {<std::default_delete<CouchbaseThread>> = {<No data fields>}, <No data fields>}, <No data fields>}, <std::_Head_base<0, CouchbaseThread*, false>> = {_M_head_impl = 0x7fe098286790}, <No data fields>}, <No data fields>}}}
      #20 0x00007fe099b16e65 in start_thread () from /lib64/libpthread.so.0
      No symbol table info available.
      #21 0x00007fe09983f88d in clone () from /lib64/libc.so.6
      No symbol table info available.
      #22 0x0000000000000000 in ?? ()
      No symbol table info available.
      (gdb) 
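
      Frames #7-#11 pin down the failure path: Connection::executeCommandsCallback() asks Cookie::toJSON() for the connection's diagnostic state, Header::toJSON() calls isRequest() on the packet header, and cb::mcbp::is_request() throws because the magic byte is 0x00, i.e. the cookie appears to reference a zeroed or not-yet-populated packet. Nothing up the libevent callback stack handles the exception, so std::terminate() runs and Breakpad writes the dump. A minimal sketch of that chain, using hypothetical stand-in types for the kv_engine classes:

      #include <cstdint>
      #include <stdexcept>
      #include <string>

      // Hypothetical stand-ins for kv_engine's Header and Cookie; only the
      // call chain (frames #7-#11) is taken from the backtrace.
      struct Header {
          uint8_t magic = 0x00; // zeroed, like the crashed connection's packet

          bool isRequest() const {
              switch (magic) {
              case 0x80: case 0x08: return true;  // request magics (illustrative)
              case 0x81: case 0x18: return false; // response magics (illustrative)
              }
              throw std::invalid_argument("Invalid magic: 0x00");
          }

          std::string toJSON() const {
              // The header is classified before it is serialized, so a
              // garbage header throws before any JSON is produced.
              return isRequest() ? "{\"type\":\"request\"}" : "{\"type\":\"response\"}";
          }
      };

      struct Cookie {
          const Header* packet = nullptr;

          std::string toJSON() const {
              // Frame #10: Cookie::toJSON() serializes the packet header.
              return packet ? packet->toJSON() : "{}";
          }
      };

      int main() {
          Header zeroed;          // magic 0x00
          Cookie cookie{&zeroed};
          cookie.toJSON();        // throws; left uncaught here, as in the
                                  // crash, so std::terminate() aborts.
      }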
      

      cbcollect_info attached.


              People

                Assignee: Balakumaran.Gopal (Balakumaran Gopal)
                Reporter: Balakumaran.Gopal (Balakumaran Gopal)
