Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-50747

[Magma] :- EPVBucket::setNumTotalItems vb:169 caught exception itemCount:18446744073709551569 - onDiskPrepares:0 - systemEvents:0 = 18446744073709551569 e.what:ThrowExceptionUnderflowPolicy current:0 arg:-47

    XMLWordPrintable

Details

    • Untriaged
    • 1
    • No
    • KV 2022-Feb

    Description

      Script to Repro

      guides/gradlew --refresh-dependencies testrunner -P jython=/opt/jython/bin/jython -P 'args=-i /tmp/win10-bucket-ops.ini rerun=False,get-cbcollect-info=True,GROUP=hard_failover_and_delta_recovery_P0_set0,nodes_init=4,doc_size=250,bucket_spec=magma_dgm.10_percent_dgm.5_node_1_replica_magma_512_single_bucket,nodes_failover=1 -t bucket_collections.collections_rebalance.CollectionsRebalance.test_data_load_collections_with_hard_failover_recovery,nodes_init=5,nodes_failover=2,recovery_type=delta,step_count=1,bucket_spec=magma_dgm.5_percent_dgm.5_node_2_replica_magma_512,data_load_stage=during,doc_size=512,randomize_value=True,skip_validations=False,GROUP=hard_failover_and_delta_recovery_P0_set0'
      

      Steps to Repro
      1. Create a 4 node cluster.

      2022-02-02 02:29:41,744 | test  | INFO    | MainThread | [table_view:display:72] Cluster statistics
      +----------------+----------+-----------------+-----------+----------+---------------------+-------------------+-----------------------+
      | Node           | Services | CPU_utilization | Mem_total | Mem_free | Swap_mem_used       | Active / Replica  | Version               |
      +----------------+----------+-----------------+-----------+----------+---------------------+-------------------+-----------------------+
      | 172.23.104.186 | kv       | 1.07904642409   | 5.66 GiB  | 4.97 GiB | 0.0 Byte / 3.50 GiB | 0 / 0             | 7.1.0-2211-enterprise |
      | 172.23.120.201 | kv       | 1.85370741483   | 3.67 GiB  | 3.09 GiB | 2.50 MiB / 3.50 GiB | 0 / 0             | 7.1.0-2211-enterprise |
      | 172.23.120.206 | kv       | 1.63029847003   | 3.67 GiB  | 3.08 GiB | 0.0 Byte / 0.0 Byte | 0 / 0             | 7.1.0-2211-enterprise |
      | 172.23.121.10  | kv       | 2.30807827396   | 3.91 GiB  | 3.31 GiB | 0.0 Byte / 3.50 GiB | 0 / 0             | 7.1.0-2211-enterprise |
      +----------------+----------+-----------------+-----------+----------+---------------------+-------------------+-----------------------+
      

      2. Create bucket/scopes/collections/data

      2022-02-02 03:10:36,709 | test  | INFO    | MainThread | [table_view:display:72] Bucket statistics
      +---------+-----------+-----------------+----------+------------+-----+----------+-----------+----------+-----------+---------------+
      | Bucket  | Type      | Storage Backend | Replicas | Durability | TTL | Items    | RAM Quota | RAM Used | Disk Used | ARR           |
      +---------+-----------+-----------------+----------+------------+-----+----------+-----------+----------+-----------+---------------+
      | default | couchbase | magma           | 1        | none       | 0   | 24500000 | 2.00 GiB  | 1.26 GiB | 9.05 GiB  | 16.5019428571 |
      +---------+-----------+-----------------+----------+------------+-----+----------+-----------+----------+-----------+---------------+
      

      3. Hard failover a node 172.23.104.186.

      2022-02-02 03:10:41,178 | test  | INFO    | MainThread | [collections_rebalance:rebalance_operation:755] failing over nodes [ip:172.23.104.186 port:8091 ssh_username:root]
      

      4. Delta recover the node and do a rebalance.

      2022-02-02 06:47:58,918 | test  | WARNING | MainThread | [rest_client:get_nodes:1885] 172.23.104.186 - Node not part of cluster inactiveFailed
      2022-02-02 06:49:10,990 | test  | INFO    | MainThread | [bucket_ready_functions:perform_tasks_from_spec:4953] Performing scope/collection specific operations
      2022-02-02 06:49:10,997 | test  | INFO    | MainThread | [bucket_ready_functions:perform_tasks_from_spec:5043] Done Performing scope/collection specific operations
      2022-02-02 06:49:11,190 | test  | INFO    | pool-5-thread-24 | [table_view:display:72] Rebalance Overview
      +----------------+----------+-----------------------+---------------+--------------+
      | Nodes          | Services | Version               | CPU           | Status       |
      +----------------+----------+-----------------------+---------------+--------------+
      | 172.23.104.186 | kv       | 7.1.0-2211-enterprise | 1.25407574618 | Cluster node |
      | 172.23.120.201 | kv       | 7.1.0-2211-enterprise | 5.23940837303 | Cluster node |
      | 172.23.120.206 | kv       | 7.1.0-2211-enterprise | 15.7775541017 | Cluster node |
      | 172.23.121.10  | kv       | 7.1.0-2211-enterprise | 5.55695882799 | Cluster node |
      +----------------+----------+-----------------------+---------------+--------------+
      

      5. This rebalance hangs for good 35+ mins
      5. So we do a cbcollect and kill memcached on one nodes and delete the buckets.

      At this point we saw this minidump. I was running these scripts in a loop to verify MB-50519 and this has almost exact same set of steps and the minidump is seen at the almost exact time during the test. So, very well could be related to that.

      grep CRITICAL on 172.23.104.186

      memcached.log.000001.txt:2022-02-02T07:46:49.616458-08:00 CRITICAL (default) EPVBucket::setNumTotalItems vb:166 caught exception itemCount:18446744073709551576 - onDiskPrepares:0  - systemEvents:0 = 18446744073709551576 e.what:ThrowExceptionUnderflowPolicy current:0 arg:-40
      memcached.log.000001.txt:2022-02-02T07:46:49.616560-08:00 CRITICAL *** Fatal error encountered during exception handling ***
      memcached.log.000001.txt:2022-02-02T07:46:49.616642-08:00 CRITICAL (default) EPVBucket::setNumTotalItems vb:168 caught exception itemCount:18446744073709551569 - onDiskPrepares:0  - systemEvents:0 = 18446744073709551569 e.what:ThrowExceptionUnderflowPolicy current:0 arg:-47
      memcached.log.000001.txt:2022-02-02T07:46:49.616647-08:00 CRITICAL Caught unhandled std::exception-derived exception. what(): std::exception
      memcached.log.000001.txt:2022-02-02T07:46:49.616682-08:00 CRITICAL *** Fatal error encountered during exception handling ***
      memcached.log.000001.txt:2022-02-02T07:46:49.617390-08:00 CRITICAL (default) EPVBucket::setNumTotalItems vb:169 caught exception itemCount:18446744073709551569 - onDiskPrepares:0  - systemEvents:0 = 18446744073709551569 e.what:ThrowExceptionUnderflowPolicy current:0 arg:-47
      memcached.log.000001.txt:2022-02-02T07:46:49.617432-08:00 CRITICAL *** Fatal error encountered during exception handling ***
      memcached.log.000001.txt:2022-02-02T07:46:50.085731-08:00 CRITICAL Detected previous crash
      memcached.log.000001.txt:2022-02-02T07:46:50.085819-08:00 CRITICAL Breakpad caught a crash (Couchbase version 7.1.0-2211). Writing crash dump to /opt/couchbase/var/lib/couchbase/crash/138fc47e-079c-45c8-16e52ca1-0089835e.dmp before terminating.
      memcached.log.000001.txt:2022-02-02T07:46:50.085837-08:00 CRITICAL Stack backtrace of crashed thread:
      memcached.log.000001.txt:2022-02-02T07:46:50.085839-08:00 CRITICAL    #0  /opt/couchbase/bin/memcached() [0x400000+0x739fb8]
      memcached.log.000001.txt:2022-02-02T07:46:50.085841-08:00 CRITICAL    #1  /opt/couchbase/bin/memcached(_ZN15google_breakpad16ExceptionHandler12GenerateDumpEPNS0_12CrashContextE+0x3ea) [0x400000+0x78b78a]
      memcached.log.000001.txt:2022-02-02T07:46:50.085844-08:00 CRITICAL    #2  /opt/couchbase/bin/memcached(_ZN15google_breakpad16ExceptionHandler13SignalHandlerEiP9siginfo_tPv+0xb8) [0x400000+0x78bac8]
      memcached.log.000001.txt:2022-02-02T07:46:50.085846-08:00 CRITICAL    #3  /lib64/libpthread.so.0() [0x7f70c345f000+0xf630]
      memcached.log.000001.txt:2022-02-02T07:46:50.085848-08:00 CRITICAL    #4  /lib64/libc.so.6(gsignal+0x37) [0x7f70c0ca9000+0x363d7]
      memcached.log.000001.txt:2022-02-02T07:46:50.085849-08:00 CRITICAL    #5  /lib64/libc.so.6(abort+0x148) [0x7f70c0ca9000+0x37ac8]
      memcached.log.000001.txt:2022-02-02T07:46:50.085851-08:00 CRITICAL    #6  /opt/couchbase/bin/../lib/libstdc++.so.6() [0x7f70c1591000+0xa67f2]
      memcached.log.000001.txt:2022-02-02T07:46:50.085853-08:00 CRITICAL    #7  /opt/couchbase/bin/memcached() [0x400000+0x74448b]
      memcached.log.000001.txt:2022-02-02T07:46:50.085855-08:00 CRITICAL    #8  /opt/couchbase/bin/../lib/libstdc++.so.6() [0x7f70c1591000+0xa48f6]
      memcached.log.000001.txt:2022-02-02T07:46:50.085856-08:00 CRITICAL    #9  /opt/couchbase/bin/../lib/libstdc++.so.6() [0x7f70c1591000+0xa4961]
      memcached.log.000001.txt:2022-02-02T07:46:50.085857-08:00 CRITICAL    #10 /opt/couchbase/bin/../lib/libstdc++.so.6() [0x7f70c1591000+0xa4bf4]
      memcached.log.000001.txt:2022-02-02T07:46:50.085859-08:00 CRITICAL    #11 /opt/couchbase/bin/memcached() [0x400000+0xcf077]
      memcached.log.000001.txt:2022-02-02T07:46:50.085861-08:00 CRITICAL    #12 /opt/couchbase/bin/memcached() [0x400000+0x349401]
      memcached.log.000001.txt:2022-02-02T07:46:50.085863-08:00 CRITICAL    #13 /opt/couchbase/bin/memcached() [0x400000+0x353421]
      memcached.log.000001.txt:2022-02-02T07:46:50.085864-08:00 CRITICAL    #14 /opt/couchbase/bin/memcached() [0x400000+0x6b7502]
      memcached.log.000001.txt:2022-02-02T07:46:50.085865-08:00 CRITICAL    #15 /opt/couchbase/bin/memcached() [0x400000+0x6b4605]
      memcached.log.000001.txt:2022-02-02T07:46:50.085867-08:00 CRITICAL    #16 /opt/couchbase/bin/memcached() [0x400000+0x808ec0]
      memcached.log.000001.txt:2022-02-02T07:46:50.085868-08:00 CRITICAL    #17 /opt/couchbase/bin/memcached() [0x400000+0x7f37fa]
      memcached.log.000001.txt:2022-02-02T07:46:50.085869-08:00 CRITICAL    #18 /opt/couchbase/bin/memcached() [0x400000+0x80be79]
      memcached.log.000001.txt:2022-02-02T07:46:50.085871-08:00 CRITICAL    #19 /opt/couchbase/bin/memcached() [0x400000+0x6b4294]
      memcached.log.000001.txt:2022-02-02T07:46:50.085872-08:00 CRITICAL    #20 /opt/couchbase/bin/../lib/libstdc++.so.6() [0x7f70c1591000+0xcdd40]
      memcached.log.000001.txt:2022-02-02T07:46:50.085874-08:00 CRITICAL    #21 /lib64/libpthread.so.0() [0x7f70c345f000+0x7ea5]
      memcached.log.000001.txt:2022-02-02T07:46:50.085875-08:00 CRITICAL    #22 /lib64/libc.so.6(clone+0x6d) [0x7f70c0ca9000+0xfe9fd]
      [root@s71920 logs]# 
      

      bt of 138fc47e-079c-45c8-16e52ca1-0089835e.dmp from 172.23.104.186

      (gdb) bt
      #0  0x00007f70c0cdf3d7 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:55
      #1  0x00007f70c0ce0ac8 in __GI_abort () at abort.c:90
      #2  0x00007f70c16377f2 in __gnu_cxx::__verbose_terminate_handler() () at /tmp/deploy/objdir/../gcc-10.2.0/libstdc++-v3/libsupc++/vterminate.cc:50
      #3  0x0000000000b4448b in backtrace_terminate_handler() () at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/utilities/terminate_handler.cc:88
      #4  0x00007f70c16358f6 in __cxxabiv1::__terminate(void (*)()) () at /tmp/deploy/objdir/../gcc-10.2.0/libstdc++-v3/libsupc++/eh_terminate.cc:48
      #5  0x00007f70c1635961 in std::terminate () at /tmp/deploy/objdir/../gcc-10.2.0/libstdc++-v3/libsupc++/eh_terminate.cc:58
      #6  0x00007f70c1635bf4 in __cxxabiv1::__cxa_throw (obj=<optimized out>, tinfo=0x108ee10 <typeinfo for std::exception>, dest=0x7f70c1634b30 <std::exception::~exception()>) at /tmp/deploy/objdir/../gcc-10.2.0/libstdc++-v3/libsupc++/eh_throw.cc:95
      #7  0x00000000004cf077 in EPVBucket::setNumTotalItems (this=0x7f704a578700, kvstore=...) at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/ep_vb.cc:1198
      #8  0x0000000000749401 in Warmup::estimateDatabaseItemCount (this=0x7f7050b17900, shardId=<optimized out>) at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/kvshard.h:72
      #9  0x0000000000753421 in WarmupEstimateDatabaseItemCount::run (this=0x7f70a41b4d30) at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/warmup.cc:266
      #10 0x0000000000ab7502 in GlobalTask::execute() () at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/executor/globaltask.cc:68
      #11 0x0000000000ab4605 in FollyExecutorPool::TaskProxy::scheduleViaCPUPool()::{lambda()#2}::operator()() const (__closure=0x7f70ad3ea840) at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/executor/folly_executorpool.cc:189
      #12 0x0000000000c08ec0 in operator() (this=0x7f70ad3ea840) at /home/couchbase/jenkins/workspace/cbdeps-platform-build-old/deps/packages/build/folly/folly-prefix/src/folly/folly/Function.h:416
      #13 folly::ThreadPoolExecutor::runTask(std::shared_ptr<folly::ThreadPoolExecutor::Thread> const&, folly::ThreadPoolExecutor::Task&&) (this=this@entry=0x7f70bfa5a400, thread=..., 
          task=task@entry=<unknown type in /usr/lib/debug/opt/couchbase/bin/memcached-7.1.0-2211.x86_64.debug, CU 0xa39c047, DIE 0xa41ff7c>)
          at /home/couchbase/jenkins/workspace/cbdeps-platform-build-old/deps/packages/build/folly/folly-prefix/src/folly/folly/executors/ThreadPoolExecutor.cpp:97
      #14 0x0000000000bf37fa in folly::CPUThreadPoolExecutor::threadRun (this=0x7f70bfa5a400, thread=...)
          at /home/couchbase/jenkins/workspace/cbdeps-platform-build-old/deps/packages/build/folly/folly-prefix/src/folly/folly/executors/CPUThreadPoolExecutor.cpp:265
      #15 0x0000000000c0be79 in __invoke_impl<void, void (folly::ThreadPoolExecutor::*&)(std::shared_ptr<folly::ThreadPoolExecutor::Thread>), folly::ThreadPoolExecutor*&, std::shared_ptr<folly::ThreadPoolExecutor::Thread>&> (__t=<optimized out>, 
          __f=<optimized out>) at /usr/local/include/c++/7.3.0/bits/invoke.h:73
      #16 __invoke<void (folly::ThreadPoolExecutor::*&)(std::shared_ptr<folly::ThreadPoolExecutor::Thread>), folly::ThreadPoolExecutor*&, std::shared_ptr<folly::ThreadPoolExecutor::Thread>&> (__fn=<optimized out>)
          at /usr/local/include/c++/7.3.0/bits/invoke.h:95
      #17 __call<void, 0, 1> (__args=<optimized out>, this=<optimized out>) at /usr/local/include/c++/7.3.0/functional:467
      #18 operator()<> (this=<optimized out>) at /usr/local/include/c++/7.3.0/functional:551
      #19 folly::detail::function::FunctionTraits<void ()>::callBig<std::_Bind<void (folly::ThreadPoolExecutor::*(folly::ThreadPoolExecutor*, std::shared_ptr<folly::ThreadPoolExecutor::Thread>))(std::shared_ptr<folly::ThreadPoolExecutor::Thread>)> >(folly::detail::function::Data&) (p=...) at /home/couchbase/jenkins/workspace/cbdeps-platform-build-old/deps/packages/build/folly/folly-prefix/src/folly/folly/Function.h:401
      #20 0x0000000000ab4294 in operator() (this=0x7f70bfabfbc0) at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/executor/folly_executorpool.cc:47
      #21 operator() (__closure=0x7f70bfabfbc0) at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/executor/folly_executorpool.cc:47
      #22 folly::detail::function::FunctionTraits<void ()>::callBig<CBRegisteredThreadFactory::newThread(folly::Function<void ()>&&)::{lambda()#1}>(folly::detail::function::Data&) (p=...)
          at /home/couchbase/jenkins/workspace/couchbase-server-unix/server_build/tlm/deps/folly.exploded/include/folly/Function.h:401
      #23 0x00007f70c165ed40 in execute_native_thread_routine () at /tmp/deploy/objdir/../gcc-10.2.0/libstdc++-v3/src/c++11/thread.cc:80
      #24 0x00007f70c3466ea5 in start_thread (arg=0x7f70ad3fc700) at pthread_create.c:307
      #25 0x00007f70c0da79fd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
      (gdb) 
      

      cbcollect_info attached.

      Attachments

        1. info_threads.txt
          8 kB
        2. bt_full.txt
          15 kB
        3. thread_apply_all_bt.txt
          136 kB
        4. consoleText.txt
          3.86 MB
        5. Buckets_page.png
          Buckets_page.png
          196 kB
        6. Scope_page.png
          Scope_page.png
          214 kB
        7. Scopes_repro_toy_819180.png
          Scopes_repro_toy_819180.png
          218 kB
        8. Buckets_repro_toy_819180.png
          Buckets_repro_toy_819180.png
          345 kB
        9. consoleText_MB-50747_repro_toy_819180.zip
          15.83 MB

        Activity

          People

            Balakumaran.Gopal Balakumaran Gopal
            Balakumaran.Gopal Balakumaran Gopal
            Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              PagerDuty