  1. Couchbase Server
  2. MB-44097

Crash when collection disk size underflows with concurrent flush & compaction

Details

    • Untriaged
    • Centos 64-bit
    • 1
    • Yes
    • KV-Engine 2021-Feb

    Description

      Script to Repro

      guides/gradlew --refresh-dependencies testrunner -P jython=/opt/jython/bin/jython -P 'args=-i /tmp/win10-bucket-ops.ini rerun=False,quota_percent=95,crash_warning=True -t bucket_collections.collections_rebalance.CollectionsRebalance.test_data_load_collections_with_hard_failover_recovery,data_load_stage=during,quota_percent=80,nodes_failover=2,recovery_type=full,rerun=False,nodes_init=5,bucket_spec=multi_bucket.buckets_for_rebalance_tests_more_collections,data_load_spec=volume_test_load_with_CRUD_on_collections'
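The command above packs two comma-separated parameter lists into one `args` string: global parameters after the `.ini` file and per-test parameters after `-t`. The following is a minimal sketch for splitting them apart when reading the repro; `parse_params` and `parse_test_spec` are hypothetical helpers written for this note, not part of testrunner.

```python
def parse_params(text):
    """Turn 'k1=v1,k2=v2' into a dict of strings (no type conversion)."""
    params = {}
    for pair in text.split(","):
        key, _, value = pair.partition("=")
        params[key] = value
    return params


def parse_test_spec(spec):
    """Split 'module.Class.test,k1=v1,...' into (test_path, params)."""
    test_path, _, rest = spec.partition(",")
    return test_path, (parse_params(rest) if rest else {})


# Values copied from the repro command above.
global_params = parse_params("rerun=False,quota_percent=95,crash_warning=True")
test_path, test_params = parse_test_spec(
    "bucket_collections.collections_rebalance.CollectionsRebalance."
    "test_data_load_collections_with_hard_failover_recovery,"
    "data_load_stage=during,quota_percent=80,nodes_failover=2,"
    "recovery_type=full,rerun=False,nodes_init=5"
)
print(test_path)
print(global_params["quota_percent"], test_params["quota_percent"])
```

Note that `quota_percent` appears in both lists with different values (95 and 80); the report does not say which one takes effect.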
      

      Steps to Repro
      1) Create a 5 node cluster
      2021-02-03 19:29:07,426 | test | INFO | pool-1-thread-6 | [table_view:display:72] Rebalance Overview

      Nodes           Services  Status
      172.23.98.196   kv        Cluster node
      172.23.98.195   None      <--- IN ---
      172.23.121.10   None      <--- IN ---
      172.23.104.186  None      <--- IN ---
      172.23.120.206  None      <--- IN ---

      2) Create buckets/scopes/collections/data.
      2021-02-03 19:34:17,336 | test | INFO | MainThread | [table_view:display:72] Bucket statistics

      Bucket   Type       Replicas  Durability  TTL  Items   RAM Quota    RAM Used   Disk Used
      bucket1  couchbase  3         none        0    3000    1048576000   218121544  314572404
      bucket2  ephemeral  3         none        0    3000    1048576000   331036392  170
      default  couchbase  3         none        0    500000  10485760000  706696616  558261158

      3) Hard failover 2 nodes.

      2021-02-03 19:34:24,032 | test  | INFO    | MainThread | [collections_rebalance:rebalance_operation:600] failing over nodes [ip:172.23.104.186 port:8091 ssh_username:root, ip:172.23.120.206 port:8091 ssh_username:root]
      2021-02-03 19:34:36,875 | test  | INFO    | pool-1-thread-20 | [rest_client:monitorRebalance:1438] Rebalance done. Taken 8.33500003815 seconds to complete
      2021-02-03 19:34:36,887 | test  | INFO    | pool-1-thread-20 | [common_lib:sleep:22] Sleep 8.33500003815 seconds. Reason: Wait after rebalance complete
      2021-02-03 19:36:45,301 | test  | INFO    | MainThread | [collections_rebalance:wait_for_failover_or_assert:224] 1 nodes failed over as expected in 0.0710000991821 seconds
      2021-02-03 19:36:59,030 | test  | INFO    | pool-1-thread-25 | [rest_client:monitorRebalance:1438] Rebalance done. Taken 8.6819999218 seconds to complete
      2021-02-03 19:36:59,039 | test  | INFO    | pool-1-thread-25 | [common_lib:sleep:22] Sleep 8.6819999218 seconds. Reason: Wait after rebalance complete
      2021-02-03 19:39:08,802 | test  | INFO    | MainThread | [collections_rebalance:wait_for_failover_or_assert:224] 2 nodes failed over as expected in 1.07400012016 seconds
      

      4) Do full recovery + rebalance. Rebalance fails.

      2021-02-03 19:39:53,246 | test  | WARNING | MainThread | [rest_client:get_nodes:1696] 172.23.104.186 - Node not part of cluster inactiveFailed
      2021-02-03 19:39:53,249 | test  | WARNING | MainThread | [rest_client:get_nodes:1696] 172.23.120.206 - Node not part of cluster inactiveFailed
      

      We see the following coredumps on 172.23.98.196, 172.23.98.195 and 172.23.121.10.

      grep for CRITICAL in the memcached logs on 172.23.98.196 (crash dump 97539118-64f3-442a-bb4c8ab6-c98e1f02.dmp):

      [root@s81706 logs]# grep CRITICAL memcached.log.0000*
      memcached.log.000011.txt:2021-02-03T19:39:45.584027-08:00 CRITICAL *** Fatal error encountered during exception handling ***
      memcached.log.000011.txt:2021-02-03T19:39:45.585012-08:00 CRITICAL Caught unhandled std::exception-derived exception. what(): ThrowExceptionUnderflowPolicy current:0 arg:1
      memcached.log.000011.txt:2021-02-03T19:39:46.065505-08:00 CRITICAL Breakpad caught a crash (Couchbase version 7.0.0-4374). Writing crash dump to /opt/couchbase/var/lib/couchbase/crash/97539118-64f3-442a-bb4c8ab6-c98e1f02.dmp before terminating.
      memcached.log.000011.txt:2021-02-03T19:39:46.065583-08:00 CRITICAL Stack backtrace of crashed thread:
      memcached.log.000011.txt:2021-02-03T19:39:46.065909-08:00 CRITICAL     /opt/couchbase/bin/memcached() [0x400000+0x145bbd]
      memcached.log.000011.txt:2021-02-03T19:39:46.065948-08:00 CRITICAL     /opt/couchbase/bin/memcached(_ZN15google_breakpad16ExceptionHandler12GenerateDumpEPNS0_12CrashContextE+0x3ea) [0x400000+0x15b3fa]
      memcached.log.000011.txt:2021-02-03T19:39:46.065969-08:00 CRITICAL     /opt/couchbase/bin/memcached(_ZN15google_breakpad16ExceptionHandler13SignalHandlerEiP9siginfo_tPv+0xb8) [0x400000+0x15b738]
      memcached.log.000011.txt:2021-02-03T19:39:46.065989-08:00 CRITICAL     /lib64/libpthread.so.0() [0x7f98c92b1000+0xf630]
      memcached.log.000011.txt:2021-02-03T19:39:46.066055-08:00 CRITICAL     /lib64/libc.so.6(gsignal+0x37) [0x7f98c8ee3000+0x36387]
      memcached.log.000011.txt:2021-02-03T19:39:46.066112-08:00 CRITICAL     /lib64/libc.so.6(abort+0x148) [0x7f98c8ee3000+0x37a78]
      memcached.log.000011.txt:2021-02-03T19:39:46.066190-08:00 CRITICAL     /opt/couchbase/bin/../lib/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x125) [0x7f98c99e6000+0x91195]
      memcached.log.000011.txt:2021-02-03T19:39:46.066215-08:00 CRITICAL     /opt/couchbase/bin/memcached() [0x400000+0x155632]
      memcached.log.000011.txt:2021-02-03T19:39:46.066259-08:00 CRITICAL     /opt/couchbase/bin/../lib/libstdc++.so.6() [0x7f98c99e6000+0x8ef86]
      memcached.log.000011.txt:2021-02-03T19:39:46.066292-08:00 CRITICAL     /opt/couchbase/bin/../lib/libstdc++.so.6() [0x7f98c99e6000+0x8efd1]
      memcached.log.000011.txt:2021-02-03T19:39:46.066315-08:00 CRITICAL     /opt/couchbase/bin/../lib/libep.so() [0x7f98cd317000+0x16f2f3]
      memcached.log.000011.txt:2021-02-03T19:39:46.066330-08:00 CRITICAL     /opt/couchbase/bin/../lib/libep.so() [0x7f98cd317000+0x169352]
      memcached.log.000011.txt:2021-02-03T19:39:46.066352-08:00 CRITICAL     /opt/couchbase/bin/../lib/libep.so() [0x7f98cd317000+0x2e9bd6]
      memcached.log.000011.txt:2021-02-03T19:39:46.066372-08:00 CRITICAL     /opt/couchbase/bin/../lib/libep.so() [0x7f98cd317000+0x2d20ca]
      memcached.log.000011.txt:2021-02-03T19:39:46.066397-08:00 CRITICAL     /opt/couchbase/bin/../lib/libep.so() [0x7f98cd317000+0x2eccf9]
      memcached.log.000011.txt:2021-02-03T19:39:46.066422-08:00 CRITICAL     /opt/couchbase/bin/../lib/libep.so() [0x7f98cd317000+0x167793]
      memcached.log.000011.txt:2021-02-03T19:39:46.066501-08:00 CRITICAL     /opt/couchbase/bin/../lib/libstdc++.so.6() [0x7f98c99e6000+0xb9dcf]
      memcached.log.000011.txt:2021-02-03T19:39:46.066517-08:00 CRITICAL     /lib64/libpthread.so.0() [0x7f98c92b1000+0x7ea5]
      memcached.log.000011.txt:2021-02-03T19:39:46.066774-08:00 CRITICAL     /lib64/libc.so.6(clone+0x6d) [0x7f98c8ee3000+0xfe8dd]
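Most of the frames in the backtrace above are unsymbolized `module() [base+offset]` pairs, so the module and offset are the only usable information until the dump is symbolized. The following is a minimal sketch for extracting them from the grep output; `FRAME_RE`, `parse_frame` and `offsets_for` are hypothetical names for this note, and the regular expression assumes the line layout shown above.

```python
import re

# Matches "CRITICAL   /path/to/module(symbol+0xNN) [0xBASE+0xOFFSET]".
# The symbol group is empty for unsymbolized frames such as "libep.so()".
FRAME_RE = re.compile(
    r"CRITICAL\s+(?P<module>/\S+?)\((?P<symbol>[^)]*)\)\s+"
    r"\[0x(?P<base>[0-9a-f]+)\+0x(?P<offset>[0-9a-f]+)\]"
)


def parse_frame(line):
    """Return a dict for a backtrace frame line, or None for any other line."""
    match = FRAME_RE.search(line)
    if match is None:
        return None
    return {
        "module": match.group("module"),
        "symbol": match.group("symbol") or None,
        "offset": int(match.group("offset"), 16),
    }


def offsets_for(lines, module_suffix):
    """Collect frame offsets (as hex strings) for modules ending in module_suffix."""
    frames = (parse_frame(line) for line in lines)
    return [hex(f["offset"]) for f in frames
            if f is not None and f["module"].endswith(module_suffix)]


sample = [
    "memcached.log.000011.txt:2021-02-03T19:39:46.065583-08:00 CRITICAL Stack backtrace of crashed thread:",
    "memcached.log.000011.txt:2021-02-03T19:39:46.066055-08:00 CRITICAL     /lib64/libc.so.6(gsignal+0x37) [0x7f98c8ee3000+0x36387]",
    "memcached.log.000011.txt:2021-02-03T19:39:46.066315-08:00 CRITICAL     /opt/couchbase/bin/../lib/libep.so() [0x7f98cd317000+0x16f2f3]",
]
print(offsets_for(sample, "libep.so"))
```

Offsets collected this way can be passed to whatever symbolizer matches the build; the attached bt_full.txt is presumably the symbolized form of the same stack.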
      

      cbcollect_info attached. This was not seen on 7.0.0-4342.
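The `what()` string in the crash log, `ThrowExceptionUnderflowPolicy current:0 arg:1`, reads as a non-negative counter at 0 being asked to subtract 1. The following is a minimal sketch of that kind of counter, assuming the summary's description that flush and compaction both adjust the collection disk size. `NonNegativeCounter`, `UnderflowError` and `replay` are hypothetical stand-ins, not kv_engine code, and the double subtraction shown is only one way such an underflow could arise; the report itself does not establish the actual interleaving.

```python
class UnderflowError(Exception):
    """Raised when a subtraction would take the counter below zero."""


class NonNegativeCounter(object):
    """Counter that throws instead of wrapping or clamping (illustration only)."""

    def __init__(self, value=0):
        if value < 0:
            raise ValueError("initial value must be non-negative")
        self.value = value

    def add(self, arg):
        self.value += arg

    def sub(self, arg):
        if arg > self.value:
            # Message format mirrors the what() string in the log above.
            raise UnderflowError(
                "ThrowExceptionUnderflowPolicy current:{} arg:{}".format(
                    self.value, arg))
        self.value -= arg


def replay(initial, deltas):
    """Apply signed size deltas in order and return the final counter value."""
    counter = NonNegativeCounter(initial)
    for delta in deltas:
        if delta >= 0:
            counter.add(delta)
        else:
            counter.sub(-delta)
    return counter.value


# Assumed scenario: two updaters each account for the same 1-byte reduction.
try:
    replay(1, [-1, -1])
except UnderflowError as err:
    print(err)
```

A throwing policy turns an accounting bug into an immediate crash, which matches the observed behaviour: the exception was not handled and memcached aborted.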

      Attachments

        1. bt_full_all_threads.txt
          92 kB
        2. bt_full.txt
          8 kB
        3. consoleText.txt
          1.96 MB
        4. info_threads.txt
          4 kB
        5. test_2_sync_writes.zip
          64.43 MB
        6. test_2.zip
          84.66 MB

            People

              Balakumaran.Gopal Balakumaran Gopal