Couchbase Server / MB-44098

[Collections] : decodeManifest: duplicate collection:0xce in stored data ------ collection CRUD + multi node hard failover + full recovery + rebalance


Details

    • Untriaged
    • Centos 64-bit
    • 1
    • Yes
    • KV-Engine 2021-Feb

    Description

      Script to Repro

      guides/gradlew --refresh-dependencies testrunner -P jython=/opt/jython/bin/jython -P 'args=-i /tmp/win10-bucket-ops.ini rerun=False,quota_percent=95,crash_warning=True -t bucket_collections.collections_rebalance.CollectionsRebalance.test_data_load_collections_with_hard_failover_recovery,data_load_stage=during,quota_percent=80,nodes_failover=2,recovery_type=full,rerun=False,nodes_init=5,bucket_spec=multi_bucket.buckets_for_rebalance_tests_more_collections,data_load_spec=volume_test_load_with_CRUD_on_collections'
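      To make the test parameters in the command above easier to read, here is a minimal sketch that splits the -t argument into the test name and its key=value parameters. The helper name parse_test_spec is hypothetical and is only a reading aid; it is not how testrunner itself parses its arguments.

```python
def parse_test_spec(spec):
    """Split 'module.Class.test,key=value,...' into (test_name, params)."""
    name, *pairs = spec.split(",")
    params = {}
    for pair in pairs:
        # Each remaining element is expected to look like key=value.
        key, _, value = pair.partition("=")
        params[key] = value
    return name, params


# The -t argument copied from the repro command above.
spec = (
    "bucket_collections.collections_rebalance.CollectionsRebalance."
    "test_data_load_collections_with_hard_failover_recovery,"
    "data_load_stage=during,quota_percent=80,nodes_failover=2,"
    "recovery_type=full,rerun=False,nodes_init=5,"
    "bucket_spec=multi_bucket.buckets_for_rebalance_tests_more_collections,"
    "data_load_spec=volume_test_load_with_CRUD_on_collections"
)

name, params = parse_test_spec(spec)
print(name)
for key, value in params.items():
    print(f"  {key} = {value}")
```

      Read this way, the run starts from 5 nodes (nodes_init=5), fails over 2 of them (nodes_failover=2), uses full recovery (recovery_type=full), and loads data during the operation (data_load_stage=during).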
      

      Steps to Repro
      1) Create a 5 node cluster
      2021-02-03 20:25:50,934 | test | INFO | pool-1-thread-6 | [table_view:display:72] Rebalance Overview
      ------------------------------------------
      Nodes           Services  Status
      ------------------------------------------
      172.23.98.196   kv        Cluster node
      172.23.98.195   None      <--- IN ---
      172.23.121.10   None      <--- IN ---
      172.23.104.186  None      <--- IN ---
      172.23.120.206  None      <--- IN ---
      ------------------------------------------

      2) Create buckets/scopes/collections/data
      ---------------------------------------------------------------------------------------
      Bucket   Type       Replicas  Durability  TTL  Items   RAM Quota   RAM Used   Disk Used
      ---------------------------------------------------------------------------------------
      bucket1  couchbase  3         none        0    3000    1048576000  218057960  322905355
      bucket2  ephemeral  3         none        0    3000    1048576000  329473336  170
      default  couchbase  3         none        0    250000  5242880000  470890808  460819585
      ---------------------------------------------------------------------------------------

      3) Hard failover 2 nodes.

      2021-02-03 20:29:36,849 | test  | INFO    | MainThread | [collections_rebalance:rebalance_operation:600] failing over nodes [ip:172.23.104.186 port:8091 ssh_username:root, ip:172.23.120.206 port:8091 ssh_username:root]
      2021-02-03 20:29:50,240 | test  | INFO    | pool-1-thread-23 | [rest_client:monitorRebalance:1438] Rebalance done. Taken 8.05900001526 seconds to complete
      2021-02-03 20:29:50,243 | test  | INFO    | pool-1-thread-23 | [common_lib:sleep:22] Sleep 8.05900001526 seconds. Reason: Wait after rebalance complete
      2021-02-03 20:31:58,346 | test  | INFO    | MainThread | [collections_rebalance:wait_for_failover_or_assert:224] 1 nodes failed over as expected in 0.0409998893738 seconds
      2021-02-03 20:32:10,351 | test  | INFO    | pool-1-thread-8 | [rest_client:monitorRebalance:1438] Rebalance done. Taken 8.07899999619 seconds to complete
      2021-02-03 20:32:10,355 | test  | INFO    | pool-1-thread-8 | [common_lib:sleep:22] Sleep 8.07899999619 seconds. Reason: Wait after rebalance complete
      2021-02-03 20:34:18,476 | test  | INFO    | MainThread | [collections_rebalance:wait_for_failover_or_assert:224] 2 nodes failed over as expected in 0.0379998683929 seconds
      

      4) Do full recovery and rebalance

      2021-02-03 20:34:44,459 | test  | WARNING | MainThread | [rest_client:get_nodes:1696] 172.23.104.186 - Node not part of cluster inactiveFailed
      2021-02-03 20:34:44,459 | test  | WARNING | MainThread | [rest_client:get_nodes:1696] 172.23.120.206 - Node not part of cluster inactiveFailed
      

      Rebalance fails and we see a lot of minidumps. The one of interest is shown below.
      Output of grep CRITICAL on 172.23.121.10:

      memcached.log.000015.txt:2021-02-03T20:36:33.577265-08:00 CRITICAL Caught unhandled std::exception-derived exception. what(): decodeManifest: duplicate collection:0xce in stored data
      memcached.log.000015.txt:2021-02-03T20:36:33.577964-08:00 CRITICAL *** Fatal error encountered during exception handling ***
      memcached.log.000015.txt:2021-02-03T20:36:33.602241-08:00 CRITICAL *** Fatal error encountered during exception handling ***
      memcached.log.000015.txt:2021-02-03T20:36:33.618348-08:00 CRITICAL *** Fatal error encountered during exception handling ***
      memcached.log.000015.txt:2021-02-03T20:36:33.985955-08:00 CRITICAL Breakpad caught a crash (Couchbase version 7.0.0-4374). Writing crash dump to /opt/couchbase/var/lib/couchbase/crash/92e07269-f018-44cc-04ca8aa4-cdf08df0.dmp before terminating.
      memcached.log.000015.txt:2021-02-03T20:36:33.985973-08:00 CRITICAL Stack backtrace of crashed thread:
      memcached.log.000015.txt:2021-02-03T20:36:33.986765-08:00 CRITICAL     /opt/couchbase/bin/memcached() [0x400000+0x145bbd]
      memcached.log.000015.txt:2021-02-03T20:36:33.986791-08:00 CRITICAL     /opt/couchbase/bin/memcached(_ZN15google_breakpad16ExceptionHandler12GenerateDumpEPNS0_12CrashContextE+0x3ea) [0x400000+0x15b3fa]
      memcached.log.000015.txt:2021-02-03T20:36:33.986814-08:00 CRITICAL     /opt/couchbase/bin/memcached(_ZN15google_breakpad16ExceptionHandler13SignalHandlerEiP9siginfo_tPv+0xb8) [0x400000+0x15b738]
      memcached.log.000015.txt:2021-02-03T20:36:33.986830-08:00 CRITICAL     /lib64/libpthread.so.0() [0x7f23971e1000+0xf630]
      memcached.log.000015.txt:2021-02-03T20:36:33.986873-08:00 CRITICAL     /lib64/libc.so.6(gsignal+0x37) [0x7f2396e13000+0x36387]
      memcached.log.000015.txt:2021-02-03T20:36:33.986915-08:00 CRITICAL     /lib64/libc.so.6(abort+0x148) [0x7f2396e13000+0x37a78]
      memcached.log.000015.txt:2021-02-03T20:36:33.986972-08:00 CRITICAL     /opt/couchbase/bin/../lib/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x125) [0x7f2397916000+0x91195]
      memcached.log.000015.txt:2021-02-03T20:36:33.986995-08:00 CRITICAL     /opt/couchbase/bin/memcached() [0x400000+0x155632]
      memcached.log.000015.txt:2021-02-03T20:36:33.987042-08:00 CRITICAL     /opt/couchbase/bin/../lib/libstdc++.so.6() [0x7f2397916000+0x8ef86]
      memcached.log.000015.txt:2021-02-03T20:36:33.987088-08:00 CRITICAL     /opt/couchbase/bin/../lib/libstdc++.so.6() [0x7f2397916000+0x8efd1]
      memcached.log.000015.txt:2021-02-03T20:36:33.987112-08:00 CRITICAL     /opt/couchbase/bin/../lib/libep.so() [0x7f239b247000+0x16f2f3]
      memcached.log.000015.txt:2021-02-03T20:36:33.987133-08:00 CRITICAL     /opt/couchbase/bin/../lib/libep.so() [0x7f239b247000+0x169352]
      memcached.log.000015.txt:2021-02-03T20:36:33.987157-08:00 CRITICAL     /opt/couchbase/bin/../lib/libep.so() [0x7f239b247000+0x2e9bd6]
      memcached.log.000015.txt:2021-02-03T20:36:33.987186-08:00 CRITICAL     /opt/couchbase/bin/../lib/libep.so() [0x7f239b247000+0x2d20ca]
      memcached.log.000015.txt:2021-02-03T20:36:33.987210-08:00 CRITICAL     /opt/couchbase/bin/../lib/libep.so() [0x7f239b247000+0x2eccf9]
      memcached.log.000015.txt:2021-02-03T20:36:33.987231-08:00 CRITICAL     /opt/couchbase/bin/../lib/libep.so() [0x7f239b247000+0x167793]
      memcached.log.000015.txt:2021-02-03T20:36:33.987297-08:00 CRITICAL     /opt/couchbase/bin/../lib/libstdc++.so.6() [0x7f2397916000+0xb9dcf]
      memcached.log.000015.txt:2021-02-03T20:36:33.987311-08:00 CRITICAL     /lib64/libpthread.so.0() [0x7f23971e1000+0x7ea5]
      memcached.log.000015.txt:2021-02-03T20:36:33.987365-08:00 CRITICAL     /lib64/libc.so.6(clone+0x6d) [0x7f2396e13000+0xfe8dd]
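      The grep step above can be repeated across nodes with a small script. The following is a minimal sketch, assuming the log lines have the same "<file>:<timestamp> CRITICAL <message>" shape as the output shown; the function names critical_lines and exception_messages are hypothetical and not part of any Couchbase tool.

```python
import re

# Matches "<file>:<ISO timestamp> CRITICAL <message>"; the "<file>:" prefix
# (added by grep when searching several files) is optional.
LINE_RE = re.compile(
    r"^(?:(?P<file>[^:\s]+):)?"
    r"(?P<ts>\d{4}-\d{2}-\d{2}T[\d:.]+[+-]\d{2}:\d{2}) "
    r"CRITICAL (?P<msg>.*)$"
)


def critical_lines(lines):
    """Yield (file, timestamp, message) for every CRITICAL log line."""
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            yield match.group("file"), match.group("ts"), match.group("msg").strip()


def exception_messages(lines):
    """Return the text following 'what():' in CRITICAL lines that carry it."""
    marker = "what():"
    return [
        msg.split(marker, 1)[1].strip()
        for _, _, msg in critical_lines(lines)
        if marker in msg
    ]


# Two lines copied from the grep output above.
sample = [
    "memcached.log.000015.txt:2021-02-03T20:36:33.577265-08:00 CRITICAL "
    "Caught unhandled std::exception-derived exception. what(): "
    "decodeManifest: duplicate collection:0xce in stored data",
    "memcached.log.000015.txt:2021-02-03T20:36:33.577964-08:00 CRITICAL "
    "*** Fatal error encountered during exception handling ***",
]

print(exception_messages(sample))
```

      To scan a real log, pass an open file object instead of the sample list, for example exception_messages(open(path)) where path points at a memcached log file.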
      

      cbcollect_info is attached. This was not seen on 7.0.0-4342.
      This bug could be related to MB-44097, as it is the same test and the same minidumps as in MB-44097 are also seen here.

      Attachments

        1. bt_full.txt
          8 kB
        2. consoleText.txt
          370 kB
        3. info_threads.txt
          4 kB
        4. thread_apply_all_bt.txt
          94 kB
        5. vb12_open.json
          46 kB

            People

              Balakumaran.Gopal Balakumaran Gopal
              Balakumaran.Gopal Balakumaran Gopal