Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-42016

[Collections] - minidumps seen during Multi node graceful failover + rebalance out + collection CRUD

    XMLWordPrintable

Details

    • Untriaged
    • Centos 64-bit
    • 1
    • Yes

    Description

      Script to Repro

      ./testrunner -i /tmp/win10-bucket-ops.ini rerun=False,quota_percent=95,crash_warning=True -t bucket_collections.collections_rebalance.CollectionsRebalance.test_data_load_collections_with_graceful_failover_rebalance_out,nodes_init=5,nodes_failover=2,bucket_spec=multi_bucket.buckets_for_rebalance_tests_more_collections,data_load_spec=volume_test_load_with_CRUD_on_collections,data_load_stage=during,quota_percent=80,GROUP=failover_with_collection_crud
      

      Steps to Repro
      1. Create 5 node cluster
      2020-10-13 00:00:37,207 | test | INFO | pool-3-thread-6 | [table_view:display:72] Rebalance Overview
      ------------------------------------

      Nodes Services Status

      ------------------------------------

      172.23.98.196 kv Cluster node
      172.23.98.195 None <--- IN —
      172.23.120.206 None <--- IN —
      172.23.104.186 None <--- IN —
      172.23.121.10 None <--- IN —

      ------------------------------------

      2. Create buckets/scope/collections/data
      2020-10-13 00:12:21,321 | test | INFO | MainThread | [table_view:display:72] Bucket statistics
      --------------------------------------------------------------------------

      Bucket Type Replicas Durability TTL Items RAM Quota RAM Used Disk Used

      --------------------------------------------------------------------------

      bucket1 couchbase 3 none 0 3000 1048576000 253857232 359987046
      bucket2 ephemeral 3 none 0 3000 1048576000 363499152 170
      default couchbase 3 none 0 500000 10485760000 718104144 583601498

      --------------------------------------------------------------------------

      3. Start data load again with CRUD on collections

      4. Do a graceful failover of 2 nodes(172.23.104.186 and 172.23.121.10)

      2020-10-13 00:12:27,401 | test  | INFO    | MainThread | [collections_rebalance:rebalance_operation:157] Starting rebalance operation of type : graceful_failover_rebalance_out
      2020-10-13 00:14:27,693 | test  | INFO    | MainThread | [collections_rebalance:wait_for_failover_or_assert:129] 1 nodes failed over as expected in 0.0490000247955 seconds
      2020-10-13 00:16:48,086 | test  | INFO    | MainThread | [collections_rebalance:wait_for_failover_or_assert:129] 2 nodes failed over as expected in 20.1369998455 seconds
      

      5. Start rebalance out of those 2 nodes. Rebalance fails and minidumps are seen on 172.23.120.206.

      2020-10-13 00:19:09,019 | test  | ERROR   | pool-3-thread-24 | [rest_client:_rebalance_status_and_progress:1479] {u'errorMessage': u'Rebalance failed. See logs for detailed reason. You can try again.', u'status': u'none'} - rebalance failed
      

      grep CRITICAL on 172.23.120.206

      [root@localhost logs]# grep CRITICAL memcached.log.0000*
      memcached.log.000041.txt:2020-10-13T00:19:04.665334-07:00 CRITICAL *** Fatal error encountered during exception handling ***
      memcached.log.000041.txt:2020-10-13T00:19:04.665384-07:00 CRITICAL Caught unhandled std::exception-derived exception. what(): void Collections::VB::Manifest::DroppedCollections::remove(CollectionID, uint64_t) The collection cannot be found collection:0xcd seqno:1092
      memcached.log.000041.txt:2020-10-13T00:19:04.854355-07:00 CRITICAL Breakpad caught a crash (Couchbase version 7.0.0-3383). Writing crash dump to /opt/couchbase/var/lib/couchbase/crash/761d0348-ae89-43ef-b46ea584-ca2d7f1e.dmp before terminating.
      memcached.log.000041.txt:2020-10-13T00:19:04.854370-07:00 CRITICAL Stack backtrace of crashed thread:
      memcached.log.000041.txt:2020-10-13T00:19:04.854513-07:00 CRITICAL     /opt/couchbase/bin/memcached() [0x400000+0x19741d]
      memcached.log.000041.txt:2020-10-13T00:19:04.854534-07:00 CRITICAL     /opt/couchbase/bin/memcached(_ZN15google_breakpad16ExceptionHandler12GenerateDumpEPNS0_12CrashContextE+0x3ea) [0x400000+0x1ac3aa]
      memcached.log.000041.txt:2020-10-13T00:19:04.854544-07:00 CRITICAL     /opt/couchbase/bin/memcached(_ZN15google_breakpad16ExceptionHandler13SignalHandlerEiP9siginfo_tPv+0xb8) [0x400000+0x1ac6e8]
      memcached.log.000041.txt:2020-10-13T00:19:04.854550-07:00 CRITICAL     /lib64/libpthread.so.0() [0x7f39fb9f4000+0xf630]
      memcached.log.000041.txt:2020-10-13T00:19:04.854569-07:00 CRITICAL     /lib64/libc.so.6(gsignal+0x37) [0x7f39fb626000+0x36387]
      memcached.log.000041.txt:2020-10-13T00:19:04.854585-07:00 CRITICAL     /lib64/libc.so.6(abort+0x148) [0x7f39fb626000+0x37a78]
      memcached.log.000041.txt:2020-10-13T00:19:04.854618-07:00 CRITICAL     /opt/couchbase/bin/../lib/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x125) [0x7f39fc129000+0x91195]
      memcached.log.000041.txt:2020-10-13T00:19:04.854635-07:00 CRITICAL     /opt/couchbase/bin/memcached() [0x400000+0x1a6d72]
      memcached.log.000041.txt:2020-10-13T00:19:04.854651-07:00 CRITICAL     /opt/couchbase/bin/../lib/libstdc++.so.6() [0x7f39fc129000+0x8ef86]
      memcached.log.000041.txt:2020-10-13T00:19:04.854670-07:00 CRITICAL     /opt/couchbase/bin/../lib/libstdc++.so.6() [0x7f39fc129000+0x8efd1]
      memcached.log.000041.txt:2020-10-13T00:19:04.854689-07:00 CRITICAL     /opt/couchbase/bin/../lib/libstdc++.so.6() [0x7f39fc129000+0x8f213]
      memcached.log.000041.txt:2020-10-13T00:19:04.854699-07:00 CRITICAL     /opt/couchbase/bin/../lib/libep.so() [0x7f39ff711000+0x2a29cc]
      memcached.log.000041.txt:2020-10-13T00:19:04.854706-07:00 CRITICAL     /opt/couchbase/bin/../lib/libep.so() [0x7f39ff711000+0x2a53cf]
      memcached.log.000041.txt:2020-10-13T00:19:04.854711-07:00 CRITICAL     /opt/couchbase/bin/../lib/libep.so() [0x7f39ff711000+0x275003]
      memcached.log.000041.txt:2020-10-13T00:19:04.854889-07:00 CRITICAL     /opt/couchbase/bin/../lib/libep.so() [0x7f39ff711000+0x122978]
      memcached.log.000041.txt:2020-10-13T00:19:04.854896-07:00 CRITICAL     /opt/couchbase/bin/../lib/libep.so() [0x7f39ff711000+0x1264fd]
      memcached.log.000041.txt:2020-10-13T00:19:04.854903-07:00 CRITICAL     /opt/couchbase/bin/../lib/libep.so() [0x7f39ff711000+0x1806bc]
      memcached.log.000041.txt:2020-10-13T00:19:04.854908-07:00 CRITICAL     /opt/couchbase/bin/../lib/libep.so() [0x7f39ff711000+0x181899]
      memcached.log.000041.txt:2020-10-13T00:19:04.854913-07:00 CRITICAL     /opt/couchbase/bin/../lib/libep.so() [0x7f39ff711000+0x1845c3]
      memcached.log.000041.txt:2020-10-13T00:19:04.854918-07:00 CRITICAL     /opt/couchbase/bin/../lib/libep.so() [0x7f39ff711000+0x861bf]
      memcached.log.000041.txt:2020-10-13T00:19:04.854924-07:00 CRITICAL     /opt/couchbase/bin/../lib/libplatform_so.so.0.1.0() [0x7f39fe1ac000+0x10947]
      memcached.log.000041.txt:2020-10-13T00:19:04.854930-07:00 CRITICAL     /lib64/libpthread.so.0() [0x7f39fb9f4000+0x7ea5]
      memcached.log.000041.txt:2020-10-13T00:19:04.854957-07:00 CRITICAL     /lib64/libc.so.6(clone+0x6d) [0x7f39fb626000+0xfe8dd]
      

      bt full of 761d0348-ae89-43ef-b46ea584-ca2d7f1e.dmp on 172.23.120.206
      See bt_full_multi_node_graceful_rebalance_out.txt

      cbcollect_info attached. This test passed on 7.0.0-3342.

      Attachments

        No reviews matched the request. Check your Options in the drop-down menu of this sections header.

        Activity

          jwalker Jim Walker added a comment - - edited

          Another oddity, the function which throws has two exceptions with different messages.

          memcached.log has the first message "The collection cannot be found collection..." - however the gdb full backtrace bt_full_multi_node_graceful_rebalance_out.txt says the throw occured at line 1028, which is the second throw, which has the message "The collection seqno cannot be found collection..."

          void Manifest::DroppedCollections::remove(CollectionID cid, uint64_t seqno) {
              auto dropItr = droppedCollections.find(cid);
              if (dropItr == droppedCollections.end()) {
                  // bad time - every call of get should be from a flush which
                  // had items we created for a collection which is open or was open.
                  // to not find the collection at all means we have flushed an item
                  // to a collection that never existed.
                  throw std::logic_error(std::string(__PRETTY_FUNCTION__) +
                                         " The collection cannot be found collection:" +
                                         cid.to_string() +
                                         " seqno:" + std::to_string(seqno));
              }
           
              auto& [key, vector] = *dropItr;
              (void)key;
              bool erased = false;
              for (auto itr = vector.begin(); itr != vector.end(); itr++) {
                  if (seqno == itr->end) {
                      vector.erase(itr);
                      erased = true;
                      break;
                  }
              }
           
              if (erased) {
                  // clear the map?
                  if (vector.empty()) {
                      droppedCollections.erase(dropItr);
                  }
                  return;
              }
           
              throw std::logic_error(std::string(__PRETTY_FUNCTION__) +
                                     " The collection seqno cannot be found collection:" +
                                     cid.to_string() + " seqno:" + std::to_string(seqno));
          }
          

          jwalker Jim Walker added a comment - - edited Another oddity, the function which throws has two exceptions with different messages. memcached.log has the first message "The collection cannot be found collection..." - however the gdb full backtrace bt_full_multi_node_graceful_rebalance_out.txt says the throw occured at line 1028, which is the second throw, which has the message "The collection seqno cannot be found collection..." void Manifest::DroppedCollections::remove(CollectionID cid, uint64_t seqno) { auto dropItr = droppedCollections.find(cid); if (dropItr == droppedCollections.end()) { // bad time - every call of get should be from a flush which // had items we created for a collection which is open or was open. // to not find the collection at all means we have flushed an item // to a collection that never existed. throw std::logic_error(std::string(__PRETTY_FUNCTION__) + " The collection cannot be found collection:" + cid.to_string() + " seqno:" + std::to_string(seqno)); }   auto& [key, vector] = *dropItr; (void)key; bool erased = false; for (auto itr = vector.begin(); itr != vector.end(); itr++) { if (seqno == itr->end) { vector.erase(itr); erased = true; break; } }   if (erased) { // clear the map? if (vector.empty()) { droppedCollections.erase(dropItr); } return; }   throw std::logic_error(std::string(__PRETTY_FUNCTION__) + " The collection seqno cannot be found collection:" + cid.to_string() + " seqno:" + std::to_string(seqno)); }

          Build couchbase-server-7.0.0-3407 contains kv_engine commit a029b1e with commit message:
          MB-42016: Increase detail in DroppedCollections exceptions

          build-team Couchbase Build Team added a comment - Build couchbase-server-7.0.0-3407 contains kv_engine commit a029b1e with commit message: MB-42016 : Increase detail in DroppedCollections exceptions
          jwalker Jim Walker added a comment -

          Balakumaran Gopal Can you re-run on couchbase-server-7.0.0-3407 to see if we can get a reproduction?

          jwalker Jim Walker added a comment - Balakumaran Gopal Can you re-run on couchbase-server-7.0.0-3407 to see if we can get a reproduction?

          Build couchbase-server-7.0.0-3458 contains kv_engine commit c926b49 with commit message:
          MB-42016: Prevent flusher from missing insert into droppedCollections

          build-team Couchbase Build Team added a comment - Build couchbase-server-7.0.0-3458 contains kv_engine commit c926b49 with commit message: MB-42016 : Prevent flusher from missing insert into droppedCollections

          Validated this test on a loop 15 times.

          Logs :- http://qa.sc.couchbase.com/job/temp_rebalance_even/2859/
          All of them passed. Marking this closed.

          Balakumaran.Gopal Balakumaran Gopal added a comment - Validated this test on a loop 15 times. Logs :- http://qa.sc.couchbase.com/job/temp_rebalance_even/2859/ All of them passed. Marking this closed.

          People

            Balakumaran.Gopal Balakumaran Gopal
            Balakumaran.Gopal Balakumaran Gopal
            Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Gerrit Reviews

                There are no open Gerrit changes

                PagerDuty