Couchbase Server / MB-48596

Runtime error thrown in magma::BasicFile::Sync (magma/util/file/file_impl_linux.cc:202) on Jepsen disk failure workload


Details

    • Bug
    • Resolution: Fixed
    • Major
    • 7.1.0
    • 7.1.0
    • storage-engine
    • Couchbase Server 7.1.0-1317 (EE)
    • Untriaged
    • 1
    • Unknown

    Description

      The disk failure workload in Jepsen results in a runtime error being thrown in magma::BasicFile::Sync (magma/util/file/file_impl_linux.cc:202), which in turn causes KvEngine to crash and restart.

      Note that there is no consistency error: the history is consistent and the test passes from Jepsen's perspective.

      Repeating the workload using 'Couchstore' as the backend results in no crash.

      Steps to reproduce

      lein run test --nodes-file ./nodes2 --username root --ssh-private-key=./resources/my.key --workload=disk-failure --manipulate-disks --node-count=3 --replicas=2 --no-autofailover --disrupt-count=2 --kv-timeout=1.5 --durability=0:0:0:100 --storage-backend=magma

      Based on the logs, it appears that there is no write access to the file system.

      Cluster configuration: 3 KV nodes with the storage backend set to magma.

      How does the test fail the disk?

      Before the test begins, a device-mapper device is created on each node and mounted at /opt/couchbase/var/lib/data (referred to as data-path).

      bash

      dd if=/dev/zero of=/tmp/cbdata.img bs=1M count=512
      losetup /dev/loop0 /tmp/cbdata.img
      dmsetup create cbdata --table '0 1048576 linear /dev/loop0 0'
      mkfs.ext4 /dev/mapper/cbdata
      mkdir -p data-path
      mount -o noatime /dev/mapper/cbdata data-path
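
The sector count in the device-mapper table above follows from the image size: device-mapper tables are expressed in 512-byte sectors, so a 512 MiB image spans 1048576 sectors. A minimal sketch of that arithmetic, assuming hypothetical helper names (`sectors_for_mib`, `linear_table`) that are not part of the test harness:

```bash
#!/usr/bin/env bash
# Hypothetical helper: number of 512-byte sectors in an image of N mebibytes.
sectors_for_mib() {
    local size_mib=$1
    echo $(( size_mib * 1024 * 1024 / 512 ))
}

# Hypothetical helper: table line for a linear mapping over the whole backing device.
linear_table() {
    local size_mib=$1
    local backing_dev=$2
    echo "0 $(sectors_for_mib "$size_mib") linear $backing_dev 0"
}

# For the 512 MiB image created with dd above:
linear_table 512 /dev/loop0
```

The printed line is the string that would be passed to `dmsetup create cbdata --table`, quoted as a single argument.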
      

      During the test a disk failure is introduced by:

      bash

      dmsetup wipe_table cbdata --noflush --nolockfs
      

      This command makes I/O sent to disk fail:

      man dmsetup

      wipe_table device_name...  [-f|--force] [--noflush] [--nolockfs]
                    Wait for any I/O in-flight through the device to complete, then replace the table with a new table that fails any new I/O sent to the device.  If successful, this should release any devices held open by the device's table(s).
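
One way to confirm that the fault is in effect is to attempt a small write and force it to disk. The following is only a sketch: `can_sync_write` is a hypothetical helper, not part of the Jepsen test, and it assumes a `dd` that supports `conv=fsync` (as GNU coreutils does).

```bash
#!/usr/bin/env bash
# Hypothetical check: returns 0 if a small write to DIR can be flushed to disk.
can_sync_write() {
    local dir=$1
    local probe="$dir/.sync_probe.$$"
    # conv=fsync makes dd call fsync() on the output file before it exits,
    # so a failure to persist the data shows up as a non-zero exit status.
    if dd if=/dev/zero of="$probe" bs=4096 count=1 conv=fsync 2>/dev/null; then
        rm -f "$probe"
        return 0
    fi
    rm -f "$probe" 2>/dev/null
    return 1
}

# Usage: pass the data path as the first argument.
if [ "$#" -gt 0 ]; then
    if can_sync_write "$1"; then
        echo "synced writes succeed on $1"
    else
        echo "synced writes fail on $1"
    fi
fi
```

After `wipe_table`, a probe against data-path would be expected to fail, which matches the point in the backtrace where Magma fails while syncing its WAL.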
      

      To prevent the kernel's in-memory caching mechanisms from interfering, the caches are dropped:

      bash

      echo 3 > /proc/sys/vm/drop_caches
      

      What is the problem?

      Magma throws a std::runtime_error from magma::BasicFile::Sync that nothing catches, so std::terminate aborts the process and KvEngine crashes.

      What's the expected behaviour?

      An inability to write to disk should be handled gracefully, without a crash, consistent with the behaviour of the Couchstore backend.

      Logs
      20210924T163833.zip

      Also attached are gdb backtraces produced by the cbanalyze-core tool after installing the dbginfo package (so they should contain symbols).
      782420fc.core.log
      3847df73.core.log

      Backtrace

      10.112.210.101:782420fc.core.log

      Thread 1 (LWP 26438):
      #0  0x00007f34428e25c9 in raise () from /lib64/libc.so.6
      #1  0x00007f34428e3cd8 in abort () from /lib64/libc.so.6
      #2  0x00007f344322163c in __gnu_cxx::__verbose_terminate_handler () at /tmp/deploy/objdir/../gcc-10.2.0/libstdc++-v3/libsupc++/vterminate.cc:95
      #3  0x0000000000aa9bcb in backtrace_terminate_handler() () at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/utilities/terminate_handler.cc:88
      #4  0x00007f344322c8f6 in __cxxabiv1::__terminate(void (*)()) () at /tmp/deploy/objdir/../gcc-10.2.0/libstdc++-v3/libsupc++/eh_terminate.cc:48
      #5  0x00007f344322c961 in std::terminate () at /tmp/deploy/objdir/../gcc-10.2.0/libstdc++-v3/libsupc++/eh_terminate.cc:58
      #6  0x00007f344322cbf4 in __cxxabiv1::__cxa_throw (obj=obj@entry=0x7f3404000940, tinfo=0xfc9360 <typeinfo for std::runtime_error>, dest=0x442e60 <_ZNSt13runtime_errorD1Ev@plt>) at /tmp/deploy/objdir/../gcc-10.2.0/libstdc++-v3/libsupc++/eh_throw.cc:95
      #7  0x00000000004fa3d3 in magma::BasicFile::Sync (this=0x7f341823ef60) at /home/couchbase/jenkins/workspace/couchbase-server-unix/magma/util/file/file_impl_linux.cc:202
      #8  0x00000000009a6af4 in magma::FileWithStats::Sync() () at /home/couchbase/jenkins/workspace/couchbase-server-unix/magma/util/file/file_impl_stats.cc:63
      #9  0x00000000009210bb in magma::WAL::sync() () at /home/couchbase/jenkins/workspace/couchbase-server-unix/magma/wal/wal.cc:581
      #10 0x0000000000922f6b in magma::WAL::Sync() () at /home/couchbase/jenkins/workspace/couchbase-server-unix/magma/wal/wal.cc:539
      #11 0x000000000090e05b in magma::KVStore::WriteDocs(magma::WAL*, std::vector<magma::Magma::WriteOperation, std::allocator<magma::Magma::WriteOperation> > const&, std::function<void (magma::Magma::WriteOperation const&, bool, magma::Slice)>, std::function<magma::Status (std::vector<magma::Magma::WriteOperation, std::allocator<magma::Magma::WriteOperation> >&)>) () at /home/couchbase/jenkins/workspace/couchbase-server-unix/magma/magma/kvstore/write.cc:172
      #12 0x00000000008fe535 in magma::Magma::Impl::WriteDocs(unsigned short, std::vector<magma::Magma::WriteOperation, std::allocator<magma::Magma::WriteOperation> > const&, unsigned int, std::function<void (magma::Magma::WriteOperation const&, bool, magma::Slice)>, std::function<magma::Status (std::vector<magma::Magma::WriteOperation, std::allocator<magma::Magma::WriteOperation> >&)>) () at /opt/gcc-10.2.0/include/c++/10.2.0/bits/std_function.h:248
      #13 0x00000000008fe6e2 in magma::Magma::WriteDocs(unsigned short, std::vector<magma::Magma::WriteOperation, std::allocator<magma::Magma::WriteOperation> > const&, unsigned int, std::function<void (magma::Magma::WriteOperation const&, bool, magma::Slice)>, std::function<magma::Status (std::vector<magma::Magma::WriteOperation, std::allocator<magma::Magma::WriteOperation> >&)>) () at /opt/gcc-10.2.0/include/c++/10.2.0/bits/std_function.h:248
      #14 0x000000000085d11f in MagmaMemoryTrackingProxy::WriteDocs(unsigned short, std::vector<magma::Magma::WriteOperation, std::allocator<magma::Magma::WriteOperation> > const&, unsigned int, std::function<void (magma::Magma::WriteOperation const&, bool, magma::Slice)>, std::function<magma::Status (std::vector<magma::Magma::WriteOperation, std::allocator<magma::Magma::WriteOperation> >&)>) () at /opt/gcc-10.2.0/include/c++/10.2.0/bits/std_function.h:248
      #15 0x00000000008475cc in MagmaKVStore::saveDocs(MagmaKVStoreTransactionContext&, VB::Commit&, kvstats_ctx&) () at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/include/memcached/vbucket.h:62
      #16 0x000000000084043f in MagmaKVStore::commit(std::unique_ptr<TransactionContext, std::default_delete<TransactionContext> >, VB::Commit&) () at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/kvstore/magma-kvstore/magma-kvstore.cc:637
      #17 0x00000000007ea105 in EPBucket::commit(KVStoreIface&, std::unique_ptr<TransactionContext, std::default_delete<TransactionContext> >, VB::Commit&) () at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/ep_bucket.cc:929
      #18 0x00000000007f1105 in EPBucket::flushVBucket_UNLOCKED(LockedVBucketPtr) () at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/ep_bucket.cc:801
      #19 0x00000000007f14ff in EPBucket::flushVBucket(Vbid) () at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/ep_bucket.cc:378
      #20 0x00000000006cba20 in Flusher::flushVB (this=0x7f342c285c00) at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/flusher.cc:285
      #21 0x00000000006cc370 in Flusher::step(GlobalTask*) () at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/flusher.cc:200
      #22 0x0000000000a1c222 in GlobalTask::execute() () at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/executor/globaltask.cc:68
      #23 0x0000000000a19345 in FollyExecutorPool::TaskProxy::scheduleViaCPUPool()::{lambda()#2}::operator()() const (__closure=0x7f341b7ec840) at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/executor/folly_executorpool.cc:189
      #24 0x0000000000b63c30 in operator() (this=0x7f341b7ec840) at /home/couchbase/jenkins/workspace/cbdeps-platform-build-old/deps/packages/build/folly/folly-prefix/src/folly/folly/Function.h:416
      #25 folly::ThreadPoolExecutor::runTask(std::shared_ptr<folly::ThreadPoolExecutor::Thread> const&, folly::ThreadPoolExecutor::Task&&) (this=0x7f3441653000, thread=..., task=<unknown type in /usr/lib/debug/opt/couchbase/bin/memcached.debug, CU 0x65a4350, DIE 0x65e99cc>)
          at /home/couchbase/jenkins/workspace/cbdeps-platform-build-old/deps/packages/build/folly/folly-prefix/src/folly/folly/executors/ThreadPoolExecutor.cpp:97
      #26 0x0000000000b4b9ea in folly::CPUThreadPoolExecutor::threadRun (this=0x7f3441653000, thread=...) at /home/couchbase/jenkins/workspace/cbdeps-platform-build-old/deps/packages/build/folly/folly-prefix/src/folly/folly/executors/CPUThreadPoolExecutor.cpp:265
      #27 0x0000000000b66be9 in __invoke_impl<void, void (folly::ThreadPoolExecutor::*&)(std::shared_ptr<folly::ThreadPoolExecutor::Thread>), folly::ThreadPoolExecutor*&, std::shared_ptr<folly::ThreadPoolExecutor::Thread>&> (__t=<optimized out>, __f=<optimized out>) at /usr/local/include/c++/7.3.0/bits/invoke.h:73
      #28 __invoke<void (folly::ThreadPoolExecutor::*&)(std::shared_ptr<folly::ThreadPoolExecutor::Thread>), folly::ThreadPoolExecutor*&, std::shared_ptr<folly::ThreadPoolExecutor::Thread>&> (__fn=<optimized out>) at /usr/local/include/c++/7.3.0/bits/invoke.h:95
      #29 __call<void, 0, 1> (__args=<optimized out>, this=<optimized out>) at /usr/local/include/c++/7.3.0/functional:467
      #30 operator()<> (this=<optimized out>) at /usr/local/include/c++/7.3.0/functional:551
      #31 folly::detail::function::FunctionTraits<void ()>::callBig<std::_Bind<void (folly::ThreadPoolExecutor::*(folly::ThreadPoolExecutor*, std::shared_ptr<folly::ThreadPoolExecutor::Thread>))(std::shared_ptr<folly::ThreadPoolExecutor::Thread>)> >(folly::detail::function::Data&) (p=...)
          at /home/couchbase/jenkins/workspace/cbdeps-platform-build-old/deps/packages/build/folly/folly-prefix/src/folly/folly/Function.h:401
      #32 0x0000000000a18fd4 in operator() (this=0x7f34425ce380) at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/executor/folly_executorpool.cc:47
      #33 operator() (__closure=0x7f34425ce380) at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/executor/folly_executorpool.cc:47
      #34 folly::detail::function::FunctionTraits<void ()>::callBig<CBRegisteredThreadFactory::newThread(folly::Function<void ()>&&)::{lambda()#1}>(folly::detail::function::Data&) (p=...) at /home/couchbase/jenkins/workspace/couchbase-server-unix/server_build/tlm/deps/folly.exploded/include/folly/Function.h:401
      #35 0x00007f3443255d40 in execute_native_thread_routine () at /tmp/deploy/objdir/../gcc-10.2.0/libstdc++-v3/src/c++11/thread.cc:80
      #36 0x00007f3445079df3 in start_thread () from /lib64/libpthread.so.0
      #37 0x00007f34429a31ad in clone () from /lib64/libc.so.6

      A similar backtrace can be found in 3847df73.core.log for node 10.112.210.103. No crashes were found on 10.112.210.102.
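
To compare the two core logs quickly, the frame that raised the exception can be pulled out of each backtrace: it is the first frame listed after `__cxa_throw`. A minimal sketch using awk, where `throwing_frame` is a hypothetical helper and the input is assumed to be a gdb backtrace in the format shown above:

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the frame that called __cxa_throw in a gdb backtrace.
throwing_frame() {
    # Once a line mentioning __cxa_throw has been seen, the next line that
    # starts a new frame ("#N ...") is its caller. Continuation lines
    # (for example "    at /path/file.cc:97") are skipped.
    awk '/^ *#[0-9]+ / && seen { print; exit } /__cxa_throw/ { seen = 1 }' "$1"
}

# Usage: pass a core log as the first argument.
if [ "$#" -gt 0 ]; then
    throwing_frame "$1"
fi
```

Run against 782420fc.core.log, this should select frame #7 (magma::BasicFile::Sync) from the backtrace above.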
       

      Attachments

        1. 20210924T163833.zip
          29.56 MB
        2. 3847df73.core.log
          110 kB
        3. 782420fc.core.log
          111 kB

          People

            asad.zaidi Asad Zaidi (Inactive)