  1. Couchbase Server
  2. MB-16496

Data race in EPStore::persistVBState() potentially leading to inconsistent on-disk state

Details

    • Bug
    • Resolution: Fixed
    • Major
    • 4.5.0
    • .master, 3.0, 3.0.1, 3.0.2, 3.0.3, 3.1.0, 3.1.1, 4.0.0
    • couchbase-bucket
    • Security Level: Public
    • Untriaged
    • Unknown
    • KV: Oct 4 - Oct 24

    Description

      When running under ThreadSanitizer, it reports a data race between a read of VBucket::purgeSeqno in EPStore::persistVBState() and a write to it in EPStore::compactVBucket(); see the full report below.

      persistVBState is performing a dirty read of purgeSeqno, which I believe could result in an inconsistent vbucket_state object being written to disk. More generally, getState, getMaxCas and getDriftCounter all look to be read dirtily, and may be inconsistent with snapshot_range.

      Extract of code in question (http://src.couchbase.org/source/xref/trunk/ep-engine/src/ep.cc#1199):

      bool EventuallyPersistentStore::persistVBState(const Priority &priority,
                                                     uint16_t vbid) {
          ...
       
          snapshot_range_t range;
          vb->getPersistedSnapshot(range);
          vbucket_state vb_state(vb->getState(), chkId, 0, vb->getHighSeqno(),
                                 vb->getPurgeSeqno(), range.start, range.end,
                                 vb->getMaxCas(), vb->getDriftCounter(),
                                 failovers);
       
          bool inverse = false;
          LockHolder lh(vb_mutexes[vbid], true /*tryLock*/);
          ...
          if (rwUnderlying->snapshotVBucket(vbid, vb_state, &kvcb)) {
          ...
      

      Note we construct vb_state before we acquire the lock on that vBucket.

      I believe this is potentially a data corruption issue: we could write an inconsistent vBucket state to disk, and if we then crashed, the state read back from disk on restart could be incorrect.
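
      To illustrate the ordering problem, here is a minimal sketch of taking the lock before copying the fields, so that the snapshot is built from one consistent view. It is not the ep-engine code or the actual fix: ToyVBucket, ToyVBState, setPurgeSeqno and snapshotState are hypothetical names, std::mutex stands in for vb_mutexes[vbid], and a blocking lock is used where the extract above uses a tryLock. It also assumes the writer holds the same mutex, which the report does not confirm.

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>

// Hypothetical stand-in for the vBucket fields that get copied into the
// persisted state. Not the real VBucket class.
struct ToyVBucket {
    std::mutex mutex;  // plays the role of vb_mutexes[vbid] (assumption)
    uint64_t purgeSeqno = 0;
    uint64_t maxCas = 0;
    uint64_t snapStart = 0;
    uint64_t snapEnd = 0;
};

// Hypothetical stand-in for vbucket_state: a plain copy of the fields.
struct ToyVBState {
    uint64_t purgeSeqno;
    uint64_t maxCas;
    uint64_t snapStart;
    uint64_t snapEnd;
};

// Writer side (the role compaction plays in the report): update the
// field while holding the vBucket's lock.
void setPurgeSeqno(ToyVBucket& vb, uint64_t seqno) {
    std::lock_guard<std::mutex> lh(vb.mutex);
    vb.purgeSeqno = seqno;
}

// Reader side: acquire the lock first, then copy every field, so the
// returned state cannot mix values from before and after a write.
ToyVBState snapshotState(ToyVBucket& vb) {
    std::lock_guard<std::mutex> lh(vb.mutex);
    return ToyVBState{vb.purgeSeqno, vb.maxCas, vb.snapStart, vb.snapEnd};
}
```

      With this ordering, a concurrent setPurgeSeqno() either completes before the snapshot is taken or after it, never in the middle of it. Making the individual fields atomic would silence the per-field race, but would not by itself make the set of fields mutually consistent.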

      ThreadSanitizer output:

      WARNING: ThreadSanitizer: data race (pid=29921)
        Write of size 8 at 0x7d680001f580 by thread T5 (mutexes: write M12734):
          #0 VBucket::setPurgeSeqno() ep-engine/src/vbucket.h:215:9 (ep.so+0x000000086204)
          #1 EventuallyPersistentStore::compactVBucket() ep-engine/src/ep.cc:1584 (ep.so+0x000000086204)
          #2 CompactVBucketTask::run() ep-engine/src/tasks.cc:94:12 (ep.so+0x00000012971e)
          #3 ExecutorThread::run() ep-engine/src/executorthread.cc:115:26 (ep.so+0x0000000ea41d)
          #4 launch_executor_thread() ep-engine/src/executorthread.cc:33:9 (ep.so+0x0000000e9fe5)
          #5 platform_thread_wrap platform/src/cb_pthreads.c:23:5 (libplatform.so.0.1.0+0x000000004161)
       
        Previous read of size 8 at 0x7d680001f580 by thread T7:
          #0 VBucket::getPurgeSeqno() ep-engine/src/vbucket.h:211:16 (ep.so+0x0000000821d3)
          #1 EventuallyPersistentStore::persistVBState() ep-engine/src/ep.cc:1217 (ep.so+0x0000000821d3)
          #2 VBStatePersistTask::run() ep-engine/src/tasks.cc:86:12 (ep.so+0x000000129636)
          #3 ExecutorThread::run() ep-engine/src/executorthread.cc:115:26 (ep.so+0x0000000ea41d)
          #4 launch_executor_thread() ep-engine/src/executorthread.cc:33:9 (ep.so+0x0000000e9fe5)
          #5 platform_thread_wrap platform/src/cb_pthreads.c:23:5 (libplatform.so.0.1.0+0x000000004161)
       
        Location is heap block of size 1392 at 0x7d680001f200 allocated by main thread:
          #0 operator new() <null> (engine_testapp+0x00000045cded)
          #1 EventuallyPersistentStore::setVBucketState() ep-engine/src/ep.cc:1300:30 (ep.so+0x000000082b1a)
          #2 EventuallyPersistentEngine::setVBucketState() ep-engine/src/ep_engine.h:718:16 (ep.so+0x0000000ca308)
          #3 setVBucket()) ep-engine/src/ep_engine.cc:884 (ep.so+0x0000000ca308)
          #4 processUnknownCommand()) ep-engine/src/ep_engine.cc:1178 (ep.so+0x0000000ca308)
          #5 EvpUnknownCommand()) ep-engine/src/ep_engine.cc:1389:38 (ep.so+0x0000000aafc8)
          #6 mock_unknown_command()) memcached/programs/engine_testapp/engine_testapp.cc:380:19 (engine_testapp+0x0000004c56b9)
          #7 set_vbucket_state() ep-engine/tests/ep_test_apis.cc:607:9 (ep_testsuite.so+0x0000000a3a4b)
          #8 test_setup() ep-engine/tests/ep_testsuite_common.cc:146:28 (ep_testsuite.so+0x00000009cdda)
          #9 execute_test() memcached/programs/engine_testapp/engine_testapp.cc:1085:47 (engine_testapp+0x0000004c4103)
          #10 main memcached/programs/engine_testapp/engine_testapp.cc:1439 (engine_testapp+0x0000004c4103)
      

      Steps to reproduce

      1. Build with ThreadSanitizer - see tlm/README.md for details, something like CC=clang-3.6 CXX=clang++-3.6 make EXTRA_CMAKE_OPTIONS="-D CB_THREADSANITIZER=1" -j8
      2. Run ep_testsuite test 341: TSAN_OPTIONS="external_symbolizer_path=/usr/bin/llvm-symbolizer-3.6 suppressions=/home/couchbase/couchbase/tlm/tsan.suppressions second_deadlock_stack=1" "/home/couchbase/couchbase/build/memcached/engine_testapp" "-E" "ep.so" "-T" "ep_testsuite.so" "-v" "-e" "flushall_enabled=true;ht_size=13;ht_locks=7" -C 341

          People

            chiyoung Chiyoung Seo (Inactive)
            drigby Dave Rigby (Inactive)