MB-37063: Replica may fail when receiving multiple consecutive Disk Checkpoints


Details

    • Triage: Triaged
    • Is this a Regression?: No
    • Sprint: KV-Engine Mad-Hatter GA, KV Sprint 2019-12

    Description

      The issue is in PassiveDurabilityMonitor::completeSyncWrite:

      296  void PassiveDurabilityMonitor::completeSyncWrite(
      297          const StoredDocKey& key,
      298          Resolution res,
      299          boost::optional<uint64_t> prepareSeqno) {
      300      auto s = state.wlock();
      301  
      302      // If we are receiving a disk snapshot, we need to relax a few checks
      303      // to account for deduplication. E.g., commits may appear to be out
      304      // of order
      305      bool enforceOrderedCompletion = !vb.isReceivingDiskSnapshot();
      ..
      321      // If we can complete out of order, we have to check from the start of
      322      // tracked writes as the HCS may have advanced past a prepare we have not
      323      // seen a completion for
      324      auto next = enforceOrderedCompletion
      325                          ? s->getIteratorNext(s->highCompletedSeqno.it)
      326                          : s->trackedWrites.begin();
      327  
      328      if (!enforceOrderedCompletion) {
      329          // Advance the iterator to the right item, it might not be the first
      330          while (next != s->trackedWrites.end() && next->getKey() != key) {
      331              next = s->getIteratorNext(next);
      332          }
      333      }
      ..  
      358      if (prepareSeqno && next->getBySeqno() != static_cast<int64_t>(*prepareSeqno)) {
      359          std::stringstream ss;
      360          ss << "Pending resolution for '" << *next
      361             << "', but received unexpected " + to_string(res) + " for key "
      362             << cb::tagUserData(key.to_string())
      363             << " different prepare seqno: " << *prepareSeqno;
      364          throwException<std::logic_error>(__func__, "" + ss.str());
      365      }
      ..
      397  
      398      // HCS may have moved, which could make some Prepare eligible for removal.
      399      s->checkForAndRemovePrepares();
      ..
      410  }
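
      To make the failure mode concrete, below is a minimal, self-contained
      sketch of the scan at lines 328-333. This is a simplified model for
      illustration only: TrackedWrite and the bare std::list are hypothetical
      stand-ins, not the real PDM types.

      #include <cstdint>
      #include <list>
      #include <string>

      // Simplified stand-in for an entry in PDM::State::trackedWrites
      // (hypothetical model, not the real SyncWrite type).
      struct TrackedWrite {
          std::string key;
          int64_t bySeqno;
          bool completed;
      };

      int main() {
          // State from the scenario below: {PRE:1(completed), PRE:3(in-flight)},
          // both prepares for the same key.
          std::list<TrackedWrite> trackedWrites{{"key", 1, true},
                                                {"key", 3, false}};

          const int64_t prepareSeqno = 3; // M:4 is the commit for PRE:3

          // Mirrors the loop at lines 328-333: advance while the key differs.
          // Nothing skips entries that are already completed, so for two
          // prepares on the same key the scan stops at the older, completed one.
          auto next = trackedWrites.begin();
          while (next != trackedWrites.end() && next->key != "key") {
              ++next;
          }

          // next points to PRE:1 (bySeqno == 1), so the check at lines 358-365
          // fails and completeSyncWrite throws.
          return (next->bySeqno != prepareSeqno) ? 1 : 0;
      }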
      

      Scenario example

      Replica receives the following for the same <key>:

      • PRE:1 and M:2 (logically CMT:2) in a Disk Snapshot(1, 2)
      • <The flusher has not persisted anything yet>
      • PRE:3 and M:4 (logically CMT:4) in a second Disk Snapshot(3, 4)

       

      Important Note: when we process M:2 we do not remove PRE:1 from PDM::State::trackedWrites at line 399.
      The reason is that we remove only locally-satisfied prepares, and PRE:1 is not locally-satisfied because the flusher has not yet persisted the entire Disk Snapshot(1, 2); the removal rule is sketched below.
      See comments in PassiveDurabilityMonitor::State::updateHighPreparedSeqno for details.
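
      A hedged sketch of that removal rule, reusing the simplified TrackedWrite
      model from the sketch above (the real logic lives in
      PassiveDurabilityMonitor::State::checkForAndRemovePrepares and is more
      involved): a completed prepare is removed only once it is also
      locally-satisfied, i.e. once its seqno is covered by the
      high-prepared-seqno, and for a Disk Snapshot the HPS can only advance
      once the flusher has persisted the whole snapshot.

      // Sketch only: removes the leading tracked writes that are both
      // completed and locally-satisfied (bySeqno <= highPreparedSeqno).
      void checkForAndRemovePrepares(std::list<TrackedWrite>& trackedWrites,
                                     int64_t highPreparedSeqno) {
          while (!trackedWrites.empty() &&
                 trackedWrites.front().completed &&
                 trackedWrites.front().bySeqno <= highPreparedSeqno) {
              trackedWrites.pop_front();
          }
      }

      // In the scenario: nothing is persisted yet, so highPreparedSeqno is
      // still 0 and PRE:1 (completed, bySeqno 1) is NOT removed.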

       

      Now focus on when we process M:4:

      • prepareSeqno = 3 (as M:4 is the commit for PRE:3)
      • PDM::State::trackedWrites contains {PRE:1(completed), PRE:3(in-flight)}
      • We execute the block at lines 328-333. After the block, next points to PRE:1(completed).                     <— This is the root cause of the issue
      • Given that next->getBySeqno() (1) != prepareSeqno (3), we enter the block at lines 358-365 and throw. A possible fix direction is sketched after this list.
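
      One possible direction for a fix, as a sketch only (reusing the
      simplified TrackedWrite model from the first sketch; the helper name
      findNextForCompletion is hypothetical, and this is not necessarily the
      change that was actually merged for this ticket): also skip
      already-completed entries while advancing the iterator, so that the scan
      lands on PRE:3 rather than stopping at PRE:1.

      // Variant of the scan at lines 328-333 that cannot stop at a
      // completed prepare such as PRE:1.
      std::list<TrackedWrite>::iterator findNextForCompletion(
              std::list<TrackedWrite>& trackedWrites, const std::string& key) {
          auto next = trackedWrites.begin();
          while (next != trackedWrites.end() &&
                 (next->completed || next->key != key)) {
              ++next;
          }
          return next; // PRE:3(in-flight) in the scenario above
      }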

       

      So in general,

      If:

      • More than one Disk Snapshot is received by a replica node, and
      • Each Disk Snapshot contains a completed SyncWrite (Committed or Aborted) for the same key, and
      • The flusher has not completed flushing the first Disk Snapshot before the Commit/Abort in the second Disk Snapshot is received,

      Then:

      • The replica will incorrectly reject the DCP_COMMIT/ABORT in the second snapshot
      • As a result, an exception is thrown

      That will cause:

      • The DCP connection to be closed, if the DCP_COMMIT/ABORT is processed in a front-end thread (the common case)
      • Or a memcached crash, if the DCP_COMMIT/ABORT is processed in a bg-thread (e.g., a buffered message processed in the DcpConsumerTask)

      In both cases, if a rebalance is in progress then it will fail.
      If the vBucket is in steady-state, then the connection should be re-established by ns_server (after the node is restarted, if it had crashed) and the nodes will retry.
      Once the flusher completes flushing the first Disk Snapshot, the problem should no longer occur.

    People

      Assignee: Paolo Cocchi (paolo.cocchi)
      Reporter: Paolo Cocchi (paolo.cocchi)
