Couchbase Server / MB-37063

Replica may fail when receiving multiple consecutive Disk Checkpoints


Details

    • Triage: Triaged
    • Is this a Regression?: No
    • Sprint: KV-Engine Mad-Hatter GA, KV Sprint 2019-12

    Description

      The issue is in PassiveDurabilityMonitor::completeSyncWrite.

      296  void PassiveDurabilityMonitor::completeSyncWrite(
      297          const StoredDocKey& key,
      298          Resolution res,
      299          boost::optional<uint64_t> prepareSeqno) {
      300      auto s = state.wlock();
      301  
      302      // If we are receiving a disk snapshot, we need to relax a few checks
      303      // to account for deduplication. E.g., commits may appear to be out
      304      // of order
      305      bool enforceOrderedCompletion = !vb.isReceivingDiskSnapshot();
      ..
      321      // If we can complete out of order, we have to check from the start of
      322      // tracked writes as the HCS may have advanced past a prepare we have not
      323      // seen a completion for
      324      auto next = enforceOrderedCompletion
      325                          ? s->getIteratorNext(s->highCompletedSeqno.it)
      326                          : s->trackedWrites.begin();
      327  
      328      if (!enforceOrderedCompletion) {
      329          // Advance the iterator to the right item, it might not be the first
      330          while (next != s->trackedWrites.end() && next->getKey() != key) {
      331              next = s->getIteratorNext(next);
      332          }
      333      }
      ..  
      358      if (prepareSeqno && next->getBySeqno() != static_cast<int64_t>(*prepareSeqno)) {
      359          std::stringstream ss;
      360          ss << "Pending resolution for '" << *next
      361             << "', but received unexpected " + to_string(res) + " for key "
      362             << cb::tagUserData(key.to_string())
      363             << " different prepare seqno: " << *prepareSeqno;
      364          throwException<std::logic_error>(__func__, "" + ss.str());
      365      }
      ..
      397  
      398      // HCS may have moved, which could make some Prepare eligible for removal.
      399      s->checkForAndRemovePrepares();
      ..
      410  }
      

      Scenario example

      Replica receives the following for the same <key>:

      • PRE:1 and M:2 (logical CMT:2) in a first Disk Snapshot(1, 2)
      • <The flusher has not persisted anything yet>
      • PRE:3 and M:4 (logical CMT:4) in a second Disk Snapshot(3, 4)

       

      Important Note: when we process M:2 we do not remove PRE:1 from PDM::State::trackedWrites at line 399.
      The reason is that we remove only locally-satisfied prepares, and PRE:1 is not locally satisfied because the flusher has not yet persisted the entire Disk Snapshot(1, 2).
      See the comments in PassiveDurabilityMonitor::State::updateHighPreparedSeqno for details.
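
      To make that concrete, here is a minimal, simplified model of the removal condition. The type and function names are hypothetical, not kv_engine's actual code; the assumptions are that a Prepare is removed only once it is both completed and covered by the High Prepared Seqno (HPS), and that for a Disk Snapshot the HPS advances only once the flusher has persisted the whole snapshot:

      #include <cstdint>

      // Hypothetical, simplified model of the check performed when removing
      // Prepares from trackedWrites (not the real kv_engine code).
      struct TrackedPrepare {
          uint64_t seqno;
          bool completed;
      };

      // A Prepare can be dropped only when it is completed AND locally
      // satisfied, i.e. its seqno is covered by the High Prepared Seqno.
      // While the flusher has not persisted Disk Snapshot(1, 2) the HPS is
      // still 0, so PRE:1 (completed by M:2) stays in trackedWrites.
      bool canRemoveFromTrackedWrites(const TrackedPrepare& p,
                                      uint64_t highPreparedSeqno) {
          return p.completed && p.seqno <= highPreparedSeqno;
      }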

       

      Now focus on what happens when we process M:4:

      • prepareSeqno = 3 (as M:4 is the logical commit for PRE:3)
      • PDM::State::trackedWrites contains {PRE:1(completed), PRE:3(in-flight)}
      • We execute the block at lines 328-333. After that block, next points to PRE:1(completed).   <-- This is the root cause of the issue
      • Given that next->getBySeqno() (1) != prepareSeqno (3), we enter the block at lines 358-365 and throw (see the sketch after this list).
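
      To make the walkthrough concrete, here is a minimal, self-contained sketch of that scan. All names (Prepare, completeOutOfOrder) are hypothetical stand-ins for the real PDM types; only the scan-from-begin() logic mirrors the code quoted above:

      #include <cstdint>
      #include <list>
      #include <stdexcept>
      #include <string>

      // Hypothetical, stripped-down model of the out-of-order completion path
      // (lines 328-333 and 358-365 above); not the real PassiveDurabilityMonitor.
      struct Prepare {
          std::string key;
          int64_t seqno;
          bool completed;
      };

      void completeOutOfOrder(std::list<Prepare>& trackedWrites,
                              const std::string& key,
                              int64_t prepareSeqno) {
          // Scan from the start of trackedWrites for the first Prepare with the
          // given key; nothing skips Prepares that are already completed.
          auto next = trackedWrites.begin();
          while (next != trackedWrites.end() && next->key != key) {
              ++next;
          }
          // With trackedWrites = {PRE:1(completed), PRE:3(in-flight)} and both
          // Prepares sharing the same key, 'next' stops at PRE:1, so the seqno
          // check fails and we throw, even though PRE:3 is the Prepare that
          // M:4 is logically committing.
          if (next == trackedWrites.end() || next->seqno != prepareSeqno) {
              throw std::logic_error("unexpected completion for key '" + key + "'");
          }
          next->completed = true;
      }

      int main() {
          std::list<Prepare> trackedWrites{{"key", 1, true}, {"key", 3, false}};
          try {
              completeOutOfOrder(trackedWrites, "key", 3);
          } catch (const std::logic_error&) {
              // Reached: the scan found PRE:1 (seqno 1) instead of PRE:3.
          }
      }

      Running the sketch takes the throwing path, which corresponds to the exception thrown at line 364 of the real code.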

       

      So in general,

      If:

      • More than one Disk Snapshot is received by a replica node, and
      • Each Disk Snapshot contains a completed SyncWrite (Committed or Aborted) for the same key, and
      • The flusher has not completed flushing the first Disk Snapshot before the Commit/Abort in the second Disk Snapshot is received,

      Then:

      • The replica will incorrectly reject the DCP_COMMIT/ABORT in the second snapshot
      • As a result an exception is thrown

      That will cause:

      • The DCP connection to be closed, if the DCP_COMMIT/ABORT is processed in a front-end thread (the common case)
      • Or memcached to crash, if the DCP_COMMIT/ABORT is processed in a background thread (e.g., a buffered message processed in the DcpConsumerTask)

      In both cases, if a rebalance is in progress then it will fail.
      If the vBucket is in steady state, the connection should be re-established by ns_server (after the node is restarted, if memcached had crashed) and the nodes will retry.
      Once the flusher has completed flushing the first Disk Snapshot, the problem should no longer occur.
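
      For illustration only: given the root cause above (the out-of-order scan stops at an already-completed Prepare), one hypothetical way to make the scan land on the in-flight Prepare is to also skip completed entries. This is only a sketch of the idea, reusing the hypothetical Prepare model from the sketch further up; it is not the fix actually made for this ticket:

      #include <cstdint>
      #include <list>
      #include <string>

      // Hypothetical Prepare model, as in the earlier sketch.
      struct Prepare {
          std::string key;
          int64_t seqno;
          bool completed;
      };

      // Skipping Prepares that are already completed makes the scan stop at
      // PRE:3(in-flight) instead of PRE:1(completed) in the scenario above,
      // so the subsequent prepareSeqno check would pass.
      std::list<Prepare>::iterator findInFlightPrepare(std::list<Prepare>& trackedWrites,
                                                       const std::string& key) {
          auto next = trackedWrites.begin();
          while (next != trackedWrites.end() &&
                 (next->key != key || next->completed)) {
              ++next;
          }
          return next;
      }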


      People

        Assignee: Paolo Cocchi
        Reporter: Paolo Cocchi
        Votes: 0
        Watchers: 5

