Details
- Type: Bug
- Resolution: Fixed
- Priority: Critical
- Affects Version/s: master, 6.5.0
- Triage: Triaged
- Is this a Regression?: No
- Sprint: KV-Engine Mad-Hatter GA, KV Sprint 2019-12
Description
The issue is in PassiveDurabilityMonitor::completeSyncWrite:
296 void PassiveDurabilityMonitor::completeSyncWrite(
297         const StoredDocKey& key,
298         Resolution res,
299         boost::optional<uint64_t> prepareSeqno) {
300     auto s = state.wlock();
301
302     // If we are receiving a disk snapshot, we need to relax a few checks
303     // to account for deduplication. E.g., commits may appear to be out
304     // of order
305     bool enforceOrderedCompletion = !vb.isReceivingDiskSnapshot();
..
321     // If we can complete out of order, we have to check from the start of
322     // tracked writes as the HCS may have advanced past a prepare we have not
323     // seen a completion for
324     auto next = enforceOrderedCompletion
325             ? s->getIteratorNext(s->highCompletedSeqno.it)
326             : s->trackedWrites.begin();
327
328     if (!enforceOrderedCompletion) {
329         // Advance the iterator to the right item, it might not be the first
330         while (next != s->trackedWrites.end() && next->getKey() != key) {
331             next = s->getIteratorNext(next);
332         }
333     }
..
358     if (prepareSeqno && next->getBySeqno() != static_cast<int64_t>(*prepareSeqno)) {
359         std::stringstream ss;
360         ss << "Pending resolution for '" << *next
361            << "', but received unexpected " + to_string(res) + " for key "
362            << cb::tagUserData(key.to_string())
363            << " different prepare seqno: " << *prepareSeqno;
364         throwException<std::logic_error>(__func__, "" + ss.str());
365     }
..
397
398     // HCS may have moved, which could make some Prepare eligible for removal.
399     s->checkForAndRemovePrepares();
..
410 }
|
Scenario example
Replica receives the following for the same <key>:
- PRE:1 and M:2 (logically a CMT:2) in a Disk Snapshot(1, 2)
- <The flusher has not persisted anything yet>
- PRE:3 and M:4 (logically a CMT:4) in a second Disk Snapshot(3, 4)
Important Note: when we process M:2 we do not remove PRE:1 from PDM::State::trackedWrites at line 399.
The reason is that we remove only locally-satisfied prepares, but PRE:1 is not locally-satisfied as the flusher has never persisted the entire Disk Snapshot(1, 2).
See comments in PassiveDurabilityMonitor::State::updateHighPreparedSeqno for details.
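That removal rule can be sketched as follows. This is a minimal model with a hypothetical Prepare type, not the actual kv_engine code; the real logic lives in PassiveDurabilityMonitor::State::checkForAndRemovePrepares and updateHighPreparedSeqno:

    #include <cstdint>

    // Hypothetical model of the removal rule: a completed prepare may only
    // be dropped from trackedWrites once it is also locally satisfied, i.e.
    // the high prepared seqno (HPS) has reached it. For a Disk Snapshot the
    // HPS only advances once the flusher has persisted the entire snapshot,
    // so PRE:1(completed) remains tracked in the scenario above.
    struct Prepare {
        int64_t seqno;
        bool completed;
    };

    bool canRemove(const Prepare& p, int64_t highPreparedSeqno) {
        return p.completed && p.seqno <= highPreparedSeqno;
    }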
Focus on when we process M:4 now:
- prepareSeqno = 3 (as M:4 is the commit for PRE:3)
- PDM::State::trackedWrites contains {PRE:1(completed), PRE:3(in-flight)}
- We execute the block at lines 328-333. After the block, next points to PRE:1(completed), as the scan only compares keys and never skips completed prepares. <-- This is the root cause of the issue (see the sketch after this list)
- Given that next->getBySeqno() (1) != prepareSeqno (3), we enter the block at lines 358-365 and throw.
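To make the faulty scan concrete, here is a minimal, self-contained C++ sketch of the state above. The TrackedWrite struct is a hypothetical stand-in for the real kv_engine type, but the loop mirrors the one at lines 330-332:

    #include <cstdint>
    #include <iostream>
    #include <list>
    #include <string>

    // Hypothetical stand-in for a tracked SyncWrite.
    struct TrackedWrite {
        std::string key;
        int64_t bySeqno;
        bool completed; // PRE:1 stays tracked (completed) until the flusher
                        // persists the whole of Disk Snapshot(1, 2)
    };

    int main() {
        // State when M:4 (the commit for PRE:3) is processed.
        std::list<TrackedWrite> trackedWrites{{"key", 1, true},
                                              {"key", 3, false}};
        const std::string key = "key";
        const int64_t prepareSeqno = 3;

        // Out-of-order completion scan, as at lines 330-332: advance until
        // the key matches. It stops at PRE:1(completed), not PRE:3(in-flight).
        auto next = trackedWrites.begin();
        while (next != trackedWrites.end() && next->key != key) {
            ++next;
        }

        if (next->bySeqno != prepareSeqno) {
            // Equivalent of the block at lines 358-365: the replica throws
            // even though the Commit for PRE:3 is legal.
            std::cout << "throw: found seqno " << next->bySeqno
                      << ", expected prepare seqno " << prepareSeqno << "\n";
        }
        return 0;
    }

The scan stops on the first tracked write for <key> regardless of its completion state, which is exactly what happens with PRE:1 above.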
So in general,
If:
- More than one Disk Snapshot is received by a replica node, and
- Each Disk Snapshot contains a completed SyncWrite (Committed or Aborted) for the same key, and
- The flusher has not completed flushing the first Disk Snapshot before the Commit/Abort in the second Disk Snapshot is received,
Then:
- The replica will incorrectly reject the DCP_COMMIT/ABORT in the second snapshot
- As a result, an exception is thrown
That will cause:
- The DCP connection to be closed, if the DCP_COMMIT/ABORT is processed in a front-end thread (common case)
- Or memcached to crash, if the DCP_COMMIT/ABORT is processed in a bg-thread (e.g., a buffered message processed in the DcpConsumerTask)
In both cases, if a rebalance is in progress then it will fail.
If the vBucket is in steady-state, then the connection should be re-established by ns_server (after the node is restarted, if it had crashed) and the nodes will retry.
Once the flusher completes flushing the first Disk Snapshot, the problem should no longer occur.
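One possible direction for a fix (a sketch of the idea only, using the same hypothetical TrackedWrite model as above, and not necessarily the actual patch) is to make the out-of-order scan skip tracked writes that are already completed, so it can only stop on the in-flight prepare for the key:

    #include <cstdint>
    #include <list>
    #include <string>

    struct TrackedWrite {
        std::string key;
        int64_t bySeqno;
        bool completed;
    };

    // Sketch: never stop the scan on a prepare that has already been
    // completed, so a still-tracked PRE:1(completed) cannot shadow the
    // in-flight PRE:3 for the same key.
    std::list<TrackedWrite>::iterator findInFlightPrepare(
            std::list<TrackedWrite>& trackedWrites, const std::string& key) {
        auto next = trackedWrites.begin();
        while (next != trackedWrites.end() &&
               (next->completed || next->key != key)) {
            ++next;
        }
        return next;
    }

With this variant, the scenario above finds PRE:3(in-flight), the prepareSeqno check at lines 358-365 passes, and the Commit is applied.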
Issue Links
- is triggering: MB-37206 [SR - Test Only] Expand test scenarios for 'prepare completed but still tracked at Replica at OoO completion' (Closed)