Couchbase Server / MB-37063

Replica may fail when receiving multiple consecutive Disk Checkpoints


Details

    • Triage: Triaged
    • Is this a Regression?: No
    • Sprint: KV-Engine Mad-Hatter GA, KV Sprint 2019-12

    Description

      The issue is in PassiveDurabilityMonitor::completeSyncWrite.

      296  void PassiveDurabilityMonitor::completeSyncWrite(
      297          const StoredDocKey& key,
      298          Resolution res,
      299          boost::optional<uint64_t> prepareSeqno) {
      300      auto s = state.wlock();
      301  
      302      // If we are receiving a disk snapshot, we need to relax a few checks
      303      // to account for deduplication. E.g., commits may appear to be out
      304      // of order
      305      bool enforceOrderedCompletion = !vb.isReceivingDiskSnapshot();
      ..
      321      // If we can complete out of order, we have to check from the start of
      322      // tracked writes as the HCS may have advanced past a prepare we have not
      323      // seen a completion for
      324      auto next = enforceOrderedCompletion
      325                          ? s->getIteratorNext(s->highCompletedSeqno.it)
      326                          : s->trackedWrites.begin();
      327  
      328      if (!enforceOrderedCompletion) {
      329          // Advance the iterator to the right item, it might not be the first
      330          while (next != s->trackedWrites.end() && next->getKey() != key) {
      331              next = s->getIteratorNext(next);
      332          }
      333      }
      ..  
      358      if (prepareSeqno && next->getBySeqno() != static_cast<int64_t>(*prepareSeqno)) {
      359          std::stringstream ss;
      360          ss << "Pending resolution for '" << *next
      361             << "', but received unexpected " + to_string(res) + " for key "
      362             << cb::tagUserData(key.to_string())
      363             << " different prepare seqno: " << *prepareSeqno;
      364          throwException<std::logic_error>(__func__, "" + ss.str());
      365      }
      ..
      397  
      398      // HCS may have moved, which could make some Prepare eligible for removal.
      399      s->checkForAndRemovePrepares();
      ..
      410  }
      

      Scenario example

      Replica receives the following for the same <key>:

      • PRE:1 and M:2 (logically CMT:2) in a Disk Snapshot(1, 2)
      • <The flusher has not persisted anything yet>
      • PRE:3 and M:4 (logically CMT:4) in a second Disk Snapshot(3, 4)

       

      Important Note: when we process M:2 we do not remove PRE:1 from PDM::State::trackedWrites at line 399.
      The reason is that we remove only locally-satisfied prepares, but PRE:1 is not locally-satisfied as the flusher has never persisted the entire Disk Snapshot(1, 2).
      See comments in PassiveDurabilityMonitor::State::updateHighPreparedSeqno for details.
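      To make the note concrete, here is a minimal sketch of the removal condition (simplified, assumed semantics only, not the real updateHighPreparedSeqno code; persistedSeqno and snapshotEnd are illustrative names): within a Disk Snapshot, a completed prepare only becomes locally-satisfied, and hence removable from trackedWrites, once the flusher has persisted up to the snapshot end.

      {code}
      // Simplified sketch (assumed semantics, NOT the real PDM implementation):
      // in a Disk Snapshot a tracked prepare may only be removed once the whole
      // snapshot has been persisted locally, even if it is already completed.
      #include <cstdint>

      bool removableFromTrackedWrites(bool completed,
                                      uint64_t prepareSeqno,
                                      uint64_t persistedSeqno,
                                      uint64_t snapshotEnd) {
          const bool locallySatisfied =
                  persistedSeqno >= snapshotEnd && persistedSeqno >= prepareSeqno;
          return completed && locallySatisfied;
      }

      // Scenario above: PRE:1 is completed by M:2, but persistedSeqno is still 0
      // (< snapshotEnd == 2), so PRE:1 is not locally-satisfied and stays in
      // trackedWrites when the second Disk Snapshot arrives.
      {code}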

       

      Now focus on when we process M:4:

      • prepareSeqno = 3 (as M:4 is the commit for PRE:3)
      • PDM::State::trackedWrites contains {PRE:1(completed), PRE:3(in-flight)}
      • We execute the block at lines 328-333; after the block, next points to PRE:1(completed).                     <-- This is the root cause of the issue
      • Given that next->getBySeqno() (1) != prepareSeqno (3), we enter the block at lines 358-365 and throw (reproduced in the sketch below).
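      The following self-contained sketch (hypothetical types only, not ep-engine code; std::list stands in for the PDM's tracked-writes container) reproduces the faulty lookup: with ordered completion relaxed, the scan starts at trackedWrites.begin() and stops at the first element with a matching key, which is the stale, already-completed PRE:1 rather than the in-flight PRE:3, so the subsequent prepare-seqno check fails.

      {code}
      // Hypothetical, minimal model of the lookup performed at lines 324-333 above.
      #include <cassert>
      #include <cstdint>
      #include <list>
      #include <string>

      struct TrackedWrite {
          std::string key;
          int64_t bySeqno;
          bool completed;
      };

      int main() {
          // State when M:4 (the commit for PRE:3) arrives: PRE:1 is completed but
          // not yet removable because Disk Snapshot(1, 2) has not been persisted.
          std::list<TrackedWrite> trackedWrites{{"key", 1, true},   // PRE:1 (completed)
                                                {"key", 3, false}}; // PRE:3 (in-flight)

          const std::string key = "key";
          const int64_t prepareSeqno = 3; // carried by the DCP_COMMIT for M:4

          // enforceOrderedCompletion == false (disk snapshot): start from begin()
          // and advance to the first tracked write with a matching key.
          auto next = trackedWrites.begin();
          while (next != trackedWrites.end() && next->key != key) {
              ++next;
          }

          // The scan stops at PRE:1 (seqno 1), not PRE:3 (seqno 3)...
          assert(next != trackedWrites.end() && next->bySeqno == 1);
          // ...so the check at line 358 sees a seqno mismatch; the real code then
          // throws std::logic_error, which tears down the DCP connection.
          assert(next->bySeqno != prepareSeqno);
          return 0;
      }
      {code}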

       

      So in general,

      If:

      • More than one Disk Snapshot is received by a replica node, and
      • Each Disk Snapshot contains a completed SyncWrite (Committed or Aborted) for the same key, and
      • The flusher has not completed flushing the first Disk Snapshot before the Commit/Abort in the second Disk Snapshot is received,

      Then:

      • The replica will incorrectly reject the DCP_COMMIT/ABORT in the second snapshot
      • As a result an exception is thrown

      That will cause:

      • The DCP connection to be closed, if the DCP_COMMIT/ABORT is processed in a front-end thread (common case)
      • Or a memcached crash, if the DCP_COMMIT/ABORT is processed in a background thread (e.g., a buffered message processed by the DcpConsumerTask)

      In both cases, if a rebalance is in progress then it will fail.
      If the vBucket is in steady state, the connection should be re-established by ns_server (after the node is restarted, if it had crashed) and the nodes will retry.
      Once the flusher completes flushing the first Disk Snapshot, then the problem should no longer occur.
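      One way to avoid landing on the stale prepare is sketched below. This is only a possible adjustment under the stated assumptions (a hypothetical "completed" flag on the tracked write), not necessarily the fix that was merged for this ticket: when completion may be out of order, the scan skips tracked writes that are already completed, so it lands on the in-flight prepare for the key.

      {code}
      // Sketch of a possible adjustment (hypothetical, self-contained; not
      // necessarily the merged fix): skip already-completed tracked writes while
      // scanning out of order, so the lookup finds the in-flight prepare.
      #include <cstdint>
      #include <list>
      #include <string>

      struct TrackedWrite {
          std::string key;
          int64_t bySeqno;
          bool completed;
      };

      using Iterator = std::list<TrackedWrite>::iterator;

      Iterator findInFlightPrepare(std::list<TrackedWrite>& trackedWrites,
                                   const std::string& key) {
          auto it = trackedWrites.begin();
          while (it != trackedWrites.end() && (it->key != key || it->completed)) {
              ++it;
          }
          return it;
      }

      // With trackedWrites == {PRE:1(completed), PRE:3(in-flight)} this returns an
      // iterator to PRE:3 (seqno 3), so the prepareSeqno check for M:4 would pass
      // and the commit would be applied instead of throwing.
      {code}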

          Activity

            paolo.cocchi Paolo Cocchi created issue -
            drigby Dave Rigby made changes -
            Field Original Value New Value
            Affects Version/s master [ 13407 ]
            drigby Dave Rigby made changes -
            Description TBD
            owend Daniel Owen made changes -
            Affects Version/s Mad-Hatter [ 15037 ]
            owend Daniel Owen made changes -
            Fix Version/s Cheshire-Cat [ 15915 ]
            owend Daniel Owen made changes -
            Labels 6.5.1-candidate
            drigby Dave Rigby made changes -
            Description TBD
            drigby Dave Rigby made changes -
            Fix Version/s 6.5.1 [ 16622 ]
            Labels 6.5.1-candidate
            drigby Dave Rigby made changes -
            Fix Version/s Cheshire-Cat [ 15915 ]
            drigby Dave Rigby made changes -
            Sprint KV Sprint 2019-12 [ 939 ]
            drigby Dave Rigby made changes -
            Rank Ranked higher
            drigby Dave Rigby made changes -
            Rank Ranked higher
            paolo.cocchi Paolo Cocchi made changes -
            Description
            paolo.cocchi Paolo Cocchi made changes -
            Priority Major [ 3 ] Critical [ 2 ]
            paolo.cocchi Paolo Cocchi made changes -
            Fix Version/s 6.5.0 [ 16624 ]
            Fix Version/s 6.5.1 [ 16622 ]
            paolo.cocchi Paolo Cocchi made changes -
            Triage Untriaged [ 10351 ] Triaged [ 10350 ]
            paolo.cocchi Paolo Cocchi made changes -
            Link This issue causes MB-37183 [ MB-37183 ]
            paolo.cocchi Paolo Cocchi made changes -
            Description
            paolo.cocchi Paolo Cocchi made changes -
            Description
            paolo.cocchi Paolo Cocchi made changes -
            Description
            owend Daniel Owen made changes -
            Fix Version/s Mad-Hatter [ 15037 ]
            paolo.cocchi Paolo Cocchi made changes -
            Is this a Regression? Unknown [ 10452 ] No [ 10451 ]
            owend Daniel Owen made changes -
            Sprint KV Sprint 2019-12 [ 939 ] KV-Engine Mad-Hatter GA [ 910 ]
            owend Daniel Owen made changes -
            Rank Ranked lower
            owend Daniel Owen made changes -
            Assignee Paolo Cocchi [ paolo.cocchi ] Daniel Owen [ owend ]
            owend Daniel Owen made changes -
            Status Open [ 1 ] In Progress [ 3 ]
            owend Daniel Owen made changes -
            Assignee Daniel Owen [ owend ] Paolo Cocchi [ paolo.cocchi ]
            owend Daniel Owen made changes -
            Due Date 09/Dec/19
            dfinlay Dave Finlay made changes -
            Labels approved-for-mad-hatter
            dfinlay Dave Finlay made changes -
            Link This issue blocks MB-36676 [ MB-36676 ]
            drigby Dave Rigby made changes -
            Fix Version/s 6.5.0 [ 16624 ]
            drigby Dave Rigby made changes -
            Sprint KV-Engine Mad-Hatter GA [ 910 ] KV-Engine Mad-Hatter GA, KV Sprint 2019-12 [ 910, 939 ]
            drigby Dave Rigby made changes -
            Rank Ranked higher
            drigby Dave Rigby made changes -
            Rank Ranked higher
            paolo.cocchi Paolo Cocchi made changes -
            Resolution Fixed [ 1 ]
            Status In Progress [ 3 ] Resolved [ 5 ]
            paolo.cocchi Paolo Cocchi made changes -
            Status Resolved [ 5 ] Closed [ 6 ]
            paolo.cocchi Paolo Cocchi made changes -
            Link This issue is triggering MB-37206 [ MB-37206 ]
            wayne Wayne Siu made changes -
            Link This issue blocks MB-37192 [ MB-37192 ]
            wayne Wayne Siu made changes -
            Link This issue blocks MB-37192 [ MB-37192 ]

            People

              Assignee: Paolo Cocchi (paolo.cocchi)
              Reporter: Paolo Cocchi (paolo.cocchi)
              Votes: 0
              Watchers: 5

              Dates

                Created:
                Updated:
                Resolved:

                Gerrit Reviews

                  There are no open Gerrit changes
