Details
- Bug
- Resolution: Fixed
- Critical
- 6.5.0
- Triaged
- No
- KV-Engine Mad-Hatter Beta
Description
Problem:
- Replicas are not supposed to ack a seqno higher than that of a received Persist level prepare until that prepare is persisted.
- Disk snapshots may dedupe prepares; a Majority level Prepare may dedupe a Persist level prepare.
- A replica receiving a disk snapshot does not know if there were Persist level prepares that have been deduped.
- Acking a seqno tells the active that all Prepares of seqno<=ackSeqno have met their durability requirements locally on the replica.
If a replica receives a Majority level Prepare from a disk snapshot, it is potentially incorrect to ack that seqno - there may be a previous Persist level Prepare that was deduped, and we might not yet have persisted the appropriate value for that key. To wit, we have "jumped" the durability fence.
To clarify with a scenario:
If the active receives the following ops (for one key)
PRE(Persist):1 CMT:2 PRE(Majority):3
The replica will see instead
SET:2 PRE(Majority):3
(NB: Set sent instead because of MB-34789)
The replica would ack seqno 3 at the snapshot end, without regard to whether the SET or PRE has been persisted (a Majority level Prepare is immediately satisfied locally on a replica, since residing in memory is all that level requires).
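The scenario can be modeled with a short sketch (a hypothetical single-key model for illustration only; the function names and tuple layout are not actual KV-Engine code):

```python
# Hypothetical model of one key's ops; tuples are (kind, level, seqno).

def disk_snapshot(active_ops):
    """Model a disk snapshot backfill for a single key: commits are
    streamed as plain SETs (MB-34789) and dedupe keeps only the
    latest prepare."""
    mutation = None
    prepare = None
    for kind, level, seqno in active_ops:
        if kind == "PRE":
            prepare = ("PRE", level, seqno)  # later prepare dedupes earlier one
        else:  # a CMT is stored on disk, and streamed, as a SET
            mutation = ("SET", None, seqno)
    return sorted((op for op in (mutation, prepare) if op),
                  key=lambda op: op[2])

def replica_ack(snapshot, persisted_upto):
    """The buggy behaviour: a Majority level Prepare is satisfied as soon
    as it is in memory, so the snapshot-end ack ignores persistence."""
    return max(seqno for _, _, seqno in snapshot)

active = [("PRE", "Persist", 1), ("CMT", None, 2), ("PRE", "Majority", 3)]
snap = disk_snapshot(active)
assert snap == [("SET", None, 2), ("PRE", "Majority", 3)]
# Nothing has been persisted yet, but the replica still acks seqno 3:
assert replica_ack(snap, persisted_upto=0) == 3
```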
This state is unacceptable; if the active fails over, this replica may be promoted, provided it has acked a seqno at least as high as any other replica. If this new active then dies and comes back up, we have lost the correct value for that key; it was not persisted to disk when we acked it. Therefore, we may have broken the durability contract if we reported SUCCESS to the client after committing the prepared value.
Solution:
By effectively "promoting" all prepares received during a disk snapshot to Persist level we can ensure we will not implicitly acknowledge any deduped Persist level Prepares before their value has been persisted - once the later "promoted" prepare is persisted, we know the preceding SET has been persisted also.
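The effect of the fix can be sketched under the same simplified single-key model (illustrative names only, not the actual KV-Engine implementation): because every prepare backfilled from disk is treated as Persist level, the seqno ack cannot advance past a prepare that has not been persisted.

```python
def replica_ack_promoted(snapshot, persisted_upto):
    """Simplified model of the fix: every prepare in a disk snapshot is
    treated as Persist level, so the seqno ack stops at the first
    prepare that has not yet been persisted locally."""
    ack = 0
    for kind, level, seqno in snapshot:  # tuples are (kind, level, seqno)
        if kind == "PRE" and seqno > persisted_upto:
            break  # promoted prepare not persisted yet: do not ack past it
        ack = seqno
    return ack

# Replica's view of the disk snapshot: SET:2 (the commit), PRE(Majority):3.
snap = [("SET", None, 2), ("PRE", "Majority", 3)]
# Before anything is persisted, seqno 3 is never acked...
assert replica_ack_promoted(snap, persisted_upto=0) < 3
# ...and once the promoted prepare at 3 is persisted, the preceding SET
# must have been persisted too (persistence proceeds in seqno order),
# so acking 3 is safe.
assert replica_ack_promoted(snap, persisted_upto=3) == 3
```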
Issue Links
- has to be done after: MB-34906 PDM might never seqnoAck if persistence lags behind (1/3) [ETA 2019/7/10] (Closed)
- has to be done before: MB-34516 Replica should handle deduped commits if backfilling from disk (3/3) [ETA 2019/7/19] (Closed)
- is duplicated by: MB-34947 Rebalance failed with "mover crashed" in bucket_replica update (Closed)