Details
- Type: Improvement
- Resolution: Unresolved
- Priority: Critical
- Fix Version: 6.5.0
Description
Summary
Committed SyncWrites may be lost in a multi-failure scenario if the second failure occurs just after a rollback.
Status (As of 29th May 2019):
From Paolo Cocchi's comment below:
Status
HPS moves at snapshot-boundaries in PassiveDM (MB-34197, closed):
- Level Majority and MajorityAndPersistOnMaster Prepares are satisfied (up to the durability-fence, if any) only once the complete snapshot has been received on the PassiveStream
- Level PersistToMajority Prepares are satisfied only once the complete snapshot has been persisted
The next step is to ensure that at Replica promotion "KV Engine sets the branch point as the endpoint of the highest snapshot received" (see the MB description).
Currently we are holding off on proceeding with this, as we have found further (possible) issues on the Compaction side (https://docs.google.com/document/d/1tdWCh2KgkFEm1e0pgu_wuupSnRb4xIGsv92CLmVIP8M/edit).
Details
Currently (v6.0), when a vBucket is failed over and a replica is promoted to active, a failover table branch is created at the highest complete persisted snapshot. Any other replica which re-connects to this new active is required to roll back to this failover branch point. However, if the new active also fails just after such a replica has rolled back past (and discarded) a prepared SyncWrite, that SyncWrite can be lost.
Consider the following scenario, with 4 nodes (3 replicas), majority=3.
Setup
- Active node N0 has snapshots [1:SET, 2:SET], [3:PREPARE(x, level=majority)], [4:COMMIT(x)], vb UUID=AAAA. No mutations have yet been persisted to disk.
- Replicas N1 & N2 have both received up to seqno 3, but have only persisted up to seqno 2.
- Replica N3 has not received anything yet.
Scenario
- Active crashes, ns_server triggers auto-failover.
- ns_server selects N1 as the new active (N2 is identical; ns_server could equally select it, which doesn't matter for this scenario).
- Upon promotion to active, N1 creates a new failover table at highest (complete) persisted snapshot (which is 2) with a new vb UUID = BBBB.
- N2 re-connects to the new active and performs an ADD_STREAM negotiation. Given the histories diverge after seqno 2, N2 is told to roll back to seqno=2.
- N2 issues the rollback, discarding 3:PREPARE.
- N1 crashes, ns_server triggers auto-failover.
- ns_server selects N2 as the new active (the higher of the two remaining replicas, N2 & N3); however, 3:PREPARE has been lost due to the rollback, breaking the SyncWrite contract.
(Full details can be found under Q: Replicas Rolling Back to Failover Branch Points Can Lead to Data Loss in the design doc.)
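The failure sequence above can be reproduced as a small simulation. The sketch below is a hypothetical model, not KV Engine code: the v6.0 failover-table rule (branch point at the highest complete persisted snapshot) and the rollback rule are reduced to plain Python, and all names (`promote`, `rollback_to`, the replica dict fields) are invented for illustration.

```python
# Hypothetical sketch of the multi-failure scenario described above (not KV Engine code).
# Each replica tracks: items received in memory, and the endpoint of its
# highest *complete persisted* snapshot.

def promote(replica):
    """v6.0 behaviour: new failover branch at the highest complete persisted snapshot."""
    return replica["highest_complete_persisted_snapshot"]

def rollback_to(replica, branch_point):
    """Discard everything the replica holds above the branch point."""
    replica["memory"] = {s: m for s, m in replica["memory"].items() if s <= branch_point}

# Setup: N1/N2 have received up to seqno 3 (3:PREPARE) but persisted only up to 2.
def make_replica():
    return {
        "memory": {1: "SET", 2: "SET", 3: "PREPARE(x, majority)"},
        "highest_complete_persisted_snapshot": 2,
    }

n1, n2 = make_replica(), make_replica()

# N0 (active) crashes; N1 is promoted. Branch point = seqno 2 (vb UUID BBBB).
branch_point = promote(n1)

# N2 re-connects, is told to roll back to the branch point, and discards 3:PREPARE.
rollback_to(n2, branch_point)

# N1 crashes; N2 is promoted. The majority-acknowledged 3:PREPARE is gone.
print("3:PREPARE lost after rollback:", 3 not in n2["memory"])
```

Running it shows that once N2 is promoted, no surviving node holds the majority-acknowledged 3:PREPARE.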
Solution
To avoid this problem the following changes are needed:
1. In addition to the other criteria governing acknowledgement of durable writes, replicas should only acknowledge durable writes once the last mutation in a snapshot has been received by the replica.
2. As a consequence of (1), the high_prepared_seqno will always sit at a snapshot boundary (as we don't consider any Prepare satisfied until we have a complete snapshot).
3. KV Engine should set the branch point as the endpoint of the highest snapshot it's aware of - not necessarily the persisted snapshot.
4. As a consequence of (3), once a complete snapshot has been received we can move the HPS over Level Majority and MajorityAndPersistOnMaster Prepares up to the durability-fence (if any), but we must wait for the complete snapshot to be persisted before the HPS can cover PersistToMajority Prepares. Otherwise, locally-satisfied PersistToMajority Prepares could be rolled back in certain failure scenarios (similar to the example described above).
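The rules above can be sketched as a small state machine. This is a minimal, hypothetical model assuming a single in-flight snapshot; the class and method names only loosely mirror the real PassiveDurabilityMonitor and are not its actual API.

```python
# Hypothetical sketch of the proposed snapshot-boundary HPS rules (not KV Engine code).
# - Nothing is acknowledged until the complete snapshot has been received (point 1),
#   so the HPS only ever lands on snapshot boundaries (point 2).
# - On snapshot completion the HPS moves over Majority / MajorityAndPersistOnMaster
#   Prepares, but stops at the first unpersisted PersistToMajority Prepare -
#   the "durability fence" (point 4).

class PassiveDurabilityMonitor:
    def __init__(self):
        self.prepares = []        # (seqno, level) in seqno order
        self.snapshot_end = None  # set once the complete snapshot has been received
        self.persisted = set()    # seqnos persisted locally
        self.hps = 0              # high_prepared_seqno

    def receive(self, seqno, level, snap_end):
        self.prepares.append((seqno, level))
        if seqno == snap_end:            # complete snapshot received
            self.snapshot_end = snap_end
        self._update_hps()

    def persist(self, seqno):
        self.persisted.add(seqno)
        self._update_hps()

    def _update_hps(self):
        if self.snapshot_end is None:    # incomplete snapshot: HPS cannot move
            return
        for seqno, level in self.prepares:
            if level == "PersistToMajority" and seqno not in self.persisted:
                break                    # durability fence: stop here
            self.hps = max(self.hps, seqno)

# Example: SNAP(1,2) with 1:PRE(Majority), 2:PRE(PersistToMajority).
pdm = PassiveDurabilityMonitor()
pdm.receive(1, "Majority", snap_end=2)           # snapshot incomplete: HPS stays 0
pdm.receive(2, "PersistToMajority", snap_end=2)  # snapshot complete: HPS -> 1 (fence at 2)
pdm.persist(2)                                   # fence lifted: HPS -> 2
print("final HPS:", pdm.hps)
```

The example reproduces the "Tomorrow" behaviour of Scenario 2 below: the replica cannot acknowledge anything until the snapshot is complete, then acknowledges up to the fence, then the rest once the PersistToMajority Prepare is persisted.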
Example Scenarios for HPS updates (points 1, 2 and 4 above)
Key: x = point at which the replica acknowledges; A:n = seqnos acknowledged at that point; ^ = earliest point at which the replica can ACK.
Scenario 1 - Two level=majority Prepares
SNAP(1,2) 1:PRE(majority) 2:PRE(majority)

Replica mem:  --S-------1-------2---------------------------
                        x       x
Today:                 A:1     A:2

                                x
Tomorrow:                     A:1,2
Scenario 2 - Level=majority, level=persistMajority Prepares
SNAP(1,2) 1:PRE(majority) 2:PRE(persistMajority)

Replica mem:  --S-------1-------2--------------------------
Replica disk: ------------------------------[1]------[2]--
                        x                             x
Today:                 A:1                           A:2

                                x                     x
Tomorrow:                      A:1                   A:2
Scenario 3 - Level=persistMajority, level=majority Prepares
SNAP(1,2) 1:PRE(persistMajority) 2:PRE(majority)

Replica mem:  --S-------1-------2----------------------------
Replica disk: -------------------------[1]------[2]---------
                                        x
Today:                                A:1,2

                                        x
Tomorrow:                             A:1,2
                                        ^
                                        earliest point can ACK
Scenario 4 - Two level=persistMajority Prepares
SNAP(1,2) 1:PRE(persistMajority) 2:PRE(persistMajority)

Replica mem:  --S-------1-------2--------------------------
Replica disk: ------------------------------[1]------[2]--
                                             x        x
Today:                                      A:1      A:2

                                                      x
Tomorrow:                                           A:1,2
Scenario 5 - Complete snapshot received after the level=persistMajority Prepare is persisted to disk.
SNAP(1,3) 1:PRE(majority) 2:PRE(persistMajority) 3:PRE(majority)

Replica mem:  --S------1-----2-----------------------3----------
Replica disk: -----------------[1]----------[2]----------------
                       x                     x       x
Today:                A:1                   A:2     A:3

                                                     x
Tomorrow:                                          A:1,3
                                                     ^
                                                     earliest point can ACK
Scenario 6 - Complete snapshot received before the level=persistMajority Prepare is persisted to disk.
SNAP(1,3) 1:PRE(majority) 2:PRE(persistMajority) 3:PRE(majority)

Replica mem:  --S------1-----2------3------------------------
Replica disk: -----------------[1]----------[2]-------------
                       x                     x
Today:                A:1                  A:2,3

                                    x        x
Tomorrow:                          A:1     A:2,3
                                    ^
                                    earliest point can ACK
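As a cross-check, Scenario 6 above can be replayed with a small event-driven model comparing today's and tomorrow's acknowledgement points. This is a hypothetical sketch keyed to the rules in the Solution section, not KV Engine code; `ack_points` and the event encoding are invented for illustration.

```python
# Hypothetical model of Today vs Tomorrow ACK points for Scenario 6:
# SNAP(1,3) with 1:PRE(majority), 2:PRE(persistMajority), 3:PRE(majority),
# where the complete snapshot is received before seqno 2 is persisted.
# Events are ("recv", seqno) and ("persist", seqno), in timeline order.

EVENTS = [("recv", 1), ("recv", 2), ("recv", 3), ("persist", 1), ("persist", 2)]
LEVELS = {1: "majority", 2: "persistMajority", 3: "majority"}
SNAP_END = 3

def ack_points(events, today):
    """Return [(event, hps)] for every event that advances the HPS."""
    received, persisted, hps, acks = set(), set(), 0, []
    for ev, seqno in events:
        (received if ev == "recv" else persisted).add(seqno)
        snapshot_complete = SNAP_END in received
        new_hps = hps
        for s in sorted(received):
            # Today: a majority Prepare is satisfied on receipt;
            # Tomorrow: only once the complete snapshot has been received.
            if LEVELS[s] == "majority" and not (today or snapshot_complete):
                break
            # A persistMajority Prepare acts as a durability fence
            # until it is persisted locally (in both models).
            if LEVELS[s] == "persistMajority" and s not in persisted:
                break
            new_hps = s
        if new_hps > hps:
            acks.append(((ev, seqno), new_hps))
            hps = new_hps
    return acks

print("Today:   ", ack_points(EVENTS, today=True))
print("Tomorrow:", ack_points(EVENTS, today=False))
```

Both models converge on A:2,3 at the persist of seqno 2, but Tomorrow delays the first ACK (A:1) from the receipt of seqno 1 to the receipt of seqno 3, when the snapshot is complete.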
Gerrit Reviews
For Gerrit Dashboard: MB-34150

# | Subject | Branch | Project | Status | CR | V
---|---|---|---|---|---|---
130568,2 | MB-34150: Assume ItemsToFlush::ranges not empty if at least one item | master | kv_engine | NEW | 0 | +1
130637,6 | MB-34150: Add test to show how we update the Persisted Snap Range | master | kv_engine | NEW | 0 | -1
130771,4 | MB-34150: Dissect FailoverTable::needsRollback and enhance comments | master | kv_engine | NEW | -1 | +1
130779,2 | MB-34150: Fix "empty snapshot" optimization in FT::needsRollback | master | kv_engine | NEW | 0 | +1
130786,1 | MB-34150: Remove "complete snapshot" optimization in FT::needsRollback | master | kv_engine | NEW | -1 | -1
131015,1 | MB-34150: Add test to show Replica promotion at the middle of a snapshot | master | kv_engine | NEW | 0 | -1