MB-34150: Avoid loss of prepared SyncWrites after failover + rollback if replicas>2

    Description

      Summary

      Committed SyncWrites may be lost in a multi-failure scenario if the second failure occurs just after a rollback.

      Status (As of 29th May 2019):

      From Paolo Cocchi's comment below:

      HPS moves at snapshot-boundaries in PassiveDM (MB-34197 closed):

      1. Level Majority and MajorityAndPersistOnMaster Prepares are satisfied (up to the durability-fence, if any) only after the complete snapshot has been received on the PassiveStream
      2. Level PersistToMajority Prepares are satisfied only after the complete snapshot has been persisted

      The next step is to ensure that at Replica promotion "KV Engine sets the branch point as the endpoint of the highest snapshot received" (see the MB description).
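
      The following is a minimal sketch of that proposed promotion-time rule (illustrative C++ only; the type and function names are hypothetical, not the kv_engine interface):

      #include <algorithm>
      #include <cstdint>

      // Hypothetical promotion-time helper.
      struct ReplicaVBucketState {
          uint64_t highestCompleteReceivedSnapEnd;  // complete in memory
          uint64_t highestCompletePersistedSnapEnd; // complete on disk
      };

      uint64_t chooseFailoverBranchPoint(const ReplicaVBucketState& vb) {
          // Today (v6.0) the branch is created at the persisted snapshot:
          //     return vb.highestCompletePersistedSnapEnd;
          // Proposed: use the highest complete snapshot received, so that
          // other replicas are never asked to roll back past Prepares the
          // promoted node already holds in memory.
          return std::max(vb.highestCompleteReceivedSnapEnd,
                          vb.highestCompletePersistedSnapEnd);
      }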

      Currently we are holding off on proceeding with this, as we have found further (possible) issues on the Compaction side (https://docs.google.com/document/d/1tdWCh2KgkFEm1e0pgu_wuupSnRb4xIGsv92CLmVIP8M/edit).

      Details

      Currently (v6.0), when a vBucket is failed over and a replica is promoted to active, a failover table branch is created at the highest complete persisted snapshot. Any other replica which re-connects to this new active is required to roll back to this failover branch point. However, if the new active also fails just after a replica has rolled back past (and discarded) a prepared SyncWrite, that SyncWrite can be lost.
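
      To illustrate why the reconnecting replica is asked to roll back, below is a much-simplified model of the check the new active applies against its failover table (the real DCP stream-request / failover-log negotiation is more involved; names are hypothetical):

      #include <cstdint>
      #include <optional>

      // The newest failover-table entry on the new active, i.e. the branch
      // it created at promotion.
      struct FailoverEntry {
          uint64_t vbUuid;
          uint64_t branchSeqno; // seqno at which the branch was created
      };

      // Returns the seqno the reconnecting replica must roll back to, or
      // nullopt if it can resume from its requested start seqno.
      std::optional<uint64_t> checkRollback(const FailoverEntry& branch,
                                            uint64_t replicaVbUuid,
                                            uint64_t replicaStartSeqno) {
          if (replicaVbUuid == branch.vbUuid &&
              replicaStartSeqno <= branch.branchSeqno) {
              return std::nullopt; // histories agree; stream continues
          }
          // The replica's history extends past (or diverges from) the new
          // branch: roll back to the branch point, discarding everything
          // after it - including any in-memory Prepares.
          return branch.branchSeqno;
      }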

      Consider the following scenario, with 4 nodes (3 replicas), majority=3.

      Setup

      • Active node N0 has snapshots [1:SET, 2:SET], [3:PREPARE(x, level=majority)], [4:COMMIT(x)], vb UUID=AAAA. No mutations have yet been persisted to disk.
      • Replicas N1 & N2 have both received up to seqno 3, but only persisted up to seqno 2.
      • Replica N3 has not received anything yet.

      Scenario

      1. Active crashes, ns_server triggers auto-failover.
      2. ns_server selects N1 as the new active (N2 is identical and could equally have been chosen; it doesn't matter for this scenario).
      3. Upon promotion to active, N1 creates a new failover table entry at its highest (complete) persisted snapshot (seqno 2), with a new vb UUID = BBBB.
      4. N2 re-connects to the new active and performs an ADD_STREAM negotiation. Given the histories diverge after seqno 2 (the new branch point), N2 is told to roll back to seqno=2.
      5. N2 issues the rollback, discarding 3:PREPARE.
      6. N1 crashes, ns_server triggers auto-failover.
      7. ns_server selects N2 as the new active (it has the higher seqno of N2 & N3); however, 3:PREPARE has been lost due to the rollback, breaking the SyncWrite contract.

      (Full details can be found under Q: Replicas Rolling Back to Failover Branch Points Can Lead to Data Loss in the design doc.)
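
      The scenario can be condensed into a small, purely illustrative walk-through (this is not kv_engine code; it only tracks which seqnos each surviving node still holds):

      #include <cassert>
      #include <cstdint>
      #include <set>

      int main() {
          // N1 is about to be promoted; N2 and N3 are the other replicas.
          std::set<uint64_t> n2{1, 2, 3}; // received up to 3, persisted to 2
          std::set<uint64_t> n3{};        // received nothing yet

          // Steps 1-3: N0 crashes; N1 is promoted and creates a failover
          // branch at its highest complete persisted snapshot: seqno 2.
          const uint64_t branchPoint = 2;

          // Steps 4-5: N2 reconnects and is told to roll back to the branch
          // point, discarding everything after it, including 3:PREPARE.
          n2.erase(n2.upper_bound(branchPoint), n2.end());

          // Steps 6-7: N1 crashes; N2 is promoted, but the prepared (and,
          // from the client's perspective, committed) SyncWrite at seqno 3
          // no longer exists on any surviving node.
          assert(n2.count(3) == 0 && n3.count(3) == 0);
          return 0;
      }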

      Solution

      To avoid this problem the following changes are needed:

      1. In addition to the other criteria governing acknowledgement of durable writes, a replica should only acknowledge durable writes once the last mutation in the snapshot has been received.
      2. As a consequence of (1), the high_prepared_seqno will always be at snapshot boundaries (as we don't consider any Prepare satisfied until we have a complete snapshot).
      3. KV Engine should set the branch point as the endpoint of the highest snapshot it’s aware of - not necessarily the persisted snapshot.
      4. As a consequence of point (3), once a complete snapshot has been received we can move the HPS onto Level Majority and MajorityAndPersistOnMaster Prepares up to the durability-fence (if any), but we must wait for the complete snapshot to be persisted before the HPS can cover PersistToMajority Prepares. Otherwise, locally-satisfied PersistToMajority Prepares could be rolled back in certain failure scenarios (similar to the example described above). A sketch of this rule follows this list.
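
      Below is a minimal sketch of one possible reading of points (1)-(4). It is not the DurabilityMonitor/PassiveDM code, it handles a single snapshot at a time, and the type and function names are hypothetical; it only shows the HPS being re-evaluated at the two events described above (complete snapshot received, complete snapshot persisted), with an unsatisfied PersistToMajority Prepare fencing everything after it.

      #include <cstdint>
      #include <vector>

      enum class Level { Majority, MajorityAndPersistOnMaster, PersistToMajority };

      class PassiveHPS {
      public:
          // DCP SnapshotMarker for the snapshot currently being received.
          void receiveSnapshotMarker(uint64_t /*start*/, uint64_t end) {
              snapEnd = end;
          }

          void receivePrepare(uint64_t seqno, Level level) {
              prepares.push_back({seqno, level});
              lastReceived = seqno;
              maybeAdvance();
          }

          void receiveMutation(uint64_t seqno) { // non-durable item / commit
              lastReceived = seqno;
              maybeAdvance();
          }

          void persistedUpTo(uint64_t seqno) {
              lastPersisted = seqno;
              maybeAdvance();
          }

          uint64_t getHPS() const {
              return hps;
          }

      private:
          struct TrackedPrepare {
              uint64_t seqno;
              Level level;
          };

          // Point (1): nothing is satisfied until the snapshot is complete.
          // Point (4): Majority / MajorityAndPersistOnMaster Prepares are
          // covered once the complete snapshot is received; PersistToMajority
          // Prepares only once the complete snapshot is persisted, and an
          // unsatisfied one fences the HPS below it.
          void maybeAdvance() {
              const bool snapReceived = snapEnd > 0 && lastReceived >= snapEnd;
              const bool snapPersisted = snapEnd > 0 && lastPersisted >= snapEnd;
              if (!snapReceived) {
                  return;
              }
              for (const auto& p : prepares) {
                  if (p.seqno <= hps) {
                      continue; // already covered by the HPS
                  }
                  const bool satisfied = (p.level == Level::PersistToMajority)
                                                 ? snapPersisted
                                                 : snapReceived;
                  if (!satisfied) {
                      break; // durability fence: cannot be skipped
                  }
                  hps = p.seqno;
              }
          }

          std::vector<TrackedPrepare> prepares;
          uint64_t snapEnd = 0;
          uint64_t lastReceived = 0;
          uint64_t lastPersisted = 0;
          uint64_t hps = 0;
      };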

      Example Scenarios for HPS updates (points 1, 2 and 4 above)

      Scenario 1 - Two level=majority Prepares

      SNAP(1,2) 1:PRE(majority) 2:PRE(majority)
       
      Replica mem: --S-------1-------2---------------------------
                             x       x
      Today:                 A:1     A:2
       
                                     x
      Tomorrow:                      A:1,2
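
      For example, with the hypothetical PassiveHPS sketch from the Solution section above (snippet only, assuming that sketch is in scope), Scenario 1's "Tomorrow" behaviour would be:

      PassiveHPS vb;
      vb.receiveSnapshotMarker(1, 2);
      vb.receivePrepare(1, Level::Majority);
      // Today the replica could already ack seqno 1 here; tomorrow the HPS
      // stays at 0 until the complete snapshot has been received.
      vb.receivePrepare(2, Level::Majority);
      // Complete snapshot received: HPS moves to 2 ("Tomorrow: A:1,2").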
      

      Scenario 2 - Level=majority, level=persistMajority Prepares

      SNAP(1,2) 1:PRE(majority) 2:PRE(persistMajority)
       
      Replica mem: --S-------1-------2--------------------------
      Replica disk: ------------------------------[1]------[2]--
                             x                              x
      Today:                 A:1                            A:2
       
                                     x                      x
      Tomorrow:                      A:1                    A:2
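
      Again with the hypothetical PassiveHPS sketch from the Solution section (snippet only), the level=persistMajority Prepare acts as a durability fence until the snapshot is persisted:

      PassiveHPS vb;
      vb.receiveSnapshotMarker(1, 2);
      vb.receivePrepare(1, Level::Majority);
      vb.receivePrepare(2, Level::PersistToMajority);
      // Complete snapshot received: HPS = 1 ("Tomorrow: A:1"); seqno 2 is
      // fenced until persistence catches up.
      vb.persistedUpTo(2);
      // Complete snapshot persisted: HPS = 2 ("Tomorrow: A:2").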
      

      Scenario 3 - Level=persistMajority, level=majority Prepares

      SNAP(1,2) 1:PRE(persistMajority) 2:PRE(majority)
       
      Replica mem: --S-------1-------2----------------------------
      Replica disk: -------------------------[1]------[2]---------
                                              x
      Today:                                  A:1,2
       
                                                       x
                                     ^
                earliest point can ACK
      Tomorrow:                                        A:1,2
      

      Scenario 4 - Two level=persistMajority Prepares

      SNAP(1,2) 1:PRE(persistMajority) 2:PRE(persistMajority)
       
      Replica mem: --S-------1-------2--------------------------
      Replica disk: ------------------------------[1]------[2]--
                                                   x        x
      Today:                                       A:1      A:2
       
                                                            x
      Tomorrow:                                             A:1,2
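
      With two level=persistMajority Prepares (again using the hypothetical PassiveHPS sketch; snippet only), nothing is acked until the complete snapshot has been persisted:

      PassiveHPS vb;
      vb.receiveSnapshotMarker(1, 2);
      vb.receivePrepare(1, Level::PersistToMajority);
      vb.receivePrepare(2, Level::PersistToMajority);
      vb.persistedUpTo(1);
      // Snapshot fully received but only partially persisted: HPS still 0.
      vb.persistedUpTo(2);
      // Complete snapshot persisted: HPS = 2 ("Tomorrow: A:1,2").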
      

      Scenario 5 - Complete snapshot received after the level=persistMajority Prepare is persisted to disk

      SNAP(1,3) 1:PRE(majority) 2:PRE(persistMajority), 3:PRE(majority)
       
      Replica mem: --S------1-----2-----------------------3----------
      Replica disk: -----------------[1]----------[2]----------------
                            x                      x      x
      Today:                A:1                    A:2    A:3
       
                                                          ^
                                     earliest point can ACK
                                                          x
      Tomorrow:                                           A:1,3
      

      Scenario 6 - Complete snapshot received before the level=persistMajority Prepare is persisted to disk

      SNAP(1,3) 1:PRE(majority) 2:PRE(persistMajority), 3:PRE(majority)
       
      Replica mem: --S------1-----2------3------------------------
      Replica disk: -----------------[1]----------[2]-------------
                            x                      x
      Today:                A:1                    A:2,3
                                         ^
                    earliest point can ACK
                                         x         x
      Tomorrow:                          A:1       A:2,3
      

            People

              paolo.cocchi Paolo Cocchi
              drigby Dave Rigby (Inactive)