Details
- Type: Bug
- Resolution: Unresolved
- Priority: Major
- Affects Version/s: 6.5.2, 6.6.4, 7.0.3, 7.1.0
- Triage: Triaged
Description
As touched upon in MB-51639, the current handling of the HPS for disk snapshots is not ideal and can cause us to roll back potentially substantial amounts of non-durable writes that we do not actually need to lose.
When choosing which replica to promote to active during a failover, ns_server first checks the HPS values of all replicas and selects the node(s) with the highest value. The high seqno is then checked only to narrow down the set of nodes eligible for promotion.

When a replica receives a disk checkpoint (backfill), completed prepares are not included in the snapshot, but the HPS must still be moved to ensure that the node is eligible for promotion if it has the most data. Currently the HPS value is set to the snapshot end. That is correct with respect to our data-consistency guarantees, which only apply to durable writes (the writes that move the HPS), but it may cause unnecessary rollbacks (and loss of non-durable writes) due to poor replica selection.
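The selection order can be illustrated with a small sketch. This is not ns_server's actual code (which lives in Erlang); ReplicaState and pickReplicaToPromote are hypothetical names used purely to show "highest HPS wins, high seqno only breaks ties":

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <tuple>
#include <vector>

// Hypothetical view of what each replica reports for failover purposes.
struct ReplicaState {
    std::string node;
    uint64_t hps;       // high prepared seqno
    uint64_t highSeqno; // highest seqno held by the replica
};

// Pick the replica to promote: compare by HPS first, and only use the
// high seqno as a tie-breaker between replicas with equal HPS.
ReplicaState pickReplicaToPromote(const std::vector<ReplicaState>& replicas) {
    return *std::max_element(
            replicas.begin(), replicas.end(),
            [](const ReplicaState& a, const ReplicaState& b) {
                return std::tie(a.hps, a.highSeqno) <
                       std::tie(b.hps, b.highSeqno);
            });
}
```

The following scenario shows how this ordering, combined with setting the HPS to the disk snapshot end, goes wrong: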
- Create a cluster with 2 replicas
- Perform a SyncWrite and ensure it is on both replicas (HPS = 1 on both replicas, high seqno = 2)
- Swap rebalance one of the replicas (new replica has HPS = 2, high seqno = 2; existing replica has HPS = 1)
- Perform normal writes that only make it to the existing replica (new replica has HPS = 2, high seqno = 2; existing replica has HPS = 1, high seqno = 1000)
- Failover the active; due to the HPS values, the new replica with HPS = 2, high seqno = 2 is promoted over the existing replica with HPS = 1, high seqno = 1000, which is logically ahead (see the sketch after this list)
- The existing replica reconnects to the new active and rolls back the writes from seqno 2 to 1000
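Plugging the end state of that scenario into the sketch above (reusing the hypothetical ReplicaState and pickReplicaToPromote, again purely for illustration) reproduces the bad promotion decision:

```cpp
#include <cassert>

int main() {
    // Final state from the scenario: the existing replica is logically
    // ahead (high seqno 1000) but has the lower HPS.
    std::vector<ReplicaState> replicas = {
            {"existing-replica", /*hps*/ 1, /*highSeqno*/ 1000},
            {"new-replica", /*hps*/ 2, /*highSeqno*/ 2},
    };

    // The freshly backfilled replica wins on HPS alone, so the non-durable
    // writes above seqno 2 on the existing replica end up rolled back.
    assert(pickReplicaToPromote(replicas).node == "new-replica");
    return 0;
}
```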
This issue occurs because we always choose the replica with the higher HPS value, and the high seqno is only used as a tie-breaker. That behaviour alone is fine, but it is not ideal when combined with how we currently handle the HPS value for disk snapshots.
Potential solution:
Send an HPS value in disk snapshot markers, similar to the HCS value that we currently send, in order to propagate a "correct" value around the cluster.
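A rough sketch of what the receiving side could look like, assuming a disk snapshot marker extended with an optional HPS field alongside the existing HCS; the struct and field names below are assumptions for illustration and do not reflect the actual DCP wire format:

```cpp
#include <cstdint>
#include <optional>

// Illustrative marker contents only; not the real DCP snapshot marker layout.
struct DiskSnapshotMarker {
    uint64_t startSeqno;
    uint64_t endSeqno;
    std::optional<uint64_t> highCompletedSeqno; // HCS, already sent today
    std::optional<uint64_t> highPreparedSeqno;  // proposed HPS field
};

// On receiving a disk snapshot, the replica would adopt the HPS carried in
// the marker when present, instead of unconditionally moving its HPS to the
// snapshot end as it does today.
uint64_t resolveReplicaHps(const DiskSnapshotMarker& marker) {
    return marker.highPreparedSeqno.value_or(marker.endSeqno);
}
```

With such a field, the new replica in the scenario above would end up with HPS = 1 rather than 2, so both replicas would tie on HPS and the existing replica (high seqno = 1000) would win the high-seqno tie-break.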