Details
-
Bug
-
Resolution: Won't Fix
-
Major
-
4.6.0
-
Untriaged
-
No
Description
Our HLC is incorrect, it skips a significant part of the algorithm as specified in the academic paper and without change, we don't actually perform 'LWW' in some time-sync scenarios.
In our implementation, ep-engine only performs the sending/local event portion of the HLC algorithm.
When a mutation occurs we generate a new HLC and assign that as the CAS, effectively we do:
// This code is uint64_t VBucket::nextHLCCas()
|
HLC = max(maxHLC, clock()); // ignoring the bit mask part
|
if (HLC > maxCAS)
|
maxCAS = HLC
|
else
|
HLC++
|
maxCAS=HLC
|
fi
|
mutation.CAS = HLC
|
On the receive side (i.e. setWithMeta - SWM) we do very little.
- We perform LWW conflict resolution using the CAS of existing doc and the CAS from the SWM.
- If the document is not rejected do:
// This is void VBucket::setMaxCas(uint64_t newCas)
|
if (CAS > maxCAS)
|
maxCAS = CAS.
|
Thus we are broken in following scenario.
Consider three nodes, A, B and C.
- A replicates to B
- C replicates to B
- Thus B performs conflict resolution
- Assume that A has a local clock ahead of B/C.
- Assume that B and C are in sync or very close.
Next the very first mutation(key=K) occurs against node A, so A will perform a setWithMeta(K) against B.
In our implementation of HLC, when A performs the first SWM to B, and the document is accepted we only do:
// setMaxCas
|
if (CAS > maxHLC)
|
maxHLC = CAS.
|
So we have accepted the first SWM (correctly) and store a document with a CAS generated by node A.
However, any SWM that now occurs from C to B will be rejected as it is generating time-stamps using its local clock, and that clock is behind A. Until the clock of C catches up with the CAS stored for document K, all of its writes are incorrectly rejected, globally he is the last to write, but our implementation orders A before C.
This issue is illustrated visually on the following diagram. In this diagram, node A is 10 ticks ahead of B/C. When the mutation(K) occurs at A's time of 22, the SWM will force B's maxHLC forward to 22, now every single SWM(K) from C to B will be rejected until C's clock overtakes 22.
However, the HLC algorithm, when implemented properly does not trigger this issue as there is a whole part of the algorithm required when receiving an event that may involve re-stamping the events time with a new HLC.
The paper is linked from here, but this summary page is enough:
The HLC algorithm shows that when a node receives an event, it may need to recompute a new HLC to allow ordering.
In our case when the high clock of A is dominating B, we should be recomputing the time-stamps coming from C using the logical-clock part of the algorithm. Before doing conflict-resolution a new CAS would be created that is +1 of the clock and we then correctly accept that later write.