Details
Bug
Resolution: Fixed
Critical
4.6.0
Untriaged
Yes
Description
Whilst reading a single DCP stream (one vbucket), the CAS can go backwards. This poses a problem for LWW XDCR configurations running in an active/passive setup.
The passive (backup) cluster falls out of sync with the active cluster if a new mutation (higher seqno than the old version) arrives with a CAS that has gone backwards: the new mutation is rejected, leaving the backup inconsistent with the active cluster. Reading the document may then return a different version from each cluster. The rev-id conflict resolution mode, however, is protected against this (hence marking this MB as a regression).
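To make the rejection concrete, here is a minimal sketch of the scenario. It assumes a simplified LWW rule (accept the incoming mutation only if its CAS is strictly greater than the stored one); `Mutation`, `apply_lww` and `replicate` are illustrative names, not ep-engine or XDCR code, and the seqno/CAS values are made up.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Mutation:
    seqno: int  # order in which the active cluster queued the mutation
    cas: int    # HLC timestamp used by LWW conflict resolution
    value: str


def apply_lww(current: Optional[Mutation], incoming: Mutation) -> Mutation:
    """Simplified LWW (assumption): the mutation with the higher CAS wins."""
    if current is None or incoming.cas > current.cas:
        return incoming
    return current  # incoming mutation is rejected


def replicate(stream: Iterable[Mutation]) -> Optional[Mutation]:
    """Apply a DCP-ordered stream to an initially empty passive cluster."""
    doc = None
    for mutation in stream:
        doc = apply_lww(doc, mutation)
    return doc


# Same document, two versions; the CAS goes backwards at seqno 2.
stream = [Mutation(seqno=1, cas=200, value="v1"),
          Mutation(seqno=2, cas=100, value="v2")]

active = max(stream, key=lambda m: m.seqno)  # active keeps the latest seqno
passive = replicate(stream)                  # passive resolves by CAS

print(active.value, passive.value)  # v2 v1
```

The active cluster ends up holding v2 while the passive cluster keeps v1, which is the inconsistency described above.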
There may be other effects of the CAS going backwards, but this active/passive issue is the easiest to understand.
This bug occurs because ep-engine assigns each mutation an HLC timestamp (CAS) before the mutation is added to the checkpoint, which is the point at which it is assigned its sequence number. Because the two steps are separate, n threads updating the same document can add their mutations to the checkpoint manager with the CAS values out of order.
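The interleaving can be illustrated with a small deterministic model. This is a sketch only: `Checkpoint`, `take_cas` and `queue` are hypothetical names standing in for the two separate steps described above, and a plain counter stands in for the HLC.

```python
import itertools


class Checkpoint:
    """Toy model: CAS and seqno are handed out in two separate steps."""

    def __init__(self):
        self._clock = itertools.count(1)  # stand-in for the HLC
        self._next_seqno = 1
        self.items = []                   # (seqno, cas) in queue order

    def take_cas(self):
        # Step 1: the mutation gets its CAS.
        return next(self._clock)

    def queue(self, cas):
        # Step 2: the mutation is queued and gets its seqno.
        seqno = self._next_seqno
        self._next_seqno += 1
        self.items.append((seqno, cas))
        return seqno


cp = Checkpoint()

# Two writers updating the same document, interleaved by hand:
cas_a = cp.take_cas()  # writer A gets the older CAS
cas_b = cp.take_cas()  # writer B gets the newer CAS
cp.queue(cas_b)        # B reaches the checkpoint first
cp.queue(cas_a)        # A arrives second, with the older CAS

cas_in_seqno_order = [cas for _, cas in cp.items]
print(cas_in_seqno_order)  # [2, 1]
```

Read in seqno order, the CAS drops from 2 to 1, which is what a DCP client would observe on the stream.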
This issue has been reproduced using a modified ep-engine perfsuite test: multiple threads write to the same document while a single DCP client thread monitors the CAS.
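The check performed by the monitoring side can be sketched as follows. This is not the perfsuite test itself; `find_cas_regressions` is a hypothetical helper and the stream of `(key, seqno, cas)` tuples is invented sample data.

```python
def find_cas_regressions(stream):
    """Return (key, previous, current) for every item whose CAS is lower
    than the CAS of the previous item seen for the same key.

    `stream` yields (key, seqno, cas) tuples in the order a DCP client
    would receive them, i.e. in increasing seqno order.
    """
    last_seen = {}     # key -> (seqno, cas) of the previous item
    regressions = []
    for key, seqno, cas in stream:
        previous = last_seen.get(key)
        if previous is not None and cas < previous[1]:
            regressions.append((key, previous, (seqno, cas)))
        last_seen[key] = (seqno, cas)
    return regressions


sample = [
    ("doc", 1, 100),
    ("doc", 2, 300),
    ("doc", 3, 200),   # CAS went backwards relative to seqno 2
    ("other", 4, 50),  # different key, so not a regression
]

regressions = find_cas_regressions(sample)
print(regressions)  # [('doc', (2, 300), (3, 200))]
```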