We have seen situations where CAS values for documents are set incorrectly (i.e. way in the future). When XDCR replications are set up, these source documents are then replicated to targets.
On the target side, the design is such that when a target VB receives a document with a "later-in-time" CAS, it fast-forwards the VB's internal CAS to that value. When a target receives a document from a source with an incorrect CAS that is way out of bounds, it is called CAS-poisoning.
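To make the fast-forward behavior concrete, here is a minimal sketch of why a single far-future CAS sticks to a VB. The `vbucket` type and both methods are hypothetical and only model the behavior described above; they are not the KV engine's actual code. The `nextCAS` rule (local mutations must get a CAS greater than anything seen so far) is an assumption about how the internal CAS is used.

```go
package main

// vbucket is a hypothetical stand-in for a target VB's CAS state.
type vbucket struct {
	maxCAS uint64 // highest CAS this VB has seen or generated
}

// receive models the fast-forward: if the incoming document carries a
// later CAS than the VB's internal CAS, the VB adopts the incoming value.
func (vb *vbucket) receive(docCAS uint64) {
	if docCAS > vb.maxCAS {
		vb.maxCAS = docCAS
	}
}

// nextCAS models a local mutation (assumption: a new CAS must be strictly
// greater than any CAS the VB has seen). While the clock is behind maxCAS,
// new CAS values are derived from maxCAS rather than from the clock.
func (vb *vbucket) nextCAS(nowNanos uint64) uint64 {
	if nowNanos > vb.maxCAS {
		vb.maxCAS = nowNanos
	} else {
		vb.maxCAS++
	}
	return vb.maxCAS
}
```

In this model, once `receive` has accepted a far-future value, every later `nextCAS` call returns a value based on the poisoned CAS until the real clock catches up.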
Ideally, there should be CAS-poisoning protection from multiple sides:
- SDKs or clients should do a sanity check on time, if possible.
- Source XDCR should be improved to prevent sending documents that have a poisoned CAS.
- Target KV could be improved to reject documents that have a CAS way out of bounds (out of scope of this ticket)
- Mobile clients participating in XDCR/SGW replications (to come in the future)
This ticket is to track the enhancement to XDCR to design and implement such a protection mechanism.
The mechanism should involve:
- Ability to turn on/off such a protection
- Ability to set "how far ahead in the future is considered out of bounds" for mutations coming from DCP, and to skip replicating them
- Have [persisted] counters for these mutations as part of the replication stats (for debugging)
- (Optional) Ability to log when CAS is poisoned: This could be done after the logging mechanism is implemented (MB-15561), as logging is a problem in itself.
One thing to note is that an unintended consequence of preventing CAS poisoning is that HA could become problematic: a document that has an incorrect CAS set will be filtered out and thus will not exist on the HA site.