Details
- Improvement
- Resolution: Fixed
- Critical
Description
We have seen situations where the CAS values of documents are set incorrectly (i.e., far in the future). When XDCR replications are set up, these source documents are then replicated to the targets.
On the target side, the design is such that when a target vBucket (VB) receives a document with a "later-in-time" CAS, it fast-forwards the VB's internal CAS to that value. When a target receives a document from a source whose CAS is incorrect and far out of bounds, the result is called CAS-poisoning.
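The fast-forward behavior described above can be illustrated with a minimal sketch. The `vbucket` type and its methods are hypothetical and are not KV engine code; the sketch assumes a simplified hybrid-logical-clock rule in which a local write takes the wall clock if it is ahead of the stored maximum, and otherwise the stored maximum plus one.

```go
package main

import (
	"fmt"
	"time"
)

// vbucket is a hypothetical stand-in for a target vBucket. Only the
// fast-forward rule described in the ticket is modeled.
type vbucket struct {
	maxCAS uint64
}

// receive applies a replicated mutation: a "later-in-time" CAS
// fast-forwards the vBucket's internal CAS to that value.
func (vb *vbucket) receive(cas uint64) {
	if cas > vb.maxCAS {
		vb.maxCAS = cas
	}
}

// nextCAS returns the CAS for a local write (assumed rule): the wall
// clock if it is ahead of maxCAS, otherwise maxCAS + 1.
func (vb *vbucket) nextCAS(now uint64) uint64 {
	if now > vb.maxCAS {
		vb.maxCAS = now
	} else {
		vb.maxCAS++
	}
	return vb.maxCAS
}

func main() {
	now := uint64(time.Now().UnixNano())
	vb := &vbucket{maxCAS: now}

	// A source document whose CAS is roughly ten years in the future.
	poisoned := now + uint64(10*365*24*time.Hour)
	vb.receive(poisoned)

	// One second later, a local write no longer tracks the wall clock.
	local := vb.nextCAS(now + uint64(time.Second))
	fmt.Println("local CAS is beyond the poisoned value:", local > poisoned)
}
```

Under these assumptions, a single poisoned document is enough to detach the target VB's CAS from real time until the wall clock catches up.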
Ideally, there should be CAS-poisoning protection from multiple sides:
- SDKs or clients should perform a sanity check on time, if possible.
- Source XDCR should be improved to avoid sending documents that have a poisoned CAS.
- Target KV could be improved to reject documents whose CAS is far out of bounds (out of scope of this ticket)
- Mobile clients participating in XDCR/SGW replications (to come in the future)
This ticket is to track the enhancement to XDCR to design and implement such a protection mechanism.
The mechanism should involve:
- Ability to turn the protection on or off
- Ability to set "how far ahead in the future is considered out of bounds" for mutations coming from DCP, and to skip replicating them
- Have [persisted] counters for these mutations as part of the replication stats (for debugging)
- (Optional) Ability to log when CAS is poisoned: this could be done after the logging mechanism is implemented (MB-15561), as logging is a problem in itself.
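The listed requirements (an on/off switch, a "how far ahead" threshold, and a counter) can be sketched as follows. This is not the goxdcr implementation: the `casGuard` type, its field names, and `shouldReplicate` are hypothetical, and the sketch assumes a CAS can be compared directly with a wall-clock reading in nanoseconds since the Unix epoch.

```go
package main

import (
	"fmt"
	"time"
)

// casGuard holds hypothetical settings for the protection mechanism.
type casGuard struct {
	enabled  bool          // protection can be turned on or off
	maxAhead time.Duration // how far in the future is still acceptable
	skipped  uint64        // counter of mutations that were filtered out
}

// shouldReplicate reports whether a mutation with the given CAS may be
// sent to the target. Both cas and nowNanos are treated as nanoseconds
// since the Unix epoch (an assumption of this sketch).
func (g *casGuard) shouldReplicate(cas, nowNanos uint64) bool {
	if !g.enabled {
		return true
	}
	limit := nowNanos + uint64(g.maxAhead)
	if cas > limit {
		g.skipped++ // account for the skipped mutation
		return false
	}
	return true
}

func main() {
	g := &casGuard{enabled: true, maxAhead: time.Hour}
	now := uint64(time.Now().UnixNano())

	fmt.Println(g.shouldReplicate(now, now))                         // within bounds
	fmt.Println(g.shouldReplicate(now+uint64(48*time.Hour), now))    // two days ahead
	fmt.Println("skipped:", g.skipped)
}
```

Keeping the threshold as a duration rather than an absolute timestamp means the bound moves with the source node's clock, so the check depends on that clock being sane.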
One thing to note is that preventing CAS poisoning has an unintended consequence: HA could become problematic. A document with an incorrect CAS will be filtered out and thus will not exist on the HA site.
(From one-pager)
Behavior
When a document is considered CAS poisoned, it will be filtered out and not replicated. XDCR will consider the document “CAS_poisoned” and will account for it in a counter stat. XDCR will then move on and continue to replicate other documents tagged with later sequence numbers, in the interest of HA and to keep the pipeline moving. This differs from a guardrail in the traditional sense, where the guardrail temporarily halts all operations until the error is fixed.
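The filter-and-continue behavior, as opposed to a halting guardrail, can be sketched as a loop over a batch. The `mutation` type, the `replicate` function, and the `isPoisoned` predicate are hypothetical names for illustration; how a document is judged poisoned is left to the caller.

```go
package main

import "fmt"

// mutation is a hypothetical stand-in for a DCP mutation.
type mutation struct {
	seqno uint64
	cas   uint64
}

// replicate returns the sequence numbers that would be sent and the
// number of mutations that were filtered out. A poisoned mutation is
// counted and skipped; later sequence numbers are still processed.
func replicate(batch []mutation, isPoisoned func(cas uint64) bool) (sent []uint64, poisoned uint64) {
	for _, m := range batch {
		if isPoisoned(m.cas) {
			poisoned++
			continue // skip this document, do not halt the pipeline
		}
		sent = append(sent, m.seqno)
	}
	return sent, poisoned
}

func main() {
	limit := uint64(1000) // illustrative bound, not a real CAS value
	batch := []mutation{{1, 900}, {2, 5000}, {3, 950}}

	sent, poisoned := replicate(batch, func(cas uint64) bool { return cas > limit })
	fmt.Println(sent, poisoned) // prints "[1 3] 1"
}
```

Sequence number 3 is still replicated even though sequence number 2 was dropped, which is what keeps the pipeline moving.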
Information Persistence
The CAS-poisoned counter will be persisted in XDCR checkpoints and will live in perpetuity until reset. This ensures that breadcrumbs are present if customers report missing data on the target bucket. In the interest of scalability, XDCR will not retain the document keys of the offending documents.
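A minimal sketch of how a counter could survive a restart by travelling with the checkpoint. The `checkpoint` struct, its JSON field names, and the `save`/`resume` helpers are assumptions for illustration and do not reflect the actual XDCR checkpoint format.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// checkpoint is a hypothetical, trimmed-down checkpoint record. Only a
// counter is stored; document keys are deliberately not retained.
type checkpoint struct {
	Seqno          uint64 `json:"seqno"`
	CasPoisonedCnt uint64 `json:"cas_poisoned_cnt"`
}

// save serializes the checkpoint for persistence.
func save(cp checkpoint) ([]byte, error) {
	return json.Marshal(cp)
}

// resume restores a checkpoint, including the persisted counter.
func resume(data []byte) (checkpoint, error) {
	var cp checkpoint
	err := json.Unmarshal(data, &cp)
	return cp, err
}

func main() {
	data, err := save(checkpoint{Seqno: 1200, CasPoisonedCnt: 3})
	if err != nil {
		panic(err)
	}

	cp, err := resume(data)
	if err != nil {
		panic(err)
	}
	// Counting continues from the persisted value rather than from zero.
	cp.CasPoisonedCnt += 2
	fmt.Println(cp.Seqno, cp.CasPoisonedCnt)
}
```

Because the counter is restored on resume, it only returns to zero when it is explicitly reset.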
Notification Methods
When XDCR has seen at least one CAS-poisoned document, or when XDCR resumes from a checkpoint that has persisted a non-zero CAS-poisoned counter, it will notify users in the following ways:
- A UI error/message will be shown on the console. This will also be present in Capella.
- The Prometheus stats will include a “cas_poisoned” (name TBD) counter showing a value > 0.
- In the XDCR logs, the counter will be non-zero.
Issue Links
- backports to
  - MB-62383 [BP 7.6.3] - [XDCR] Safeguard against CAS poisoning (Resolved)
  - MB-62384 [BP 7.2.6] - [XDCR] Safeguard against CAS poisoning (Closed)
- causes
  - MB-62897 XDCR - Deadlock in preCheckCasCheck (Resolved)
  - MB-63005 XDCR - Disable new pipeline cas poisoning check (Resolved)
  - MB-63010 [BP 7.6.3] - XDCR - Disable new pipeline cas poisoning check (Resolved)
  - MB-63009 [BP 7.2.6] - XDCR - Disable new pipeline cas poisoning check (Closed)
- relates to
  - MB-57353 XDCR RAS (Open)
  - MB-61385 Limit vbucket "max_cas" (In Progress)
  - MB-62034 XDCR - Handle target KV CAS poisoning protection (Resolved)
  - MB-55804 Bucket creation should default to CAS based CR (Open)
- links to