Details

Type: Improvement
Resolution: Fixed
Priority: Critical
Fix Version/s: Morpheus
Affects Version/s: None
Component/s: XDCR
Labels:
- documentation
- releasenote

Story Points:
0

Description

We have seen situations where CAS values for documents are set incorrectly (i.e., far in the future). When XDCR replications are set up, these source documents are then replicated to the targets.

On the target side, the design is such that when a target vBucket (VB) receives a document with a "later-in-time" CAS, it fast-forwards the VB's internal CAS to that value. When a target receives from a source a document whose incorrect CAS is far out of bounds, this is called CAS poisoning.
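To make the fast-forward effect concrete, here is a minimal Go sketch. It is a simplified model and not KV-engine code: the `vbucket` type, its `maxCAS` field, and the `receive` and `nextCAS` methods are all hypothetical names, and the rule that a local write is stamped at least one past the highest CAS seen is an assumption of the model.

```go
package main

// vbucket is a hypothetical, simplified model of a target vBucket that
// tracks only the highest CAS it has seen so far.
type vbucket struct {
	maxCAS uint64
}

// receive models the fast-forward rule: an incoming document whose CAS is
// later than the vBucket's own pulls the internal CAS forward to that value.
func (vb *vbucket) receive(docCAS uint64) {
	if docCAS > vb.maxCAS {
		vb.maxCAS = docCAS
	}
}

// nextCAS models a local mutation (an assumption of this sketch): the new
// CAS is the current time if that is ahead of everything seen so far,
// otherwise one past the highest CAS seen.
func (vb *vbucket) nextCAS(now uint64) uint64 {
	if now > vb.maxCAS {
		vb.maxCAS = now
	} else {
		vb.maxCAS++
	}
	return vb.maxCAS
}
```

In this model, once `receive` has accepted a far-future CAS, every later `nextCAS` call returns a value just past it regardless of the real clock, which is why a single poisoned document affects the whole vBucket.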

Ideally, there should be CAS-poisoning protection from multiple sides:

- SDKs or clients should sanity-check the time, if possible.
- Source XDCR should be improved to avoid sending documents that have a poisoned CAS.
- Target KV could be improved to reject documents whose CAS is far out of bounds (out of scope of this ticket).
- Mobile clients participating in XDCR/SGW replications (to come in the future).

This ticket is to track the enhancement to XDCR to design and implement such a protection mechanism.

The mechanism should involve:

- Ability to turn the protection on or off
- Ability to set "how far ahead in the future is considered out of bounds" for mutations coming from DCP, and to skip replicating those mutations
- [Persisted] counters for the skipped mutations as part of the replication stats (for debugging)
- (Optional) Ability to log when a CAS is poisoned. This could be done after the logging mechanism is implemented (MB-15561), as logging is a problem in itself.
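The first three requirements can be sketched in Go as follows. This is an illustration under stated assumptions, not the goxdcr implementation: `casGuard`, `shouldReplicate`, and `skipped` are hypothetical names, and the sketch assumes a CAS can be compared directly against a wall-clock time expressed in nanoseconds since the Unix epoch.

```go
package main

import (
	"sync/atomic"
	"time"
)

// casGuard is a hypothetical sketch of the proposed protection.
type casGuard struct {
	skippedCnt uint64        // mutations skipped so far; kept first for 64-bit atomic alignment
	enabled    bool          // on/off switch for the protection
	maxDrift   time.Duration // how far ahead of "now" a CAS may be; assumed non-negative
}

// shouldReplicate reports whether a mutation may be sent to the target.
// Assumption: cas and nowNanos are both nanoseconds since the Unix epoch.
func (g *casGuard) shouldReplicate(cas, nowNanos uint64) bool {
	if !g.enabled {
		return true
	}
	if cas > nowNanos+uint64(g.maxDrift) {
		atomic.AddUint64(&g.skippedCnt, 1) // count it, then skip it
		return false
	}
	return true
}

// skipped returns how many mutations the guard has filtered out.
func (g *casGuard) skipped() uint64 {
	return atomic.LoadUint64(&g.skippedCnt)
}
```

With `maxDrift` set to 10 seconds, a CAS stamped a day ahead of `nowNanos` would be skipped and counted, while a guard with `enabled` set to false lets every mutation through and counts nothing.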

Note that preventing CAS poisoning has an unintended consequence for HA: a document with an incorrect CAS will be filtered out and thus will not exist on the HA site.

(From one-pager)
Behavior
When a document is considered CAS-poisoned, it is filtered out and not replicated. XDCR marks the document as “CAS_poisoned” and accounts for it in a counter stat. XDCR then moves on and continues to replicate other documents tagged with later sequence numbers, in the interest of HA and to keep the pipeline moving. This differs from a guardrail in the traditional sense, which temporarily halts all operations until the error is fixed.
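The skip-and-continue behavior, as opposed to halting, can be sketched in Go as below. The `mutation` type, the `filterPoisoned` function, and the precomputed `casLimit` parameter are hypothetical and only illustrate the control flow described above.

```go
package main

// mutation is a hypothetical stand-in for a mutation received from DCP.
type mutation struct {
	seqno uint64
	key   string
	cas   uint64
}

// filterPoisoned returns the mutations that are safe to replicate, in their
// original order, plus the number that were skipped. A poisoned mutation is
// dropped and counted, but it never stops the loop.
func filterPoisoned(batch []mutation, casLimit uint64) (out []mutation, skipped uint64) {
	for _, m := range batch {
		if m.cas > casLimit {
			skipped++
			continue // skip this one and keep the pipeline moving
		}
		out = append(out, m)
	}
	return out, skipped
}
```

A traditional guardrail would return an error at the first poisoned mutation; here the mutations with later sequence numbers still go out, and only the count of skipped ones is kept.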
Information Persistence
The CAS-poisoned counter will be persisted in XDCR checkpoints and live in perpetuity until reset. This ensures that bread crumbs are present if customers report missing data on the target bucket. In the interest of scalability, XDCR will not retain the document keys of the offending documents.
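As a rough illustration of carrying the counter across restarts, the Go sketch below round-trips a checkpoint through JSON. The `checkpoint` type, its field names, and the JSON layout are assumptions made for this example and do not reflect the actual XDCR checkpoint format.

```go
package main

import "encoding/json"

// checkpoint is a hypothetical record. It stores only a count of
// CAS-poisoned mutations, never the keys of the offending documents.
type checkpoint struct {
	Seqno          uint64 `json:"seqno"`
	CASPoisonedCnt uint64 `json:"cas_poisoned_cnt"`
}

// save serializes a checkpoint for persistence.
func save(cp checkpoint) ([]byte, error) {
	return json.Marshal(cp)
}

// resume restores a checkpoint, so the counter continues from its persisted
// value instead of restarting at zero.
func resume(data []byte) (checkpoint, error) {
	var cp checkpoint
	err := json.Unmarshal(data, &cp)
	return cp, err
}
```

After `resume`, any newly skipped mutation would increment `CASPoisonedCnt` from the restored value, so the count stays cumulative until someone resets it.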
Notification Methods
When XDCR has seen at least one CAS-poisoned document, or when XDCR resumes from a checkpoint that has persisted a counter indicating CAS-poisoned documents, it will notify users in the following ways:
- A UI error/message will be shown on the console. This will also be present in Capella.
- The Prometheus stats will include a “cas_poisoned” (TBD) counter that shows a value > 0.
- In the XDCR logs, the counter will be non-zero.

Attachments

Issue Links

backports to

MB-62383 [BP 7.6.3] - [XDCR] Safeguard against CAS poisoning

Resolved

MB-62384 [BP 7.2.6] - [XDCR] Safeguard against CAS poisoning

Resolved

causes

MB-62897 XDCR - Deadlock in preCheckCasCheck

Open

relates to

MB-57353 XDCR RAS

Open

MB-61385 Limit vbucket "max_cas"

In Progress

MB-62034 XDCR - Handle target KV CAS poisoning protection

Reopened

MB-55804 Bucket creation should default to CAS based CR

Open

links to

One-Pager

Gerrit Reviews

For Gerrit Dashboard: MB-55758
#	Subject	Branch	Project	Status	CR	V
204313,11	MB-55758: cas poisoning live pipeline guardrail	master	goxdcr	MERGED	+2	+1
204461,19	MB-55758: Pre-replication CAS poison check	master	goxdcr	MERGED	+2	+1
211538,4	MB-55758: change live guardrail from hours to secs	master	goxdcr	MERGED	+2	+1
212171,2	MB-55758: Restore changes from Gerrit change 210073	master	goxdcr	MERGED	+2	+1
212521,2	MB-55758: fix unit test due to change from hour to sec	master	goxdcr	MERGED	+2	+1
212542,1	MB-55758: fix unit test	master	goxdcr	ABANDONED	0	0