  1. Couchbase Server
  2. MB-55758

[XDCR] Safeguard against CAS poisoning

Details

    Description

      We have seen situations where CAS values for documents are set incorrectly (i.e. far in the future). When XDCR replications are set up, these source documents are then replicated to targets.

      On the target side, the design is such that when a target VB receives a document with a "later-in-time" CAS, it fast-forwards the VB's internal CAS to that value. When a target receives a document from a source whose CAS is incorrect and far out of bounds, this is called CAS poisoning.
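
To illustrate why a single bad CAS is harmful, here is a toy model of the fast-forward behavior described above. It is a sketch only: `VBClock` and its methods are hypothetical names, not KV engine code, and the rule that new local mutations never receive a CAS below the highest CAS seen is an assumption based on how hybrid logical clocks usually behave.

```go
package main

// VBClock is a toy model of a vBucket's internal CAS clock.
// Hypothetical type for illustration; not taken from the KV engine.
type VBClock struct {
	maxCas uint64 // highest CAS seen or issued so far
}

// Observe fast-forwards the clock when an incoming (replicated) document
// carries a later CAS than anything seen so far. Older values are ignored.
func (c *VBClock) Observe(incomingCas uint64) {
	if incomingCas > c.maxCas {
		c.maxCas = incomingCas
	}
}

// Next returns the CAS for a new local mutation. Assumption: the result is
// never below the highest CAS seen, so once the clock has been poisoned the
// wall clock no longer has any influence until it catches up.
func (c *VBClock) Next(wallClockNanos uint64) uint64 {
	if wallClockNanos > c.maxCas {
		c.maxCas = wallClockNanos
	} else {
		c.maxCas++
	}
	return c.maxCas
}
```

In this model, one `Observe` call with a far-future CAS makes every later `Next` call return a value just past that CAS, regardless of the real time. The target then passes these CAS values on to any cluster it replicates to.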

      Ideally, there should be CAS-poisoning protection from multiple sides:

      1. SDKs or clients should sanity-check the time, if possible.
      2. Source XDCR should be improved to avoid sending documents that have a poisoned CAS.
      3. Target KV could be improved to reject documents whose CAS is far out of bounds (out of scope of this ticket).
      4. Mobile clients participating in XDCR/SGW replications (to come in the future).

      This ticket is to track the enhancement to XDCR to design and implement such a protection mechanism.

      The mechanism should involve:

      1. Ability to turn the protection on or off
      2. Ability to set how far ahead in the future a CAS must be to count as out of bounds for mutations coming from DCP, and to skip replicating those mutations
      3. [Persisted] counters for the skipped mutations as part of the replication stats (for debugging)
      4. (Optional) Ability to log when a CAS is poisoned. This could be done after the logging mechanism is implemented (MB-15561), as logging is a problem in itself.
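
The first three requirements can be sketched as a small source-side guard. This is a minimal sketch, not the goxdcr implementation: `CasGuardSettings`, `CasGuard` and their methods are hypothetical names. It also assumes the CAS can be compared directly against a wall-clock time in nanoseconds since the Unix epoch.

```go
package main

import (
	"sync/atomic"
	"time"
)

// CasGuardSettings is a hypothetical per-replication setting pair.
type CasGuardSettings struct {
	Enabled  bool          // requirement 1: turn the protection on or off
	MaxDrift time.Duration // requirement 2: how far ahead of "now" a CAS may be
}

// CasGuard decides whether a mutation should be skipped and counts the skips.
type CasGuard struct {
	poisoned uint64 // requirement 3: counter, accessed atomically
	settings CasGuardSettings
}

func NewCasGuard(s CasGuardSettings) *CasGuard {
	return &CasGuard{settings: s}
}

// ShouldSkipAt reports whether cas lies more than MaxDrift ahead of nowNanos.
// Assumption: cas is nanoseconds since the Unix epoch. Taking "now" as a
// parameter keeps the check deterministic and easy to test.
func (g *CasGuard) ShouldSkipAt(cas uint64, nowNanos int64) bool {
	if !g.settings.Enabled {
		return false
	}
	limit := nowNanos + int64(g.settings.MaxDrift)
	if limit >= 0 && cas > uint64(limit) {
		atomic.AddUint64(&g.poisoned, 1)
		return true
	}
	return false
}

// ShouldSkip applies the same check against the local wall clock.
func (g *CasGuard) ShouldSkip(cas uint64) bool {
	return g.ShouldSkipAt(cas, time.Now().UnixNano())
}

// PoisonedCount returns how many mutations have been skipped so far.
func (g *CasGuard) PoisonedCount() uint64 {
	return atomic.LoadUint64(&g.poisoned)
}
```

A replication pipeline would call `ShouldSkip` for each mutation received from DCP and drop the mutation when it returns true. With `Enabled` set to false the guard never skips and never counts.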

      One thing to note is that preventing CAS poisoning has an unintended consequence: HA could be affected. A document with an incorrect CAS will be filtered out and thus will not exist on the HA site.

      (From one-pager)
      Behavior
      When a document is considered CAS poisoned, it will be filtered out and not replicated. XDCR will consider the document “CAS_poisoned” and will account for it in a counter stat. In the interest of HA and to keep the pipeline moving, XDCR will move on and continue to replicate other documents that are tagged with later sequence numbers. This is different from a guardrail in the traditional sense, where the guardrail temporarily halts all operations until the error is fixed.
      Information Persistence
      The CAS-poisoned counter will be persisted in XDCR checkpoints and live in perpetuity until reset. This ensures that breadcrumbs are present if customers report missing data on the target bucket. In the interest of scalability, XDCR will not retain the document keys of the offending documents.
      Notification Methods
      When XDCR has seen at least one CAS-poisoned document, or when XDCR resumes from a checkpoint that has persisted a non-zero CAS-poisoned counter, it will notify users in the following ways:
      1. A UI error/message will be shown on the console. This will also be present in Capella.
      2. The Prometheus stats will include a “cas_poisoned” (name TBD) counter that shows a value > 0.
      3. In the XDCR logs, the counter will be non-zero.
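
The persistence and notification rules above can be sketched as a counter that travels with the checkpoint. This is a sketch under assumptions: the `Checkpoint` struct, its JSON field names, and the helper functions are hypothetical and do not reflect the actual XDCR checkpoint format.

```go
package main

import "encoding/json"

// Checkpoint is a hypothetical, heavily reduced checkpoint record.
// Only the counter is stored; document keys are deliberately not retained.
type Checkpoint struct {
	Seqno          uint64 `json:"seqno"`
	CasPoisonedCnt uint64 `json:"cas_poisoned_cnt"`
}

// SaveCheckpoint serializes the resume point together with the counter.
func SaveCheckpoint(seqno, poisoned uint64) ([]byte, error) {
	return json.Marshal(Checkpoint{Seqno: seqno, CasPoisonedCnt: poisoned})
}

// LoadCheckpoint restores a checkpoint. A record written without the
// counter field decodes with the counter at zero.
func LoadCheckpoint(data []byte) (Checkpoint, error) {
	var c Checkpoint
	err := json.Unmarshal(data, &c)
	return c, err
}

// NeedsNotification reports whether users should be alerted, either because
// a poisoned document was just seen or because the replication resumed from
// a checkpoint that already carried a non-zero counter.
func NeedsNotification(c Checkpoint) bool {
	return c.CasPoisonedCnt > 0
}
```

Because the counter is part of the checkpoint, a restarted replication continues from the persisted value, so the notification condition survives restarts until the counter is reset.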

            People

              ayush.nayyar Ayush Nayyar
              neil.huang Neil Huang