  1. Couchbase Server
  2. MB-59888

[BP 7.2.4] - XDCR doesn't handle a locked CAS from GetMeta



    Description

      Consider the following scenario:

      1. An LWW (last-write-wins) XDCR replication from source S to destination D is established and is performing pessimistic conflict resolution.
      2. A GetLocked operation occurs on the destination bucket D against document X.
      3. A mutation X' occurs on the source bucket S while the document is still locked at D (the maximum lock duration is 30s).
      4. When XDCR processes the mutation and issues a GetMeta against the destination, it receives a CAS of 0xffff_ffff_ffff_ffff because the document is locked. XDCR then incorrectly considers the destination document newer and does not replicate X'.

      This results in the logically newer mutation X' never being replicated to the destination. (A subsequent mutation X'' would be replicated, as long as it occurs after the lock against X on D has expired - and the document hasn't been locked again).
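
      The failure in step 4 can be illustrated with a minimal sketch. It assumes a simplified LWW check that compares CAS values only; the function names and the example CAS values are hypothetical and are not taken from the XDCR code base.

```python
# Sentinel CAS that GetMeta returns for a locked document, per the description above.
LOCKED_CAS = 0xFFFF_FFFF_FFFF_FFFF


def source_wins_lww(source_cas: int, dest_cas: int) -> bool:
    """Simplified LWW check (assumption): the source wins only with a strictly higher CAS."""
    return source_cas > dest_cas


def should_replicate_naive(source_cas: int, dest_cas_from_getmeta: int) -> bool:
    """Pessimistic check that trusts whatever CAS GetMeta returned, locked or not."""
    return source_wins_lww(source_cas, dest_cas_from_getmeta)


# Hypothetical values: X' on the source is logically newer than X on the destination.
real_dest_cas = 1_700_000_000_000_000_000
source_cas = real_dest_cas + 1_000_000

# Unlocked: GetMeta returns the real CAS, so X' is replicated.
print(should_replicate_naive(source_cas, real_dest_cas))

# Locked: GetMeta returns the sentinel, which no real CAS can exceed, so X' is skipped.
print(should_replicate_naive(source_cas, LOCKED_CAS))
```

      Because the sentinel is the largest 64-bit value, any comparison against it makes the destination look newer, whatever the real CAS is.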

      Potential solutions

      I can think of two possible ways to handle the case where XDCR's GetMeta finds a locked document:

      • (A) Fall back to optimistic replication - KV-Engine internally "knows" the underlying CAS value and will resolve the conflict correctly (once MB-59746 is also addressed)
      • (B) Retry until the document is unlocked
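
      As a rough illustration of option (B), the sketch below retries GetMeta with exponential backoff until the returned CAS is no longer the locked sentinel. The helper name, the get_meta callable, and the backoff values are all assumptions for illustration; only the sentinel value and the 30s maximum lock duration come from the description above.

```python
import time
from typing import Callable, Optional

LOCKED_CAS = 0xFFFF_FFFF_FFFF_FFFF  # CAS reported by GetMeta while the document is locked


def wait_for_unlocked_cas(
    get_meta: Callable[[], int],          # hypothetical: returns the destination CAS
    max_wait_s: float = 30.0,             # max lock duration quoted in the scenario
    initial_delay_s: float = 0.05,        # assumed backoff parameters
    max_delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Optional[int]:
    """Return the first non-sentinel CAS, or None if the document stays locked too long."""
    deadline = clock() + max_wait_s
    delay = initial_delay_s
    while True:
        cas = get_meta()
        if cas != LOCKED_CAS:
            return cas
        remaining = deadline - clock()
        if remaining <= 0:
            return None  # still locked; the caller decides what to do next
        sleep(min(delay, remaining))
        delay = min(delay * 2, max_delay_s)


# Usage with a fake GetMeta that reports "locked" twice, then a real CAS.
responses = iter([LOCKED_CAS, LOCKED_CAS, 1_700_000_000_000_000_000])
cas = wait_for_unlocked_cas(lambda: next(responses), sleep=lambda _s: None)
print(hex(cas))
```

      What to do when the wait times out (for example, if the document is locked again immediately) is left to the caller here, since the issue does not specify it.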

      I'll let Neil Huang comment on these approaches - or perhaps he has a better solution.


      Neil Huang's comment below:
      I’m currently leaning towards (B) as the way to deal with a locked document on the remote bucket.

      There are a few reasons for this thought process:

      1. Ideally, we want to respect the target document’s locking mechanism. If a doc is sent optimistically and ultimately fails conflict resolution (the locked doc was actually updated, so the setWithMeta eventually lost), source XDCR would have lost the ability to track that lost conflict.
      2. Customers are becoming increasingly aware of conflicts and more interested in learning about any conflicts that occur (see MB-58989).
      3. For HLV-based replication (i.e. XDCR/Mobile, MB-57921 and MB-58989), we are moving toward a model in which source XDCR performs Set/SetWithMeta with optimistic locking, especially since setting and updating the HLV should ensure that locking takes place.
      4. From performance tests during the investigation of MB-52947, we know that the performance gain from optimistic replication is nice to have but not a must-have, particularly when weighed against the visibility and functional gains of doing things pessimistically.

      It may be worth starting to move in that direction, since it is more aligned with the future work to come.

            People

              ayush.nayyar Ayush Nayyar
              neil.huang Neil Huang