[BP 7.2.4] - XDCR doesn't handle a locked CAS from GetMeta

Description

Consider the following scenario:

  1. A LWW XDCR relation from source S to destination D is established, and is performing pessimistic conflict resolution.

  2. A GetLocked operation occurs on the destination Bucket D against doc X

  3. A Mutation X' occurs on the source bucket S while the document is still locked at D (max lock duration is 30s)

  4. When XDCR processes the mutation and issues a GetMeta against the destination, the document is still locked, so XDCR receives a CAS of 0xffff_ffff_ffff_ffff. It then incorrectly considers the destination document newer and does not replicate X'.

This results in the logically newer mutation X' never being replicated to the destination. (A subsequent mutation X'' would be replicated, as long as it occurs after the lock against X on D has expired - and the document hasn't been locked again).
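
The scenario above can be illustrated with a minimal sketch. It assumes a simplified LWW rule that compares CAS values only, and the names `LOCKED_CAS` and `source_wins_lww` are hypothetical, not goxdcr identifiers; the CAS values 100 and 200 are made up for illustration.

```python
# Sentinel CAS that GetMeta reports for a locked document, per the description above.
LOCKED_CAS = 0xFFFF_FFFF_FFFF_FFFF


def source_wins_lww(source_cas: int, target_cas: int) -> bool:
    """Simplified LWW check (assumption): replicate only if the source CAS is larger."""
    return source_cas > target_cas


# X on destination D has CAS 100; mutation X' on source S has CAS 200.
unlocked_result = source_wins_lww(source_cas=200, target_cas=100)

# While X is locked on D, GetMeta reports LOCKED_CAS instead of the real CAS,
# so no source CAS can ever compare as newer and X' is skipped.
locked_result = source_wins_lww(source_cas=200, target_cas=LOCKED_CAS)

print(unlocked_result, locked_result)
```

Because the sentinel is the largest 64-bit unsigned value, the comparison fails for every possible source CAS, which is why X' is never sent.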

Potential solutions

I can think of two possible solutions for when XDCR's GetMeta finds a locked document:

I'll let comment on these approaches - or perhaps he has a better solution.


comment below:
I’m currently leaning towards (B) as the way to deal with a locked document on the remote bucket.

There are a few reasons for this thought process:

  1. Ideally, we want to respect the target document’s locking mechanism. If a doc is sent optimistically and ends up failing conflict resolution (the locked doc was actually updated and the setWithMeta eventually lost), source XDCR would lose the ability to track that lost conflict.

  2. Customers are becoming increasingly aware of conflicts and more interested in learning about any conflicts that occur. (See https://couchbasecloud.atlassian.net/browse/MB-58989#icft=MB-58989).

  3. For HLV-based replication (i.e. XDCR/Mobile, https://couchbasecloud.atlassian.net/browse/MB-57921#icft=MB-57921 and https://couchbasecloud.atlassian.net/browse/MB-58989#icft=MB-58989), we are moving towards source XDCR performing Set/SetWithMeta with optimistic locking, especially since setting and updating the HLV should ensure that locking takes place.

  4. From the performance tests run during the investigation of https://couchbasecloud.atlassian.net/browse/MB-52947#icft=MB-52947, we know that the performance gain from optimistic replication is nice to have but not a must-have, specifically when weighing that gain against the visibility or functional gain from doing things pessimistically.

It may be worth starting to move in that direction, since it is more aligned with the future work to come.

Neil Huang January 3, 2024 at 6:55 PM

Release Notes
Problem Description: When a target document is locked and a non-optimistic LWW replication is taking place, XDCR retrieves a "locked CAS" of maxUint. This causes the source mutation to always lose, so a source mutation may not be replicated even though it should have won conflict resolution.
Resolution: In a pessimistic replication, XDCR will retry conflict resolution for as long as the target document is locked, to ensure that a valid CAS is used for source-side conflict resolution.
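
The retry behaviour described in the resolution could look roughly like the sketch below. This is not the goxdcr implementation: `resolve_with_retry`, `get_target_cas`, the backoff interval, and the CAS-only comparison are all assumptions made for illustration. The 30-second ceiling reflects the maximum lock duration mentioned in the description.

```python
import time
from typing import Callable, Optional

# Sentinel CAS that GetMeta reports for a locked document, per the description.
LOCKED_CAS = 0xFFFF_FFFF_FFFF_FFFF


def resolve_with_retry(
    source_cas: int,
    get_target_cas: Callable[[], int],  # hypothetical stand-in for a GetMeta call
    max_wait_s: float = 30.0,           # max lock duration cited in the description
    backoff_s: float = 0.5,             # assumed retry interval
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[bool]:
    """Return True/False once a real CAS is seen, or None if the doc stays locked."""
    waited = 0.0
    while True:
        target_cas = get_target_cas()
        if target_cas != LOCKED_CAS:
            # A valid CAS was obtained; apply the simplified LWW comparison.
            return source_cas > target_cas
        if waited >= max_wait_s:
            # Still locked after the wait budget; leave the decision to the caller.
            return None
        sleep(backoff_s)
        waited += backoff_s


# Usage: the target is locked for two GetMeta calls, then reports its real CAS.
responses = iter([LOCKED_CAS, LOCKED_CAS, 100])
result = resolve_with_retry(200, lambda: next(responses), sleep=lambda s: None)
print(result)
```

Injecting `sleep` keeps the sketch testable without real delays; what happens when the wait budget is exhausted is left to the caller here, since the ticket does not specify it.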

Ayush Nayyar December 22, 2023 at 2:09 PM

Verified on 7.2.4-7065.

CB robot November 29, 2023 at 9:05 PM

Build couchbase-server-7.2.4-7028 contains goxdcr commit c032cd1 with commit message:
https://couchbasecloud.atlassian.net/browse/MB-59888#icft=MB-59888: target document locked handling

Fixed

Is this a Regression?

No

Triage

Untriaged

Created November 29, 2023 at 5:07 PM
Updated February 6, 2025 at 6:31 PM
Resolved November 29, 2023 at 7:00 PM