Fixed
Details
Assignee: Ayush Nayyar
Reporter: Neil Huang
Is this a Regression?: No
Triage: Untriaged
Story Points: 0
Priority: Critical
Created November 29, 2023 at 5:07 PM
Updated February 6, 2025 at 6:31 PM
Resolved November 29, 2023 at 7:00 PM
Consider the following scenario:
A LWW XDCR relation from source S to destination D is established, and is performing pessimistic conflict resolution.
A GetLocked operation occurs on the destination bucket D against doc X.
A mutation X' occurs on the source bucket S while the document is still locked at D (max lock duration is 30s).
Then, when XDCR processes the mutation and issues a GetMeta against the destination, as the document is locked it will receive a CAS of 0xffff_ffff_ffff_ffff, and incorrectly consider the remote document newer and not replicate it.
This results in the logically newer mutation X' never being replicated to the destination. (A subsequent mutation X'' would be replicated, as long as it occurs after the lock against X on D has expired and the document hasn't been locked again.)
Potential solutions
I can think of two possible solutions for when XDCR's GetMeta finds a locked document:
(A) Fall back to optimistic replication - KV-Engine internally "knows" the underlying CAS value and will resolve the conflict correctly (once https://couchbasecloud.atlassian.net/browse/MB-59746#icft=MB-59746 is also addressed)
(B) Retry until the document is unlocked
I'll let @Neil Huang comment on these approaches - or perhaps he has a better solution.
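The failure mode and option (B) can be sketched as a small model. This is only an illustration under stated assumptions: `DestDoc`, `lww_source_wins`, `should_replicate_with_retry` and the `wait` callback are hypothetical names, not goxdcr or KV-Engine APIs, and LWW is reduced to a plain CAS comparison (the real resolution has further tie-breakers).

```python
from dataclasses import dataclass
from typing import Callable, Optional

# CAS value GetMeta reports for a locked document, per the scenario above.
LOCKED_CAS = 0xFFFF_FFFF_FFFF_FFFF


@dataclass
class DestDoc:
    """Hypothetical stand-in for document X on destination bucket D."""
    real_cas: int
    locked: bool = False

    def get_meta_cas(self) -> int:
        # While locked, the real CAS is hidden behind the all-ones value.
        return LOCKED_CAS if self.locked else self.real_cas


def lww_source_wins(source_cas: int, dest_cas: int) -> bool:
    """Simplified LWW check: the larger CAS wins (tie-breakers omitted)."""
    return source_cas > dest_cas


def should_replicate_with_retry(
    source_cas: int,
    dest: DestDoc,
    max_attempts: int,
    wait: Callable[[DestDoc], None],
) -> Optional[bool]:
    """Option (B): re-issue the metadata check until the document is unlocked.

    Returns True/False once a real CAS is seen, or None if the document is
    still locked after max_attempts (the caller decides what to do then).
    """
    for _ in range(max_attempts):
        dest_cas = dest.get_meta_cas()
        if dest_cas != LOCKED_CAS:
            return lww_source_wins(source_cas, dest_cas)
        wait(dest)  # hypothetical back-off hook; a real one would sleep
    return None
```

With a locked destination whose real CAS is lower than the source's, the plain comparison reports that the source loses, which is the bug described above. The retry variant only compares once a real CAS is visible. Since the maximum lock duration is 30s, any retry budget would presumably need to cover at least that long.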
@Neil Huang's comment below:
I’m currently leaning towards (B) as the way to deal with a locked document on the remote bucket.
There are a few reasons for this thought process:
Ideally, we want to respect the target document’s locking mechanism. If a doc is sent optimistically, and in the end it fails a conflict resolution (a locked doc was actually updated and the setWithMeta lost eventually), source XDCR would have lost the ability to track a lost conflict.
Customers are becoming increasingly aware of conflicts and more interested in learning about whatever conflicts show up. (See https://couchbasecloud.atlassian.net/browse/MB-58989#icft=MB-58989).
For HLV-based replication (i.e. XDCR/Mobile https://couchbasecloud.atlassian.net/browse/MB-57921#icft=MB-57921 and https://couchbasecloud.atlassian.net/browse/MB-58989#icft=MB-58989), we are moving towards a model where source XDCR will perform Set/SetWithMeta with optimistic locking, especially since setting and updating the HLV should ensure locking takes place.
From performance tests during the investigation of https://couchbasecloud.atlassian.net/browse/MB-52947#icft=MB-52947, we know that the performance gain from optimistic replication is nice to have but not a must-have, specifically when weighing the performance gain against the visibility or functional gain from doing things pessimistically.
It may be worth starting to march in that direction, since it is more aligned with the future work that's to come.