Details
Bug
Resolution: Not a Bug
Major
Morpheus, 7.6.2
None
Untriaged
0
Yes
Description
Consider the following setup:
- 2 clusters with 2 buckets set up for bi-directional replication. Buckets are empty. (C1/B1 <-> C2/B2)
- “enableCrossClusterVersion” is set up, meaning that XDCR will stamp HLV. Mobile is active.
- A mutation is written onto C1/B1, with CAS of 1.
- XDCR will stamp HLV to write to C2/B2.
- Because C2/B2 does not have the document, CAS “locking” will not be used.
  https://github.com/couchbase/goxdcr/blob/46103282dad323de6529ba6eb27a03171967068f/base/types.go#L826
- Because C2/B2 does not have the document, “NoTargetCR” will be set to “false”.
  https://github.com/couchbase/goxdcr/blob/46103282dad323de6529ba6eb27a03171967068f/parts/xmem_nozzle.go#L1273
- Because NoTargetCR is false, XDCR will use “SetWithMeta”.
  https://github.com/couchbase/goxdcr/blob/46103282dad323de6529ba6eb27a03171967068f/base/types.go#L861
- The document in C1/B1 to be sent to the target is now in memory and composed of:
  SetWithMeta
  HLV of Source: B1 Version: 1
  CasLocking: false
  CAS to set: 1
- Before the packet is sent over the wire, a client creates the document on C2/B2, with a CAS of 5.
- XDCR issues the SetWithMeta of C1/B1. Because NoTargetCR is set to false, the target performs CR; the incoming mutation loses, and the target returns EEXISTS.
  https://github.com/couchbase/kv_engine/blob/master/engines/ep/docs/protocol/set_with_meta.md
- C1 XDCR will go to receiveResponse given the EEXISTS error.
  https://github.com/couchbase/goxdcr/blob/46103282dad323de6529ba6eb27a03171967068f/parts/xmem_nozzle.go#L2510
- It is not going to be treated as a locking request, because:
  SET_WITH_META is being used
  CAS is set to 0 (casLocking was false)
- This situation is not handled, with the comment on https://github.com/couchbase/goxdcr/blob/46103282dad323de6529ba6eb27a03171967068f/parts/xmem_nozzle.go#L2535 indicating that “we can ignore the error”.
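The race in the steps above can be sketched in Go as follows. This is a simplified model, not goxdcr or kv_engine code: every type and function name is hypothetical, and conflict resolution is reduced to a plain CAS comparison, as in the example in this ticket.

```go
package main

import "fmt"

// All names below are hypothetical and exist only for illustration.
type status int

const (
	statusSuccess status = iota
	statusEExists
)

// targetDoc models the state of the document on C2/B2.
type targetDoc struct {
	exists bool
	cas    uint64
}

// setWithMetaReq models the request XDCR composed in memory on C1.
type setWithMetaReq struct {
	metaCas    uint64 // "CAS to set" carried with the mutation
	lockingCas uint64 // 0 means CAS locking is not used
	noTargetCR bool   // false means the target performs CR
}

// applySetWithMeta is an assumed, CAS-only model of target-side CR:
// the incoming mutation loses unless its CAS is larger than the target's.
func applySetWithMeta(doc *targetDoc, req setWithMetaReq) status {
	if !req.noTargetCR && doc.exists && req.metaCas <= doc.cas {
		return statusEExists
	}
	doc.exists, doc.cas = true, req.metaCas
	return statusSuccess
}

// isLockingRequest mirrors the "CAS is set to 0" check described above.
func isLockingRequest(req setWithMetaReq) bool {
	return req.lockingCas != 0
}

func main() {
	target := &targetDoc{} // C2/B2 is empty when the request is built
	req := setWithMetaReq{metaCas: 1, lockingCas: 0, noTargetCR: false}

	// A client write lands on C2/B2 before the packet goes over the wire.
	target.exists, target.cas = true, 5

	st := applySetWithMeta(target, req)
	fmt.Println(st == statusEExists, isLockingRequest(req), target.cas) // true false 5
}
```

The response is an EEXISTS for a request that was not a locking request, which is exactly the combination the ticket describes as unhandled.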
The error should not be ignored, because ignoring it means XDCR loses track of a failed CR event. In terms of stats, it would end up with 1 mutation received from DCP, but 0 sent and 0 failed CR.
What’s worse, because no event is emitted for this mutation, the through seqno tracker service (throughSeqnoTrackerSvc) never learns its outcome, leading to a non-zero changes_left. throughSeqnoTrackerSvc would also be unable to truncate its list, which keeps growing.
Expected Behavior:
- The correct handling here is to log the conflict as a lost conflict resolution, and not retry.
- A proper event should be raised so that throughSeqnoTrackerSvc knows how to move on.
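A possible shape for the expected handling is sketched below. It is not the actual receiveResponse logic in xmem_nozzle.go: the event names, the `response` and `outcome` types, and the retry behavior for CAS-locking failures are all assumptions made for illustration.

```go
package main

import "fmt"

// Hypothetical event kinds; the real goxdcr event names may differ.
type eventType int

const (
	eventNone eventType = iota
	eventDataSent
	eventFailedCR
)

// response is an assumed summary of what the nozzle knows when a reply arrives.
type response struct {
	eExists    bool // target replied EEXISTS
	casLocking bool // the request carried a non-zero locking CAS
}

// outcome says which event to raise and whether to retry the mutation.
type outcome struct {
	event eventType
	retry bool
}

func handleResponse(r response) outcome {
	switch {
	case !r.eExists:
		return outcome{event: eventDataSent}
	case r.casLocking:
		// Assumption: a CAS-locking failure means the target document changed
		// after it was read, so the mutation is retried.
		return outcome{retry: true}
	default:
		// Non-locking SetWithMeta lost target-side CR: record it as a
		// failed CR and do not retry, so the seqno is accounted for.
		return outcome{event: eventFailedCR}
	}
}

func main() {
	o := handleResponse(response{eExists: true, casLocking: false})
	fmt.Println(o.event == eventFailedCR, o.retry) // true false
}
```

The point of the default branch is that every mutation ends in exactly one of "event raised" or "retry scheduled", so no seqno is silently dropped.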
Future Work:
This fix is needed for True Conflict logging. We need to ensure that this case is captured so that it is logged as part of the true conflict work.
Side Note: The example in XDCR/Mobile document will possibly also lead to this situation.