  1. Couchbase Server
  2. MB-60733

XDCR - Failed Target CR with HLV can lead to missed event handling

Details

    • Bug
    • Resolution: Not a Bug
    • Major
    • Morpheus, 7.6.2
    • Morpheus, 7.6.2
    • XDCR
    • None

    Description

       

      Consider the following setup:

       

      1. Two clusters, each with one bucket, are set up for bi-directional replication. Both buckets are empty. (C1/B1 <-> C2/B2)
      2. “enableCrossClusterVersion” is enabled, meaning that XDCR will stamp HLV. Mobile is active.
      3. A mutation is written onto C1/B1, with CAS of 1.
      4. XDCR will stamp HLV to write to C2/B2.
      5. Because C2/B2 does not have the document, CAS “locking” will not be used.
        https://github.com/couchbase/goxdcr/blob/46103282dad323de6529ba6eb27a03171967068f/base/types.go#L826
      6. Because C2/B2 does not have the document, “NoTargetCR” will be set to “false”.
        https://github.com/couchbase/goxdcr/blob/46103282dad323de6529ba6eb27a03171967068f/parts/xmem_nozzle.go#L1273
      7. Because NoTargetCR is false, XDCR will use “SetWithMeta”.
        https://github.com/couchbase/goxdcr/blob/46103282dad323de6529ba6eb27a03171967068f/base/types.go#L861
      8. The document in C1/B1 to be sent to the target is now in memory and composed of:
        SetWithMeta
        HLV of Source: B1 Version: 1
        CasLocking: false
        CAS to set: 1
      9. Before the packet is sent over the wire, a client creates a document on C2/B2 with a CAS of 5.
      10. XDCR issues the SetWithMeta of C1/B1. Because NoTargetCR is set to false, the target performs CR, the incoming mutation loses, and the target returns EEXISTS.
        https://github.com/couchbase/kv_engine/blob/master/engines/ep/docs/protocol/set_with_meta.md
      11. C1 XDCR will go to receiveResponse given the EEXISTS error.
        https://github.com/couchbase/goxdcr/blob/46103282dad323de6529ba6eb27a03171967068f/parts/xmem_nozzle.go#L2510
      12. The request is not treated as a locking request, because:
        SET_WITH_META is being used
        CAS is set to 0 (casLocking was false)
      13. This situation is not handled; the comment at https://github.com/couchbase/goxdcr/blob/46103282dad323de6529ba6eb27a03171967068f/parts/xmem_nozzle.go#L2535 indicates that “we can ignore the error”.
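
The response path in steps 10 to 13 can be sketched as follows. This is only a minimal model of the behavior described in this ticket, not goxdcr's code: every type and function name here is hypothetical, and the rule that a non-zero header CAS marks a locking request is an assumption taken from step 12.

```go
package main

// Hypothetical model of steps 8 to 13. None of these names exist in goxdcr.

type Opcode int

const (
	SetWithMeta Opcode = iota
)

type Status int

const (
	Success Status = iota
	KeyEExists
)

type Outcome int

const (
	OutcomeSent          Outcome = iota // target accepted the mutation
	OutcomeCasLockFailed                // locking request rejected by the target
	OutcomeIgnored                      // EEXISTS dropped without raising any event
)

// Request models the in-memory packet from step 8.
type Request struct {
	Op         Opcode
	HeaderCas  uint64 // assumed: non-zero only when CAS locking is used
	DocCas     uint64 // CAS carried as document metadata
	NoTargetCR bool   // false means the target performs CR itself
}

// classify models what the ticket describes for receiveResponse:
// an EEXISTS on a non-locking request falls through and is ignored.
func classify(req Request, status Status) Outcome {
	if status == Success {
		return OutcomeSent
	}
	if req.HeaderCas != 0 {
		return OutcomeCasLockFailed
	}
	// Step 13: no event is raised for this mutation.
	return OutcomeIgnored
}
```

With the packet from step 8 (`HeaderCas: 0`, `DocCas: 1`, `NoTargetCR: false`) and an EEXISTS status, `classify` returns `OutcomeIgnored`, which is the case this ticket is about.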

      The error should not be ignored, because ignoring it means XDCR loses track of a failed CR event. In terms of stats, it would end up with 1 mutation received from DCP, but 0 sent and 0 failed CR.

      Worse, because no event is emitted, the through seqno tracker service never learns the outcome of this mutation, which leaves changes_left at a non-zero value. throughSeqnoTrackerSvc also cannot truncate its lists, so they keep growing.
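
To illustrate why a single missing event stalls progress, here is a toy tracker. It is a sketch under stated assumptions, not the real throughSeqnoTrackerSvc: the `tracker` type and its methods are hypothetical, and it assumes the through seqno may only advance across a contiguous run of seqnos that have each received a terminal event.

```go
package main

// tracker is a hypothetical, simplified stand-in for a through seqno tracker.
type tracker struct {
	received []uint64        // seqnos seen from DCP, in ascending order
	done     map[uint64]bool // seqnos that got a terminal event (sent, failed CR, ...)
	through  uint64          // highest seqno with no unresolved seqno at or below it
}

func newTracker() *tracker {
	return &tracker{done: make(map[uint64]bool)}
}

// Received records a mutation read from DCP. Callers pass ascending seqnos.
func (t *tracker) Received(seqno uint64) {
	t.received = append(t.received, seqno)
}

// Done records that a terminal event was raised for seqno.
func (t *tracker) Done(seqno uint64) {
	t.done[seqno] = true
}

// Advance moves the through seqno across the leading run of resolved
// seqnos and truncates them. It stops at the first unresolved seqno.
func (t *tracker) Advance() uint64 {
	i := 0
	for i < len(t.received) && t.done[t.received[i]] {
		t.through = t.received[i]
		delete(t.done, t.received[i])
		i++
	}
	t.received = t.received[i:]
	return t.through
}

// Pending reports how many received seqnos are still being tracked.
func (t *tracker) Pending() int {
	return len(t.received)
}
```

If seqnos 1, 2 and 3 are received and only 2 and 3 get an event, `Advance` stays at 0 and `Pending` stays at 3. Nothing can be truncated until seqno 1 is resolved.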

       

      Expected Behavior:

      • The correct handling here is to log the conflict as a lost conflict resolution, and not retry.
      • A proper event should be raised so that throughSeqnoTrackerSvc knows how to move on.
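
One possible shape for this handling is sketched below. It is not a proposed patch: the event names, the `response` struct, and the `raise` callback are all hypothetical, and the retry on a failed locking request is an assumption rather than something this ticket states.

```go
package main

type EventType int

const (
	EventDataSent       EventType = iota // hypothetical: mutation accepted by target
	EventTargetCRFailed                  // hypothetical: mutation lost CR on the target
)

type Event struct {
	Type  EventType
	Seqno uint64
}

// response is a hypothetical summary of what the target returned.
type response struct {
	seqno      uint64
	eexists    bool // target returned EEXISTS
	casLocking bool // the request used CAS locking
}

// handleResponse raises a terminal event whenever the mutation is finished
// and reports whether the caller should retry it.
func handleResponse(r response, raise func(Event)) (retry bool) {
	switch {
	case !r.eexists:
		raise(Event{Type: EventTargetCRFailed - 1, Seqno: r.seqno}) // EventDataSent
		return false
	case r.casLocking:
		// Assumption: a failed locking request is retried, so no terminal event yet.
		return true
	default:
		// Non-locking EEXISTS: record a lost conflict resolution and do not retry.
		raise(Event{Type: EventTargetCRFailed, Seqno: r.seqno})
		return false
	}
}
```

The point of the default branch is that the mutation still reaches a terminal state from the tracker's point of view, even though nothing was written to the target.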

       

      Future Work:

      This fix is needed for True Conflict logging. We need to ensure that this case is captured so that it is logged as part of the true conflict work.

      Side Note: The example in XDCR/Mobile document will possibly also lead to this situation.

          People

            sumukh.bhat Sumukh Bhat
            neil.huang Neil Huang