  1. Couchbase Server
  2. MB-61065

Eventing/SGW co-existence design incompatible with bi-directional XDCR

Details

    • Improvement
    • Resolution: Unresolved
    • Critical
    • Morpheus, 7.6.2
    • Morpheus
    • eventing
    • 0

    Description

      One-Line Summary
      According to the Eventing design doc (MB-50944), every "SBM" write stores the Eventing function ID inside the document. In an active-active deployment where each cluster runs its own Eventing function with its own "fiid", this metadata causes XDCR to ping-pong and never stop replicating.

      Eventing Doc

      Eventing/SGW Support Doc: https://docs.google.com/document/d/1KfgW6SqETp_vviCjRAklmYKx6mBoi2LqWBo92L9w-Hg/edit#heading=h.lcogjmc7wx92

      Issue

      Eventing's SBM metadata contains a “fiid” field and an Eventing.PCAS field. Together they record the UUID of an Eventing “actor” and the document version on which that actor ran.
      Eventing then writes the decorated document back to the source bucket, and that write arrives as a new mutation over DCP.
      The fiid and Eventing.PCAS pair is meant to act as a check: Eventing determines that the new mutation (the one it wrote back itself in the first pass) is effectively a no-op and does not decorate the document again.

      See: https://docs.google.com/document/d/1KfgW6SqETp_vviCjRAklmYKx6mBoi2LqWBo92L9w-Hg/edit#heading=h.4i8zi04h8fwb
      The statement:

      if (xattr.hasOwnProperty('_eventing') && xattr['_eventing'].fiid == current_fiid) {
          // give priority to matching cas if it exists
          if (xattr['_eventing'].hasOwnProperty('cas')) {
              return xattr['_eventing']['cas'] == meta.cas;
          } else {
              return xattr['_eventing']['seq'] == meta.seq;
          }
      }

      However, this check does not take into account that XDCR must now replicate the change to the other buckets/clusters in the topology.
      If each target cluster's bucket also has Eventing running, and by design the other Eventing actors do not share the same fiid, the result is an infinite ping-pong.
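To make the failure concrete, here is a minimal sketch that wraps the statement above in a function and evaluates it from both clusters' points of view. The function name `alreadyHandled`, the trailing `return false`, and the sample values are assumptions added for illustration; they are not taken from the design doc.

```javascript
// Sketch only: the design-doc check wrapped in a hypothetical function.
function alreadyHandled(xattr, meta, currentFiid) {
  if (xattr.hasOwnProperty('_eventing') && xattr['_eventing'].fiid == currentFiid) {
    // give priority to matching cas if it exists
    if (xattr['_eventing'].hasOwnProperty('cas')) {
      return xattr['_eventing']['cas'] == meta.cas;
    } else {
      return xattr['_eventing']['seq'] == meta.seq;
    }
  }
  // Assumption: a missing stamp or a foreign fiid means "not handled yet".
  return false;
}

// A document decorated by C2 ("ec2") and replicated unchanged to C1.
const xattr = { _eventing: { fiid: 'ec2', cas: 150 } };
const meta = { cas: 150 };

const seenByC2 = alreadyHandled(xattr, meta, 'ec2'); // C2 recognizes its own write
const seenByC1 = alreadyHandled(xattr, meta, 'ec1'); // C1 does not recognize it
console.log({ seenByC2, seenByC1 });
```

Under these assumptions the same document is a no-op for C2 but looks unhandled to C1, which is the mismatch that triggers another decoration on C1.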

      The root cause is that the Eventing xattr is one-dimensional: it cannot record causality between multiple Eventing actors.

      See the example below:

      C1 has Eventing running, with eventing function ID “ec1”
      C2 has Eventing running, with eventing function ID “ec2”

      SDK Writes Doc A

      C1:
      CAS: 100
      

      C1 XDCR replicates to C2

      			C2:
      			CAS: 100
      			CvCAS: 100
      

      C2 eventing fires

      			C2:
      			CAS: 150
      			CvCAS: 100
      			Eventing.CAS: 150
      			Eventing.PCAS: 100
      			Eventing.fiid: “ec2”
      

      Meanwhile, Eventing on C1 sees that its copy of the document has not been handled, and fires:

      C1:
      CAS: 120
      CvCAS: 100
      Eventing.CAS: 120
      Eventing.PCAS: 100
      Eventing.fiid: “ec1”
      

      C1's version loses XDCR conflict resolution (CAS 120 < CAS 150).

      C2's version wins (CAS 150 > CAS 120); C2 composes the HLV and sends the document.

      Doc received on C1 from C2:

      C1:
      CAS: 150
      CvCAS: 150
      Eventing.CAS: 150
      Eventing.PCAS: 100
      Eventing.fiid: “ec2”
      

      C1's fiid is “ec1”, so Eventing on C1 re-runs because of the fiid mismatch and re-tags the document with “ec1”:

      C1:
      CAS: 170
      CvCAS: 150
      Eventing.CAS: 170
      Eventing.PCAS: 150
      Eventing.fiid: “ec1”
      

      XDCR: C1 wins over C2 (CAS 170 > CAS 150)

      C1 sends the doc over to C2:

      			C2:
      			CAS: 170
      			CvCAS: 170
      			Eventing.CAS: 170
      			Eventing.PCAS: 150
      			Eventing.fiid: “ec1”
      

      The fiid no longer matches C2's fiid, so Eventing on C2 fires again and XDCR replicates the result from C2 back to C1.
      <Repeat>
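The loop above can be reproduced with a small, simplified model. This is a sketch under stated assumptions, not the real Eventing or XDCR logic: a shared counter stands in for CAS generation, steps run sequentially rather than concurrently, CvCAS/HLV are omitted, and all names (`simulate`, `runEventing`, `replicate`) are hypothetical.

```javascript
// Hypothetical model: a shared counter stands in for CAS generation.
let clock = 100;
const nextCas = () => (clock += 10);

// Handled only if this cluster's own fiid stamped exactly the current CAS.
function alreadyHandled(cluster) {
  const ev = cluster.doc.eventing;
  return Boolean(ev) && ev.fiid === cluster.fiid && ev.cas === cluster.doc.cas;
}

// Eventing decorates the document and writes it back (new CAS, own fiid).
function runEventing(cluster) {
  if (cluster.doc === null || alreadyHandled(cluster)) return false;
  const pcas = cluster.doc.cas;
  const cas = nextCas();
  cluster.doc = { cas, eventing: { cas, pcas, fiid: cluster.fiid } };
  return true;
}

// XDCR with CAS-based conflict resolution: the higher CAS wins.
function replicate(src, dst) {
  if (src.doc === null) return false;
  if (dst.doc !== null && dst.doc.cas >= src.doc.cas) return false;
  dst.doc = { ...src.doc, eventing: src.doc.eventing && { ...src.doc.eventing } };
  return true;
}

// Run bounded rounds of C1 -> C2 -> C1 and report whether a quiet round occurs.
function simulate(fiid1, fiid2, maxRounds) {
  clock = 100;
  const c1 = { fiid: fiid1, doc: { cas: clock } }; // SDK writes Doc A on C1
  const c2 = { fiid: fiid2, doc: null };
  let fires = 0;
  for (let round = 0; round < maxRounds; round++) {
    replicate(c1, c2);
    const fired2 = runEventing(c2);
    replicate(c2, c1);
    const fired1 = runEventing(c1);
    if (!fired1 && !fired2) return { quiesced: true, rounds: round, fires };
    fires += Number(fired1) + Number(fired2);
  }
  return { quiesced: false, rounds: maxRounds, fires };
}

console.log(simulate('ec1', 'ec2', 5));       // distinct fiids, as in the ticket
console.log(simulate('shared', 'shared', 5)); // control case only, not a proposed fix
```

In this model, distinct fiids never reach a quiet round within the bound because each cluster keeps re-decorating the other's write, while the control case settles after a single decoration. The control is there only to isolate the fiid mismatch as the trigger.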


            People

              abhishek.jindal Abhishek Jindal
              neil.huang Neil Huang