Description
One-Line Summary
According to the Eventing design doc (MB-50944), on every "SBM" write Eventing stores the eventing ID inside the document. In an active-active deployment where each cluster runs its own eventing function with its own eventing "fiid", this metadata causes XDCR to ping-pong between the clusters and never stop replicating.
Eventing Doc
Eventing/SGW Support Doc: https://docs.google.com/document/d/1KfgW6SqETp_vviCjRAklmYKx6mBoi2LqWBo92L9w-Hg/edit#heading=h.lcogjmc7wx92
Issue
Eventing's SBM field contains a “fiid” field as well as an Eventing.PCAS field. Together they indicate the UUID of an eventing “actor” and the version of the document on which that actor performed eventing.
Eventing then writes the decorated document back to the source bucket. This write triggers a new mutation that is delivered through DCP.
The fiid and Eventing.PCAS pair is meant to act as a check: Eventing determines that the new mutation (the one it wrote back itself in the first pass) is essentially a no-op and does not decorate the document again.
See: https://docs.google.com/document/d/1KfgW6SqETp_vviCjRAklmYKx6mBoi2LqWBo92L9w-Hg/edit#heading=h.4i8zi04h8fwb
The statement:
if (xattr.hasOwnProperty('_eventing') && xattr['_eventing'].fiid == current_fiid) {
    // give priority to matching cas if it exists
    if (xattr['_eventing'].hasOwnProperty('cas')) {
        return xattr['_eventing']['cas'] == meta.cas;
    } else {
        return xattr['_eventing']['seq'] == meta.seq;
    }
}
However, this check does not take into account that XDCR must also replicate the change to the other buckets/clusters in the topology.
If each target cluster's bucket also has Eventing running and, by design, the eventing actors do not share the same fiid, the result is an infinite ping-pong.
The root cause is that the eventing xattr is one-dimensional: it cannot record causality between multiple eventing actors.
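The failure can be reproduced by wrapping the quoted check in a standalone function and running it against a document stamped by the other cluster's function. This is only a minimal sketch: the name `isOwnMutation`, the explicit `currentFiid` parameter, the `false` fall-through, and the sample CAS values are assumptions made for illustration and are not taken from the Eventing source.

```javascript
// Hypothetical wrapper around the check quoted above.
// Returns true only when this cluster's own function stamped this exact version.
function isOwnMutation(xattr, meta, currentFiid) {
  const ev = xattr['_eventing'];
  if (ev !== undefined && ev.fiid === currentFiid) {
    // give priority to matching cas if it exists
    if (Object.prototype.hasOwnProperty.call(ev, 'cas')) {
      return ev.cas === meta.cas;
    }
    return ev.seq === meta.seq;
  }
  // Assumption: a missing stamp or a foreign fiid means "not handled yet".
  return false;
}

// Document as written back by the function on C2 (fiid "ec2").
const stampedByEc2 = { _eventing: { fiid: 'ec2', cas: 150 } };
const meta = { cas: 150 };

console.log(isOwnMutation(stampedByEc2, meta, 'ec2')); // true: C2 recognizes its own write
console.log(isOwnMutation(stampedByEc2, meta, 'ec1')); // false: C1 treats it as unhandled
```

Because the stamp holds a single fiid, the function on C1 has no way to tell that the document was already processed by another actor, so it decorates the document again.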
See the example below:
C1 has Eventing running, with eventing function ID “ec1”
C2 has Eventing running, with eventing function ID “ec2”
SDK Writes Doc A
C1:
    CAS: 100
C1 XDCR replicates to C2
C2:
    CAS: 100
    CvCAS: 100
C2 eventing fires
C2:
    CAS: 150
    CvCAS: 100
    Eventing.CAS: 150
    Eventing.PCAS: 100
    Eventing.fiid: “ec2”
Meanwhile, Eventing on C1 sees that the document has not been handled and fires as well:
C1:
    CAS: 120
    CvCAS: 100
    Eventing.CAS: 120
    Eventing.PCAS: 100
    Eventing.fiid: “ec1”
XDCR: C1's version loses (CAS 120 < CAS 150)
XDCR: C2's version wins (CAS 150 > CAS 120)
XDCR on C2 composes the HLV and sends the document
Doc received on C1 from C2:
C1:
    CAS: 150
    CvCAS: 150
    Eventing.CAS: 150
    Eventing.PCAS: 100
    Eventing.fiid: “ec2”
The fiid of C1 is “ec1”, so Eventing on C1 re-runs because of the fiid mismatch and retags the fiid as “ec1”:
C1:
    CAS: 170
    CvCAS: 150
    Eventing.CAS: 170
    Eventing.PCAS: 150
    Eventing.fiid: “ec1”
XDCR: C1's version wins over C2's (CAS 170 > CAS 150)
C1 sends the doc over to C2:
C2:
    CAS: 170
    CvCAS: 170
    Eventing.CAS: 170
    Eventing.PCAS: 150
    Eventing.fiid: “ec1”
The fiid no longer matches C2's fiid, so Eventing on C2 fires again, and XDCR replicates from C2 to C1.
<Repeat>
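The loop in the example can be approximated with a toy model of two clusters. This is a rough sketch under stated assumptions, not the actual Eventing or XDCR logic: the `simulate` function, the fixed CAS increment of 10, and the round-based scheduling are all invented for illustration, CvCAS and the HLV are left out, and conflict resolution is reduced to "higher CAS wins" as in the example. A `null` fiid stands for a cluster with no eventing function.

```javascript
// Toy model of the ping-pong. All names and the CAS counter are illustrative.
function simulate(fiidC1, fiidC2, maxRounds) {
  let nextCas = 100; // stand-in for the real CAS generator
  const clusters = [
    { fiid: fiidC1, doc: { cas: nextCas, eventing: null } }, // SDK writes Doc A on C1
    { fiid: fiidC2, doc: null },
  ];
  let eventingRuns = 0;

  // Eventing skips a document only if its own fiid stamped this exact version.
  function fire(cluster) {
    const doc = cluster.doc;
    if (cluster.fiid === null || doc === null) return false;
    const ev = doc.eventing;
    if (ev !== null && ev.fiid === cluster.fiid && ev.cas === doc.cas) return false;
    const pcas = doc.cas;
    nextCas += 10;
    cluster.doc = { cas: nextCas, eventing: { fiid: cluster.fiid, cas: nextCas, pcas } };
    eventingRuns += 1;
    return true;
  }

  // XDCR (simplified): the copy with the higher CAS wins.
  function replicate(src, dst) {
    if (src.doc === null) return false;
    if (dst.doc !== null && src.doc.cas <= dst.doc.cas) return false;
    dst.doc = { cas: src.doc.cas, eventing: src.doc.eventing && { ...src.doc.eventing } };
    return true;
  }

  const [c1, c2] = clusters;
  for (let round = 1; round <= maxRounds; round++) {
    const changed = [fire(c1), fire(c2), replicate(c1, c2), replicate(c2, c1)];
    // A round with no eventing run and no replication means the system is quiet.
    if (!changed.includes(true)) return { converged: true, rounds: round, eventingRuns };
  }
  return { converged: false, rounds: maxRounds, eventingRuns };
}

console.log(simulate('ec1', 'ec2', 20).converged); // false: still ping-ponging after 20 rounds
console.log(simulate('ec1', null, 20).converged);  // true: only one eventing actor
```

In this model the two-actor case never reaches a quiet round no matter how large `maxRounds` is, because each replicated document always carries the other cluster's fiid.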
Issue Links
- is parent task of: DOC-12127 Doc for Eventing/SGW co-existence design incompatible with bi-directional XDCR (Open)