  1. Couchbase Server
  2. MB-60346

XDCR - Use Set instead of SetWithMeta to prevent CAS rollback in specific mobile scenario

Details

    • Bug
    • Resolution: Unresolved
    • Major
    • Morpheus, 7.6.2
    • Morpheus
    • XDCR
    • None

    Description

      Let's say there are two clusters, A and B.

      Cluster A Document:
      Cas: 100
       
      Cluster B Document:
      Cas: 100
      

      The document is mutated by the application in cluster A (source) while the same document is mutated by the import process in cluster B (target).

      Cluster A Document: 
      Cas: 110
       
      Cluster B Document:	
      Cas: 120
      cvCas: 100
      ImportCas: 120
      _sync: <xyz>
      

      Based on the spec, XDCR should use CAS in A and cvCAS in B to perform conflict resolution.  

      CAS(A) > cvCAS(B), so the mutation is replicated to cluster B.   
      

      However, CAS(A) < CAS(B).
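
      The comparison above can be sketched as follows. This is a minimal illustration, not goxdcr code: the `Doc` type and the `effective_cas` and `source_wins` helpers are hypothetical, and the rule that cvCas stands in for CAS only when importCas equals the document CAS is an assumption inferred from this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Doc:
    cas: int
    cv_cas: Optional[int] = None      # cvCas from the HLV, if present
    import_cas: Optional[int] = None  # importCas, written by the mobile import


def effective_cas(doc: Doc) -> int:
    """CAS used for conflict resolution (assumed rule, see the note above)."""
    # Assumption: if the latest write was an import (importCas == CAS),
    # cvCas represents the last non-import mutation and is used instead.
    is_import = doc.import_cas is not None and doc.import_cas == doc.cas
    if is_import and doc.cv_cas is not None:
        return doc.cv_cas
    return doc.cas


def source_wins(source: Doc, target: Doc) -> bool:
    return effective_cas(source) > effective_cas(target)


cluster_a = Doc(cas=110)
cluster_b = Doc(cas=120, cv_cas=100, import_cas=120)

print(source_wins(cluster_a, cluster_b))  # True: CAS(A) 110 > cvCAS(B) 100
print(cluster_a.cas < cluster_b.cas)      # True: yet the raw target CAS is higher
```

      The two printed values together show the tension in this ticket: the source wins conflict resolution even though its CAS is lower than the CAS currently stored on the target.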

      Currently, the XDCR code does the following:

      • It does the get/getMeta as necessary to retrieve the HLV, etc.
      • It preserves the HLV/_sync as necessary and, at the same time, updates/prunes the HLV (xmem.updateSystemXattrForTarget).
      • If the target document exists, it ensures that setWithMeta has CAS locking (i.e. setWithMeta will only succeed if the target CAS hasn’t changed since the get/getMeta in the first step).
      • If a target doc with an HLV exists, it ensures that SKIP_CONFLICT_RESOLUTION_FLAG is set (https://github.com/couchbase/kv_engine/blob/master/engines/ep/docs/protocol/set_with_meta.md#skip_conflict_resolution_flag) (xmem.setSkipTargetCR).
      • XDCR then issues the SetWithMeta command.

      The KV team says:

      A setWithMeta with SKIP_CONFLICT_RESOLUTION_FLAG will result in the input CAS (extras.cas) being used in the resulting update (so yes will rollback)

      XDCR sets the extras CAS to the DCP event’s CAS (https://github.com/couchbase/goxdcr/blob/c425a70f87d525b0ed207abc27460b581eefea17/parts/router.go#L1539).
      Because the local (source) mutation “wins”, XDCR also skips composing importCAS as part of the setWithMeta, since a winning source document is not considered an import document.
      XDCR sets cvCas equal to CAS to indicate an up-to-date HLV.

      SetWithMeta will succeed, resulting in the following situation:

      Cluster A Document: 
      Cas: 110
       
      Cluster B Document:	
      Cas: 110
      cvCas: 110
      _sync: <preserved from Cas 120 document>
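
      The rollback can be reproduced with a toy model of the write. This is only a sketch of the behavior described above, not the KV engine's logic: the `set_with_meta` function, its parameters, and the dictionary layout are hypothetical, and the non-skip branch is a simplified CAS-only comparison.

```python
def set_with_meta(target: dict, incoming: dict, expected_cas: int, skip_cr: bool) -> dict:
    """Toy model of a CAS-locked setWithMeta; returns the stored document."""
    # CAS locking: the write fails if the target changed since it was read.
    if target["cas"] != expected_cas:
        raise RuntimeError("target changed since it was read")
    # Simplified conflict resolution: without the skip flag, an incoming
    # document with a lower or equal CAS leaves the target untouched.
    if not skip_cr and incoming["cas"] <= target["cas"]:
        return target
    # With conflict resolution skipped, the incoming CAS is stored as-is,
    # even when it is lower than the CAS already on the target.
    return dict(incoming)


target = {"cas": 120, "cvCas": 100, "importCas": 120, "_sync": {"cas": 120}}
# XDCR sends the source CAS, cvCas == CAS, no importCas, and the preserved _sync.
incoming = {"cas": 110, "cvCas": 110, "_sync": target["_sync"]}

result = set_with_meta(target, incoming, expected_cas=120, skip_cr=True)
print(result["cas"], "<", target["cas"])  # the target CAS moved backwards
```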
      

      This behavior can lead to two interesting cases on the mobile side:

      Firstly:

      The new write on cluster A (CAS 110) hasn't been imported on cluster B, so it will trigger an import. We compare _sync.cas to the document CAS to determine whether the document needs to be imported. In this case, _sync.cas on cluster B will be 120, which does not match the document CAS of 110.

      Secondly:

      I think there's potentially an interesting case where the CAS of the mutation on cluster A is identical to the CAS of the import on cluster B. This would look like:

      Cluster A Document: 
      Cas: 110
       
      Cluster B Document:	
      Cas: 110
      cvCas: 100
      ImportCas: 110
      _sync: {cas:110}
      

      Here XDCR will copy over the cluster A document with CAS 110, but Sync Gateway will not import it because the document CAS matches _sync.cas, which is a problem.
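
      The two mobile-side cases above come down to a single comparison. The sketch below assumes the import decision is exactly "_sync.cas differs from the document CAS", as described in this ticket; the `needs_import` helper and the dictionary layout are hypothetical and not Sync Gateway code.

```python
def needs_import(doc: dict) -> bool:
    """Import is triggered when _sync.cas does not match the document CAS."""
    return doc["_sync"]["cas"] != doc["cas"]


# Cluster B after the SetWithMeta in each case (CAS rolled back to 110).
first_case = {"cas": 110, "cvCas": 110, "_sync": {"cas": 120}}
second_case = {"cas": 110, "cvCas": 110, "_sync": {"cas": 110}}

print(needs_import(first_case))   # True: 120 != 110, so the import runs
print(needs_import(second_case))  # False: the replicated write is never imported
```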

      Solution
      To solve this in the interest of both mobile and XDCR, we should:
      1 - Prevent the write from rolling back the CAS on the target cluster by allowing KV to regenerate a new CAS as part of the write.
      2 - Utilize macro expansion on the remote KV (not available with SetWithMeta) to update the cvCAS to match the regenerated CAS, without changing src/vrs.

      Using Set with Macro expansion will lead to:

      Cluster A Document: 
      Cas: 110
       
      Cluster B Document:	
      Cas: 130 (auto generated)
      cvCas: 130 (macro expansion)
      _sync: <preserved from Cas 120 document>
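
      The proposed write can be sketched with a toy in-memory store. Everything here is an assumption made for illustration: the `ToyBucket` class, its `set` signature, and the fixed CAS step are hypothetical. The placeholder string is named after Couchbase's `${Mutation.CAS}` xattr macro, but the expansion is simplified to a plain integer.

```python
import itertools

CAS_MACRO = "${Mutation.CAS}"  # placeholder expanded by the toy store below


class ToyBucket:
    """In-memory stand-in for the target KV; not a real Couchbase client."""

    def __init__(self, next_cas: int):
        # Hypothetical CAS generator: each write gets a strictly larger CAS.
        self._cas_source = itertools.count(next_cas, 10)
        self.docs = {}

    def set(self, key: str, body: str, xattrs: dict, expected_cas: int) -> int:
        current = self.docs.get(key)
        # Optimistic locking: succeed only if the target has not changed.
        if current is not None and current["cas"] != expected_cas:
            raise RuntimeError("CAS mismatch: target changed since it was read")
        new_cas = next(self._cas_source)  # the store regenerates the CAS
        # Macro expansion: every placeholder is replaced with the new CAS.
        expanded = {k: (new_cas if v == CAS_MACRO else v) for k, v in xattrs.items()}
        self.docs[key] = {"cas": new_cas, "body": body, **expanded}
        return new_cas


bucket = ToyBucket(next_cas=130)
bucket.docs["doc"] = {"cas": 120, "body": "imported on B",
                      "cvCas": 100, "importCas": 120, "_sync": {"cas": 120}}

# XDCR writes the winning source body, preserves _sync, and asks for cvCas == new CAS.
new_cas = bucket.set("doc", "mutated on A",
                     {"cvCas": CAS_MACRO, "_sync": {"cas": 120}}, expected_cas=120)
print(bucket.docs["doc"])
```

      Because the stored CAS (130) now differs from the preserved _sync CAS (120), the import check still fires, and the target CAS never moves backwards.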
      

      From here, there are two additional operations:
      Scenario 1
      1. Mobile Import will take place because the _sync property was associated with CAS of 120, which is different from the document CAS of 130.

      Cluster A Document: 
      Cas: 110
       
      Cluster B Document:	
      Cas: 140
      cvCas: 130
      importCas: 140
      _sync: <associates with CAS 140>
      

      2. This will lead to one more document write from B to A, because document A has no importCas, leading to CAS convergence:

      Cluster A Document: 
      Cas: 140
      cvCas: 140
       
      Cluster B Document:	
      Cas: 140
      cvCas: 130
      importCas: 140
      _sync: <associates with CAS 140>
      

      If the mobile import takes place after the CAS convergence, the order will be different, leading to different details but essentially the same number of ops:
      Scenario 2
      1a. CAS convergence from B to A by XDCR

      Cluster A Document: 
      Cas: 130
      cvCAS: 130
       
      Cluster B Document:	
      Cas: 130
      cvCas: 130
      _sync: <preserved from Cas 120 document>
      

      2a. Mobile import takes place, after which replication from B to A is a no-op because XDCR on B will determine that cvCas(B) == CAS(A)

      Cluster A Document: 
      Cas: 130
      cvCAS: 130
       
      Cluster B Document:	
      Cas: 150
      cvCas: 130
      importCas: 150
      _sync: <associates with CAS 150>
      

      Side Notes
      XDCR will need to issue a "SET" command instead of a regular "SET_WITH_META" command in this particular case. The Set command should use optimistic locking and only succeed if the target document has not changed.
      This will lead to minute accounting differences. For example, support often uses the number of "set with meta" ops to count the number of writes from an XDCR source, so this change would skew those stats.
      One more write will take place to perform CAS convergence because rollback is no longer in place.

      (Private Source Convo: https://couchbase.slack.com/archives/C05CG8PMNDS/p1704905875115799)


          People

            sumukh.bhat Sumukh Bhat
            neil.huang Neil Huang