Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-59746

ep-engine doesn't restore CAS after document is unlocked, causing LWW to chose wrong "winner"

    XMLWordPrintable

Details

    • Untriaged
    • 0
    • No
    • KV 2023-4

    Description

      Recall that when a document is locked via GetLocked, KV-Engine generates a new "locked CAS", stores it in memory, and returns it to the client for it to use in a subsequent Mutation(CAS=locked_CAS) or Unlock() operation. Any other attempts to read the CAS while the document is locked (Get(), GetMeta() etc) will return a special invalid CAS, encoded as all ones ({{0xffff_ffff_ffff_ffff}). See https://docs.couchbase.com/java-sdk/current/howtos/concurrent-document-mutations.html#pessimistic-locking for more details).

      However, when the document is unlocked without modification (lock times out, or an explicit Unlock() call is made), the previous CAS value is not restored. As such, the in-memory representation of the document has an artificially higher CAS than the document "really" has.

      When using LWW XDCR this is problematic; as it means that a document could have an artificially higher CAS than it "really" has, and hence conflict resolution could incorrectly fail to update the document. Consider the following scenario:

      1. A LWW XDCR relation from source S to destination D is established, and is performing optimistic conflict resolution.
      2. A Mutation operation occurs on the source Bucket {S}

        against the doc X, at time T=0.

      3. A GetLocked operation occurs on the destination Bucket D against the same doc, at a slightly later time - T=5.
      4. Due to XDCR processing lag / network latency, the SetWithMeta() arrives at D after the document has been locked - at T=10. KV-Engine's conflict resolver will compare the (locked) CAS at the destination (5) with the incoming CAS (0), and decide the destination should win - and the update will not be applied.

      (Note there's a variant of this issue also with pessimistic conflict resolution, if say the document was unlocked at T=6 - the document at D will will have a CAS of 5, and hence the GetMeta() call will return 5, and XDCR's pessimistic conflict resolution will decide that the destination document is newer and not issue the SetWithMeta).

      Outside the context of LWW XDCR, I believe the artificially higher CAS is benign - the CAS is only really used for optimistic / pessimistic locking, and they don't really care what the absolute CAS value is, just that requests specifying a CAS match what is in-memory on the node. The only issue I can think of is that one could see spurious optimistic locking failures for full-eviction buckets - if a document which had GetLocked called on it with a matching mutation (and hence CAS was advanced), then advanced CAS was read via Get() for optimistic locking, then document was evicted, then a subsequent SetWithCAS() operation would spuriously fail as the re-fetched document from disk would have a lower CAS.

      Solution

      • KV-Engine needs to keep a copy of the original (pre-lock) CAS, and restore the documents' CAS to this when the document is unlocked.

      The locked_CAS needs to be a unique value for each successful GetLock operation - a sequence of GetLocked(), Unlock(), GetLocked() should generate two different locked_CAS values (even if the document isn't modified), because otherwise
      an actor could accidentally reuse a previous locked_CAS and successfully modify the document, even though that initial lock period has ended.

      As such, KV-Engine currently generates the locked_CAS value by using the Hybrid-Logical Clock, requesting the next HLC value which would be generated (i.e. the current time if HLC in Real mode). This is all fine - however the bug occurs due to the fact that we re-use the original CAS field to store the generated locked_CAS. I assume this was originally done for space efficiency reasons - recall that CAS-based LWW was added later (CB v4.5.0 IIRC), so at original lock design time there was no consideration for CAS actually representing some time-based component - it was just an arbitrary number.

      It should be simple to add a second locked_CAS field to StoredValue to functionally fix this issue, however that has the downside of increasing the size of per-document in-memory metadata by 8 Bytes. We can look at a more (space) optimised implementation once we have it functional.

      Attachments

        Issue Links

          For Gerrit Dashboard: MB-59746
          # Subject Branch Project Status CR V

          Activity

            People

              ashwin.govindarajulu Ashwin Govindarajulu
              drigby Dave Rigby (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Gerrit Reviews

                  PagerDuty