  1. Couchbase Server
  2. MB-25398

Intermittent key not found error during xattr update

Details

    • Untriaged
    • Unknown

    Description

      Sync Gateway is intermittently getting a Key Not Found error when trying to update the system xattr on a tombstone. We've got a test that attempts the following for a large number of documents:
      1. Create docs via Sync Gateway (creates doc and xattr)
      2. Update the docs via SDK
      3. SG updates system xattr (import)
      4. Delete the docs via SDK
      5. SG updates system xattr (import)

      Step 5 fails rarely, and only for a single doc per run, possibly when the system is under load (e.g. when running this flow for 2000 docs, one doc fails).
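
      The five steps above can be sketched as a small driver that runs the flow per document and records which step failed. This is an illustrative sketch only: `step`, `failure`, `runFlow`, `errSimulated`, and the `doc-N` naming are hypothetical, and the step bodies are stubs rather than the actual Sync Gateway test or SDK calls.

```go
package main

import (
	"errors"
	"fmt"
)

// errSimulated is a hypothetical stand-in for the KeyNotFound seen in step 5.
var errSimulated = errors.New("key not found (simulated)")

// step is a hypothetical stand-in for one stage of the test flow.
type step struct {
	name string
	run  func(docID string) error
}

// failure records which document failed at which step.
type failure struct {
	docID string
	step  string
	err   error
}

// runFlow runs every step in order for each of n documents and collects
// the first failing step per document.
func runFlow(n int, steps []step) []failure {
	var failures []failure
	for i := 0; i < n; i++ {
		docID := fmt.Sprintf("doc-%d", i)
		for _, s := range steps {
			if err := s.run(docID); err != nil {
				failures = append(failures, failure{docID, s.name, err})
				break
			}
		}
	}
	return failures
}

func main() {
	ok := func(string) error { return nil }
	// Stub: fail the final import for one arbitrary document to mimic the report.
	flaky := func(id string) error {
		if id == "doc-1234" {
			return errSimulated
		}
		return nil
	}
	steps := []step{
		{"create via SG", ok},
		{"update via SDK", ok},
		{"SG import (update)", ok},
		{"delete via SDK", ok},
		{"SG import (delete)", flaky},
	}
	for _, f := range runFlow(2000, steps) {
		fmt.Printf("%s failed at %q: %v\n", f.docID, f.step, f.err)
	}
}
```

      Recording the failing step alongside the document ID makes it easier to match a failure against the corresponding request in a packet capture.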

      Step 5 performs a multi-part subdoc operation that upserts the xattr and stamps the cas onto it via macro expansion, as in this gocb excerpt:

      _, err = bucket.MutateInEx(k, gocb.SubdocDocFlagAccessDeleted, gocb.Cas(cas), uint32(exp)).
          UpsertEx("_sync", xv, gocb.SubdocFlagXattr). // Update the xattr
          UpsertEx("_sync.cas", "${Mutation.CAS}", gocb.SubdocFlagXattr|gocb.SubdocFlagUseMacros). // Stamp the cas on the xattr
          Execute()

      This succeeds for 99.9% of the docs in the test, but fails sporadically with a KeyNotFound. We've been through the SG code and are fairly confident that there is no race condition in it, and that the document does exist on the server at the time the request is made.

      Additional details in this comment, including a link to a pcap for the entire test, and a pcap excerpt for the failing operation.

      Given that this seems to only be reproducible under load, it feels a bit like the server is returning KeyNotFound instead of TMPFAIL, but we haven't been able to identify whether that's actually what's going on. SG has retry handling in place for TMPFAIL, but the KeyNotFound is causing SG to error out of the import.
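
      To illustrate why the status code matters here, the following is a minimal sketch of retry handling that retries only on a temporary failure and returns every other error immediately. It is not Sync Gateway's actual retry code: `errTmpFail`, `errKeyNotFound`, and `retryOnTmpFail` are hypothetical names standing in for the SDK's status errors and SG's retry loop.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Hypothetical sentinel errors standing in for the SDK's status errors.
var (
	errTmpFail     = errors.New("temporary failure")
	errKeyNotFound = errors.New("key not found")
)

// retryOnTmpFail calls op up to maxAttempts times, doubling the backoff
// between attempts, but only while op reports errTmpFail. Any other
// result (success or a different error such as errKeyNotFound) is
// returned immediately along with the number of attempts made.
func retryOnTmpFail(maxAttempts int, backoff time.Duration, op func() error) (int, error) {
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		err = op()
		if !errors.Is(err, errTmpFail) {
			return attempt, err
		}
		if attempt < maxAttempts {
			time.Sleep(backoff)
			backoff *= 2
		}
	}
	return maxAttempts, err
}

func main() {
	// A temporary failure that clears on the third attempt is absorbed.
	calls := 0
	n, err := retryOnTmpFail(5, time.Millisecond, func() error {
		calls++
		if calls < 3 {
			return errTmpFail
		}
		return nil
	})
	fmt.Println(n, err)

	// A key-not-found error is not retried and surfaces after one attempt.
	n, err = retryOnTmpFail(5, time.Millisecond, func() error { return errKeyNotFound })
	fmt.Println(n, err)
}
```

      Under this scheme, a transient condition reported as KeyNotFound bypasses the retry path entirely, which matches the import failure described above.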

      Attachments

        1. 468.couch.1
          24 kB
        2. 589.couch.1
          36 kB
        3. pcap18.pcap
          29.71 MB
        4. Screen Shot 2017-07-25 at 12.33.24.png
          67 kB
        5. Screen Shot 2017-07-25 at 12.39.47.png
          73 kB
        6. tracecontainer06.pcap
          24.60 MB
        7. tracecontainer45.pcap
          9.54 MB
        8. tracecontainer46.pcap
          3.82 MB
        9. xattr.go
          2 kB

            People

              adamf Adam Fraser