  1. Couchbase Server
  2. MB-52793

Replica can persist deletion with xattr datatype but empty value


    Description

      TL;DR: If, in a value eviction bucket, subdoc is used to remove a deleted item's xattrs (e.g., by SyncGateway), a replica may persist an item with the xattr datatype still set, but an empty body. This can cause memcached crashes later down the line when that item is read from disk, e.g., for a DCP backfill.

      So far, this seems most likely to affect clusters with SyncGateway and one or more replicas. Actors other than SG may be able to cause this, and with tight timing it could also be triggered during a vbucket move as part of a rebalance with no replicas.


      Setup

      This issue becomes possible if the HashTable on a replica holds a deleted item with xattrs, with a non-resident body.

      There are a handful of routes to this point. The easiest is:

      1. VBucket starts as active.
      2. Has an item with system xattrs (e.g., _sync) deleted. This removes it from the HT too.
      3. Has that item bgfetched back into the HT (e.g., to serve a request to read the _sync xattr).
      4. Has the value for the delete evicted through the standard value eviction under memory pressure. This delete's metadata can now persist in the HT indefinitely.
      5. Later transitions to replica. At this stage the HashTable looks like:

        HashTable[0x119276020] with numItems:1 numInMemory:1 numDeleted:1 numNonResident:0 numTemp:0 numSystemItems:0 numPreparedSW:0 values: 
             SV @0x11857ea80 X.. .D..Cm temp:    seq:1 rev:1 cas:1234 key:"cid:0x0:key, size:4" exp:1657279495 age:0 nru:2 fc:0 vallen:0
                             ^    ^^                                                                                                   ^
                             |    ||                                                                                                   |
               Datatype::XATTR    ||                                                      In-memory value length is zero as not resident
                            Deleted|
                                   |
            Not Resident (bit clear)
        

      As noted in MB-50423, the deleted item's metadata can remain in the HashTable for an arbitrarily long time - until overwritten, memcached restarts, or the vbucket is deleted.
      The vbucket is currently "correct", but now primed for the actual issue to occur...

      Issue

      6. Subdoc is used to remove the _sync system xattr from that document. If that was the only xattr, the item's datatype should now transition to RAW_BYTES, as it has no xattrs at all.

      The active vbucket will manage this correctly, persisting a deleted item, RAW_BYTES, no value.

      However, when this change is replicated over DCP, deleteWithMeta will be used, and will encounter the primed deleted item with a non-resident value. The deleteWithMeta code path will skip several checks which are only relevant to active vbs, and will attempt to delete the stored value, reaching:

      bool StoredValue::deleteImpl(DeleteSource delSource) {
          if (isDeleted() && !getValue()) {
              // SV is already marked as deleted and has no value - no further
              // deletion possible.
              return false;
          }
       
          resetValue();
          setDatatype(PROTOCOL_BINARY_RAW_BYTES);
          setPendingSeqno();
       
          setDeletedPriv(true);
          setDeletionSource(delSource);
          markDirty();
       
          return true;
      }
      

      Unfortunately, the StoredValue is both deleted, and does not have a value in memory. This leads to an early exit, and skips updating several attributes, including the datatype.
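
      The early exit can be reproduced with a heavily simplified model. This is only a sketch: ModelStoredValue, makePrimedItem and demoPrimedDelete are hypothetical names, not kv_engine types, and the datatype values are assumed to mirror the protocol constants. A non-resident value is modelled as an empty std::optional.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <string>

// Hypothetical stand-in for the datatype field (values assumed to match
// PROTOCOL_BINARY_RAW_BYTES and PROTOCOL_BINARY_DATATYPE_XATTR).
enum class Datatype : std::uint8_t { RawBytes = 0x00, Xattr = 0x04 };

// Hypothetical, simplified model of a StoredValue.
struct ModelStoredValue {
    bool deleted = false;
    bool dirty = false;
    Datatype datatype = Datatype::RawBytes;
    std::optional<std::string> value; // nullopt models a non-resident value

    // Mirrors the control flow of the deleteImpl quoted above.
    bool deleteImpl() {
        if (deleted && !value) {
            // Early exit: nothing below runs, so datatype is left untouched.
            return false;
        }
        value.reset();
        datatype = Datatype::RawBytes;
        deleted = true;
        dirty = true;
        return true;
    }
};

// Builds the "primed" state from the Setup section:
// deleted, XATTR datatype, value evicted from memory.
inline ModelStoredValue makePrimedItem() {
    ModelStoredValue sv;
    sv.deleted = true;
    sv.datatype = Datatype::Xattr;
    sv.value = std::nullopt;
    return sv;
}

// Usage: deleting the primed item reports "no change" and keeps XATTR set.
inline void demoPrimedDelete() {
    ModelStoredValue sv = makePrimedItem();
    const bool changed = sv.deleteImpl();
    assert(!changed);
    assert(sv.datatype == Datatype::Xattr);
    (void)changed;
}
```

      In this model, the same call on a deleted item whose value is still resident does reset the datatype, which is why the problem only shows up once the value has been evicted.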

      The deleteWithMeta path then reaches EPVBucket::softDeleteStoredValue, which calls queueDirty. This queues the item state for persistence, taken from the SV. As a result, the replica persists a deleted item with datatype xattr and no value. This is invalid, as items with datatype xattr are expected to have actual xattrs in the value - various codepaths trust this assumption.

      For example, the state of the HashTable after the second delete (removing the system xattr) would be:

      HashTable[0x116a0f020] with numItems:1 numInMemory:1 numDeleted:1 numNonResident:0 numTemp:0 numSystemItems:0 numPreparedSW:0 values: 
           SV @0x115d16a80 X.. WD..Cm temp:    seq:2 rev:1 cas:5678 key:"cid:0x0:key, size:4" exp:0 age:0 nru:2 fc:0 vallen:0
                           ^   ^^^                                                                                          ^
                           /   |||                                                             In-memory value length is zero
             Datatype::XATTR   |||
                               /||
                Written (dirty) ||
                                /| 
                         Deleted |
                                 /
         Not Resident (bit clear)
      

      The issue here is that we have a dirty (written) item, which by definition must have its value present. Here, however, the value has zero length, yet the datatype is still XATTR rather than RAW_BYTES.
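
      The violated invariant can be written down as a small predicate. This is a sketch under assumptions: isPersistableState and kDatatypeXattr are hypothetical names, the bit value 0x04 is assumed to match PROTOCOL_BINARY_DATATYPE_XATTR, and this is not a description of how kv_engine validates items.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed to match PROTOCOL_BINARY_DATATYPE_XATTR.
constexpr std::uint8_t kDatatypeXattr = 0x04;

// Hypothetical check: an item flagged as XATTR must carry a non-empty value,
// because readers will try to parse an xattr section out of it.
constexpr bool isPersistableState(std::uint8_t datatype, std::size_t valueLen) {
    const bool hasXattrBit = (datatype & kDatatypeXattr) != 0;
    return !hasXattrBit || valueLen > 0;
}

// The state shown in the HashTable dump: XATTR datatype, vallen:0.
static_assert(!isPersistableState(kDatatypeXattr, 0),
              "XATTR with an empty value must be rejected");
// What the active persists: RAW_BYTES, no value.
static_assert(isPersistableState(0x00, 0),
              "RAW_BYTES with an empty value is a valid delete");
```

      A check of this shape at the point where the item is queued would flag the replica's state, while accepting the RAW_BYTES delete that the active persists.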

      Fallout

      The bad state on disk can go unnoticed for some time. If nothing else occurs within the purge interval, the deleted item may be cleaned up without ever causing visible symptoms. However, if the vbucket becomes active, any subsequent use of that item is quite likely to cause issues.

      E.g., streaming all items from disk over DCP may crash memcached with:

      memcached<0.17628.0>: 2022-06-16T10:43:19.365642+02:00 CRITICAL Caught unhandled std::exception-derived exception. what(): GSL: Precondition failure at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/xattr/utils.cc: 133
      memcached<0.17628.0>: terminate called after throwing an instance of 'gsl::fail_fast'
      memcached<0.17628.0>:   what():  GSL: Precondition failure at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/xattr/utils.cc: 133
      

      utils.cc (lines 132-138)

      uint32_t get_body_offset(const cb::const_char_buffer& payload) {
          Expects(payload.size() > 0); // line 133: the failing precondition
          const uint32_t* lenptr = reinterpret_cast<const uint32_t*>(payload.buf);
          auto len = ntohl(*lenptr);
          check_len(len, payload.size());
          return len + sizeof(uint32_t);
      }
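
      To illustrate why the empty value trips this function, here is a standalone sketch of the same layout (a 4-byte big-endian length prefix, followed by the xattr section, followed by the body). tryGetBodyOffset and demoEmptyPayload are hypothetical names, and the bounds check only approximates what check_len is assumed to do. It returns an empty optional instead of failing a precondition.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string_view>

// Hypothetical defensive variant of get_body_offset.
inline std::optional<std::size_t> tryGetBodyOffset(std::string_view payload) {
    // An empty (or too short) payload cannot hold the 4-byte length prefix.
    if (payload.size() < sizeof(std::uint32_t)) {
        return std::nullopt;
    }
    // Decode the big-endian length prefix byte by byte (portable ntohl).
    std::uint32_t len = 0;
    for (std::size_t i = 0; i < sizeof(std::uint32_t); ++i) {
        len = (len << 8) | static_cast<std::uint8_t>(payload[i]);
    }
    // The xattr section must fit inside the remaining payload.
    if (len > payload.size() - sizeof(std::uint32_t)) {
        return std::nullopt;
    }
    return static_cast<std::size_t>(len) + sizeof(std::uint32_t);
}

// Usage: the zero-length value persisted by the replica has no valid offset.
inline void demoEmptyPayload() {
    assert(!tryGetBodyOffset(std::string_view{}).has_value());
}
```

      The item persisted by the replica corresponds to the empty-payload case: the datatype tells the reader to expect a length prefix, but there are no bytes to read.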
      

      Summary

      What seems to be quite routine SyncGateway behaviour (removing the _sync xattr) can lead to bad state on disk on a replica. This may be occurring quite frequently, only becoming a visible issue if that vbucket becomes active, or streams items over DCP while remaining a replica (as far as I'm aware, only views stream from replicas).

            People

              ashwin.govindarajulu Ashwin Govindarajulu
              james.harrison James Harrison (Inactive)