Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-51391

Potential race leading to crash if KVBucket::deleteVBucket called multiple times for same vb

    XMLWordPrintable

Details

    • Untriaged
    • 1
    • Unknown
    • KV March-22

    Description

      Summary

      During Magma testing by a customer, a crash was observed during KVBucket::deleteVBucket when attempting to dereference a null VBucketPtr.

      Details

      This appears to be a latent race condition, where the function performs an initial initial check for a null pointer (and returns early), but we do not check again after acquiring an exclusively locked VBucketPtr. This means that if another thread also is processing a DEL_VBUCKET command, then the state could change during the function's execution:

      cb::engine_errc KVBucket::deleteVBucket(Vbid vbid, const CookieIface* c) {
          // Lock to prevent a race condition between a failed update and add
          // (and delete).
          VBucketPtr vb = vbMap.getBucket(vbid);
          if (!vb) {
              return cb::engine_errc::not_my_vbucket;
          }
       
          {
              std::unique_lock<std::mutex> vbSetLh(vbsetMutex);
              // Obtain a locked VBucket to ensure we interlock with other
              // threads that are manipulating the VB (particularly ones which may
              // try and change the disk revision e.g. deleteAll and compaction).
              auto lockedVB = getLockedVBucket(vbid);
              vbMap.setState(*lockedVB, vbucket_state_dead, nullptr);
              getRWUnderlying(vbid)->abortCompactionIfRunning(lockedVB.getLock(),
                                                              vbid);
              engine.getDcpConnMap().vbucketStateChanged(vbid, vbucket_state_dead);
       
              // Drop the VB to begin the delete, the last holder of the VB will
              // unknowingly trigger the destructor which schedules a deletion task.
              vbMap.dropVBucketAndSetupDeferredDeletion(vbid, c);
          }
          ...
      

      Note that we check for a null VBucketPtr initially at lines 1138-1133, however we do not re-check after taking a locked copy of the VBucketPtr at line 1140.

      Steps to reproduce

      1. Start flushing a given vBucket to disk. This needs to be "slow" - i.e. not finish until step 3 has occurred.
      2. Issue a DEL_VBUCKET request - this will be blocked waiting to acquire the vb_mutexes element for this vBucket.
      3. Issue a second DEL_VBUCKET request (for the same vbid) - this will also be blocked waiting on the same mutex.
      4. Allow VBucket flush to complete.

      After (4), the first DEL_VBUCKET request can run, and should successfully delete the vBucket. However then that finishes, the second DEL_VBUCKET request is allowed to run, which attempts to delete a VBucketPtr which has already been set to null -> crash.

      Note that steps (2) and (3) can be triggered in a full cluster setup by ns_server attempting to delete a VBucket (e.g. as a result of rebalance-out), if the first DEL_VBUCKET call takes longer than ns_server's timeout (value TBD) - this causes ns_server to retry the operation and issue the second DEL_VBUCKET.

      Attachments

        No reviews matched the request. Check your Options in the drop-down menu of this sections header.

        Activity

          People

            owend Daniel Owen
            drigby Dave Rigby (Inactive)
            Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Gerrit Reviews

                There are no open Gerrit changes

                PagerDuty