Loading...

XML

Word

Printable

Details

Type: Bug
Resolution: Fixed
Priority: Critical
Fix Version/s: 7.1.0
Affects Version/s: 7.1.0
Component/s: couchbase-bucket
Labels:
- request-dev-verify

Triage:
Untriaged
Story Points:
1
Is this a Regression?:
Unknown
Sprint:
KV March-22

Description

Summary

During Magma testing by a customer, a crash was observed during KVBucket::deleteVBucket when attempting to dereference a null VBucketPtr.

Details

This appears to be a latent race condition, where the function performs an initial initial check for a null pointer (and returns early), but we do not check again after acquiring an exclusively locked VBucketPtr. This means that if another thread also is processing a DEL_VBUCKET command, then the state could change during the function's execution:

cb::engine_errc KVBucket::deleteVBucket(Vbid vbid, const CookieIface* c) {

    // Lock to prevent a race condition between a failed update and add

    // (and delete).

    VBucketPtr vb = vbMap.getBucket(vbid);

    if (!vb) {

        return cb::engine_errc::not_my_vbucket;

        std::unique_lock<std::mutex> vbSetLh(vbsetMutex);

        // Obtain a locked VBucket to ensure we interlock with other

        // threads that are manipulating the VB (particularly ones which may

        // try and change the disk revision e.g. deleteAll and compaction).

        auto lockedVB = getLockedVBucket(vbid);

        vbMap.setState(*lockedVB, vbucket_state_dead, nullptr);

        getRWUnderlying(vbid)->abortCompactionIfRunning(lockedVB.getLock(),

                                                        vbid);

        engine.getDcpConnMap().vbucketStateChanged(vbid, vbucket_state_dead);

        // Drop the VB to begin the delete, the last holder of the VB will

        // unknowingly trigger the destructor which schedules a deletion task.

        vbMap.dropVBucketAndSetupDeferredDeletion(vbid, c);

...

Note that we check for a null VBucketPtr initially at lines 1138-1133, however we do not re-check after taking a locked copy of the VBucketPtr at line 1140.

Steps to reproduce

Start flushing a given vBucket to disk. This needs to be "slow" - i.e. not finish until step 3 has occurred.
Issue a DEL_VBUCKET request - this will be blocked waiting to acquire the vb_mutexes element for this vBucket.
Issue a second DEL_VBUCKET request (for the same vbid) - this will also be blocked waiting on the same mutex.
Allow VBucket flush to complete.

After (4), the first DEL_VBUCKET request can run, and should successfully delete the vBucket. However then that finishes, the second DEL_VBUCKET request is allowed to run, which attempts to delete a VBucketPtr which has already been set to null -> crash.

Note that steps (2) and (3) can be triggered in a full cluster setup by ns_server attempting to delete a VBucket (e.g. as a result of rebalance-out), if the first DEL_VBUCKET call takes longer than ns_server's timeout (value TBD) - this causes ns_server to retry the operation and issue the second DEL_VBUCKET.

Attachments

Gerrit Reviews

- Issue Only
- Show All Reviews
- Show Open Reviews
- Show All Issues
- Show Open Issues

No reviews matched the request. Check your Options in the drop-down menu of this sections header.

Activity

People

Assignee:: Daniel Owen

Reporter:: Dave Rigby (Inactive)

Votes:: 0 Vote for this issue

Watchers:: 4 Start watching this issue

Dates

Due:: 11/Mar/22

Created:: 11/Mar/22 1:12 AM

Updated:: 24/Mar/22 9:53 AM

Resolved:: 11/Mar/22 7:21 AM

Gerrit Reviews

There are no open Gerrit changes

Show There are 3 closed Gerrit changes

Hide There are 3 closed Gerrit changes

MB-51391: Avoid crash when attempting to double-delete vBucket: Gerrit Review:

MB-51391: Check that null VBucketPtr is not dereferenced: Gerrit Review:

Merge branch 'neo' into 'master': Gerrit Review:

Potential race leading to crash if KVBucket::deleteVBucket called multiple times for same vb

Details

Description

Summary

Details

Steps to reproduce

Attachments

Gerrit Reviews

Activity

People

Dates

Gerrit Reviews

PagerDuty