Details
- Type: Bug
- Resolution: Fixed
- Priority: Critical
- Affects Version/s: 7.0.0, 7.0.1, 7.0.2, 7.0.3, 7.0.4, 7.0.5, 7.0.6, 7.1.0, 7.1.1, 7.1.2, 7.1.3, 7.1.4, 7.1.5, 7.1.6, 7.1.7, 7.2.0, 7.2.1, 7.2.2, 7.2.3, 7.2.4, 7.2.5, 7.2.6, 7.6.0, 7.6.1, 7.6.2, 7.6.4
- Untriaged
- 0
- Unknown
Description
The following text describes an issue reproduced on master, with a subtle difference for the 7.x branch (explained after the repro steps).
This bug affects vbucket moves or, from the KV perspective, any case where a vbucket is deleted, or set to dead and then deleted.
Note that the bug exists whether ns_server explicitly deletes the active vbucket (which internally sets it to dead) or first sets its state to dead and then deletes it. Call these Case1 and Case2 respectively.
Since master is where I initially reproduced the issue, let's first follow the master code.
In both cases the active vbucket is set to dead, and both execute setVBucketState_UNLOCKED to make that change:
- https://github.com/couchbase/kv_engine/blob/59fe77df45988e6929053bec92fdb84370d42d72/engines/ep/src/kv_bucket.cc#L944
- From Case1 via the delete vbucket request -> https://github.com/couchbase/kv_engine/blob/59fe77df45988e6929053bec92fdb84370d42d72/engines/ep/src/kv_bucket.cc#L1164
- From Case2 via the set-state(dead) request -> https://github.com/couchbase/kv_engine/blob/59fe77df45988e6929053bec92fdb84370d42d72/engines/ep/src/kv_bucket.cc#L928
When changing away from active, setVBucketState_UNLOCKED sets up a NonIO task, RespondAmbiguousNotification, which should unblock all waiting cookies by responding with sync_write_ambiguous.
Next, both Case1 and Case2 delete the VBucket. There are a few steps involved, but it begins here.
In summary: the VBucket is managed via a std::shared_ptr, and when the last reference to it is released, a delete function runs and schedules a new task to actually delete the vbucket. This task is called VBucketMemoryDeletionTask; it is an AuxIO task for persistent buckets and a NonIO task for ephemeral buckets.
The bug becomes apparent when we look at the run loop of RespondAmbiguousNotification and ask why it could fail to notify any cookies: there is an early return.
The run function obtains a std::shared_ptr<VBucket> by upgrading its std::weak_ptr. That upgrade fails if the VBucket has already been deleted, which happens if the two tasks run in the following order:
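A minimal C++ sketch of the "leaving active" branch, for illustration only. It assumes (this is not the actual kv_engine code) that the blocked SyncWrite cookies are collected at state-change time and handed to the deferred task, and that the task keeps only a std::weak_ptr to the VBucket. VBState, VBucket, setStateSketch and the vector-based queue are hypothetical stand-ins.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <memory>
#include <utility>
#include <vector>

using Cookie = std::uint64_t;
enum class VBState { Active, Replica, Pending, Dead };

// Hypothetical minimal VBucket holding the cookies of in-flight SyncWrites.
struct VBucket {
    VBState state = VBState::Active;
    std::vector<Cookie> blockedSyncWrites;
};

// Hypothetical NonIO queue entry: a task that runs later.
using Task = std::function<void()>;

// Sketch of the "leaving active" branch of a state change. Assumption: the
// blocked cookies are moved into a deferred task that will answer them with
// sync_write_ambiguous; the task captures only a weak_ptr to the vbucket.
void setStateSketch(const std::shared_ptr<VBucket>& vb, VBState to,
                    std::vector<Task>& nonIoQueue,
                    std::vector<Cookie>& ambiguousResponses) {
    assert(vb != nullptr);
    const VBState from = vb->state;
    vb->state = to;
    if (from != VBState::Active || to == VBState::Active) {
        return; // only a transition away from active schedules the task
    }
    std::weak_ptr<VBucket> weak = vb;
    std::vector<Cookie> cookies = std::move(vb->blockedSyncWrites);
    vb->blockedSyncWrites.clear(); // moved-from vector: make it empty
    nonIoQueue.push_back([weak, cookies, &ambiguousResponses] {
        const auto locked = weak.lock();
        if (!locked) {
            return; // vbucket already destroyed: nobody is notified
        }
        // Stand-in for responding sync_write_ambiguous to each cookie.
        ambiguousResponses.insert(ambiguousResponses.end(), cookies.begin(),
                                  cookies.end());
    });
}
```

Running the queued task while the VBucket is still alive records a response for every blocked cookie; a dead-to-dead transition schedules nothing.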
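The deferred deletion can be sketched with a std::shared_ptr custom deleter that queues the real delete instead of freeing inline. TaskQueueSketch and makeVBucket are hypothetical names, and the queued lambda merely stands in for VBucketMemoryDeletionTask; the actual kv_engine deleter and task differ. Note that, as with any std::shared_ptr, a std::weak_ptr to the object reports expired as soon as the last strong reference is released, even though in this sketch the memory is only freed when the queued task runs.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <memory>
#include <utility>

// Hypothetical single-threaded queue standing in for the AuxIO/NonIO pools.
struct TaskQueueSketch {
    std::deque<std::function<void()>> tasks;

    void schedule(std::function<void()> task) {
        tasks.push_back(std::move(task));
    }

    // Runs the oldest queued task.
    void runNext() {
        assert(!tasks.empty());
        auto task = std::move(tasks.front());
        tasks.pop_front();
        task();
    }
};

// Hypothetical minimal VBucket.
struct VBucket {
    int id;
};

// When the last shared_ptr to the VBucket is released, the custom deleter
// does not free it inline; it schedules a deletion task instead.
std::shared_ptr<VBucket> makeVBucket(int id, TaskQueueSketch& queue) {
    return std::shared_ptr<VBucket>(new VBucket{id}, [&queue](VBucket* vb) {
        queue.schedule([vb] {
            delete vb; // stand-in for VBucketMemoryDeletionTask's work
        });
    });
}
```

The queue must outlive every VBucket created through makeVBucket, because the deleter holds a reference to it.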
- VBucketMemoryDeletionTask -> Delete the VBucket
- RespondAmbiguous -> Fail to get VBucket so skip notify
If the bucket is then shut down, the shutdown gets stuck waiting for those cookies to be notified.
This has been reproduced with a single-threaded unit test that forces the task ordering, and it does indeed leave blocked cookies.
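The effect of the forced ordering can be modelled in a small single-threaded harness. This is a hypothetical sketch, not the kv_engine unit test: runScenario is an illustrative name, the delete path and VBucketMemoryDeletionTask are collapsed into one step that releases the last strong reference, and "responding" to a cookie simply removes it from the blocked set.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <set>
#include <utility>
#include <vector>

using Cookie = std::uint64_t;

// Hypothetical minimal VBucket; only its lifetime matters here.
struct VBucket {
    int id;
};

// Runs the two tasks in a forced order and returns how many cookies are
// still blocked afterwards (hypothetical model, not kv_engine code).
std::size_t runScenario(bool deletionFirst) {
    std::set<Cookie> blocked{1, 2, 3}; // cookies waiting on SyncWrites
    const std::vector<Cookie> cookies(blocked.begin(), blocked.end());

    auto vb = std::make_shared<VBucket>(VBucket{0});
    const std::weak_ptr<VBucket> weak = vb;

    // Stand-in for RespondAmbiguousNotification: it holds only a weak_ptr.
    auto respondAmbiguous = [&blocked, weak, cookies] {
        const auto locked = weak.lock();
        if (!locked) {
            return; // early return: no cookie gets a response
        }
        for (const Cookie c : cookies) {
            blocked.erase(c); // stands in for responding sync_write_ambiguous
        }
    };

    // Stand-in for the delete path plus VBucketMemoryDeletionTask: it owns the
    // last strong reference and releases it, destroying the VBucket.
    auto deleteVBucket = [owned = std::move(vb)]() mutable { owned.reset(); };

    // Force the task ordering, as a single-threaded test can.
    if (deletionFirst) {
        deleteVBucket();
        respondAmbiguous();
    } else {
        respondAmbiguous();
        deleteVBucket();
    }
    return blocked.size();
}
```

runScenario(false) leaves no cookies blocked, while runScenario(true) leaves all three blocked, because the weak_ptr upgrade fails and the task returns early.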
Now what about 7.x? There the problem is even more obvious. In Case1 (just a delete of the vbucket), the 7.x code does not call setVBucketState_UNLOCKED, so RespondAmbiguousNotification is never scheduled on that path at all!
The call to setVBucketState_UNLOCKED on the delete path was only added later, in 7.6, via MB-54976.
Gerrit Dashboard: MB-61465

| # | Subject | Branch | Project | Status | CR | V |
|---|---|---|---|---|---|---|
| 208527,27 | Adding functional test for MB-61465 | neo | TAF | NEW | 0 | 0 |