Details
-
Bug
-
Status: Closed
-
Critical
-
Resolution: Fixed
-
4.5.1, 4.6.0, 4.6.2, 4.6.1, 4.6.4, 5.0.0, 5.1.0, 5.0.1, 5.5.0, 5.5.1, 5.5.2
-
Untriaged
-
No
Description
Summary
KV-Enging may fail to send a STREAM_END after sending all the mutations for a given DCP stream request, if the DCP stream meets all of the following criteria:
- The stream request has a specific endSeqno (i.e. not infinity).
- KV-Engine encounters a high memory situation and triggers cursor dropping on this stream, when it is in the backfilling state.
- The stream doesn't require streaming any more items (because the first backfill actually contained all required mutations).
This can result in DCP clients hanging as they are waiting (forever) for the STREAM_END.
Details
The bug occurs in a specific scenario:
- DCP producer in backfilling
- Cursor dropping triggered - which causes the “reserved” cursor for that backfill to be discarded.
- Typically this means that when the current backfill completes, we’ll have to schedule a second backfill - as the CheckpointManager likely no longer contains the next seqno we needed (given the cursor holding onto that item has been dropped).
- However in the bug scenario - say if no mutations are occurring, or if backfill quickly completes - we correctly re-register the cursor and hence don’t need to so a second backfill.
- Finally, when we do re-register the cursor, if we find there’s actually no more mutations requested (backfill provided them all) then there’s no more work and the Stream can end.
The bug is in how this scenario is handled. During the completion of the backfill; re-registering the cursor and then finding no more data needed, we fail to correctly inform the front-end of the final STREAM_END message. The message has been successfully generated, but it’s stuck on the readyQueue as the front-end doesn’t know to check for it.
Investigation revealed the underlying cause of the bug:
- In the specific sequence above, we initially check for an item begin ready (end of backfill phase) and don’t find one - s far so good.
- Then we advance to the next state (in-memory), identify all requested seqnos have been found and push STREAM_END onto the readyQ.
- However, we _don’_t re-check the readyQ - as such the front-end incorrectly thinks there’s no items ready yet; and hence goes to sleep (forever!).
To address this we simply re-check for a response after attempting the second backfill.
Vulnerable Versions
This bug is believed[1] to affect all versions which support cursor dropping (MB-9897):
- From 4.5.1 when cursor dropping was introduced up to and including 4.6.4 (cursor dropping was disabled in 4.6.5 - see MB-29483).
- From 5.0.0 up to and including 5.1.0 (cursor dropping was disabled in 5.1.1 - see MB-29482).
- From 5.5.0 upwards.
[1] As of writing, only confirmed on 5.5.1.
Attachments
Issue Links
- backports to
-
MB-31570 [BP 5.5.3] - DCP stream may not send STREAM_END if non-infinity endSeqno and cursor dropping
-
- Closed
-
- is duplicated by
-
MB-31432 cbexport may hang when export a bucket with many invalid json documents
-
- Closed
-
- relates to
-
MB-31580 Cbbackupmgr: Add timeout around DCP
-
- Open
-
- mentioned in
-
Page Loading...
Situation so far:
From the logs, we could see that the lack of a stream closed message directly correlated to the vBuckets that were causing various DCP users to hang (Log example shown below)
// DCP (Producer)
(vb 730) Creating stream with start seqno 0 and end seqno 179632; requested end seqno was 179632
(vb 730) Scheduling backfill from 1 to 179632, reschedule flag : False
(vb 730) Handling slow stream; state_ : backfilling, lastReadSeqno : 0, lastSentSeqno : 0, vBucketHighSeqno : 179633, isBackfillTaskRunning : True
(vb 730) Sending disk snapshot with start seqno 0 and end seqno 179634
(vb 730) Backfill complete, 0 items read from disk, 144 from memory, last seqno read: 179634, pendingBackfill : True
(vb 730) Did not schedule backfill with reschedule : True, tryBackfill : True; backfillStart : 179635, backfillEnd : 179634, flags_ : 32, start_seqno_ : 0, end_seqno_ : 179632, lastReadSeqno : 179634, lastSentSeqno : 179634, curChkSeqno : 179635, itemsReady : True
ActiveStream::transitionState: (vb 730) Transitioning from backfilling to in-memory
(vb 730) Stream closing, sent until seqno 179634 remaining items 0, reason: The stream ended due to all items being streamed
// MISSING MESSAGE:
(vb 730) Stream closed, X items sent from backfill phase, Y items sent from memory phase, Z was last seqno sent
From this, we identified that the reason for the code not entering this state to log that message was due to the stream not being notified when the DCP stream end message is in the ready queue. This was due to the itemsReady flag blocking access to notifyStreamReady inside active_stream.cc from this sequence of events. This issue also seems related to handling a slow stream, as without receiving that message the problematic path of code is never reached. This is likely due to the pendingBackfill flag being set to true after a slow stream being acknowledged.
A simple fix, to force the entry inside notifyStreamReady, has been proposed and seems to fix the immediate issue, but more importantly is to further understand the usage of itemsReady which is the root cause of this. The confusion with this is clarifying its difference to the readyQ being empty as it appears they are not identical. By clarifying this, and ensuring that itemsReady is used correctly throughout the code we can ensure that this and other potential related issues are solved.