MB-31481 to 5.5.x)
KV-Engine may fail to send a STREAM_END after sending all the mutations for a given DCP stream request, if the DCP stream meets all of the following criteria:
- The stream request has a specific endSeqno (i.e. not infinity).
- KV-Engine encounters a high memory situation and triggers cursor dropping on this stream, when it is in the backfilling state.
- The stream doesn't require streaming any more items (because the first backfill actually contained all required mutations).
This can result in DCP clients hanging as they are waiting (forever) for the STREAM_END.
The bug occurs in a specific scenario:
- The DCP producer is in the backfilling state.
- Cursor dropping triggered - which causes the “reserved” cursor for that backfill to be discarded.
- Typically this means that when the current backfill completes, we’ll have to schedule a second backfill - as the CheckpointManager likely no longer contains the next seqno we needed (given the cursor holding onto that item has been dropped).
- However, in the bug scenario - say if no mutations are occurring, or if the backfill completes quickly - we successfully re-register the cursor and hence don’t need to do a second backfill.
- Finally, when we do re-register the cursor, if we find there are actually no more mutations requested (the backfill provided them all) then there’s no more work and the Stream can end.
The bug is in how this scenario is handled. When we complete the backfill, re-register the cursor and then find that no more data is needed, we fail to notify the front-end of the final STREAM_END message. The message has been successfully generated, but it’s stuck on the readyQueue as the front-end doesn’t know to check for it.
Investigation revealed the underlying cause of the bug:
- In the specific sequence above, we initially check for an item being ready (end of backfill phase) and don’t find one - so far so good.
- Then we advance to the next state (in-memory), identify that all requested seqnos have been sent and push STREAM_END onto the readyQ.
- However, we _don’t_ re-check the readyQ - as such the front-end incorrectly thinks there are no items ready yet, and hence goes to sleep (forever!).
To address this we simply re-check for a response after attempting the second backfill.
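The sequence above can be illustrated with a toy model. This is only a sketch in Python, not KV-Engine code: the class ToyStream, its fields and the recheck_ready_queue flag are hypothetical names invented for illustration, and the real stream state machine is considerably more involved.

```python
from collections import deque


class ToyStream:
    """Toy model of a DCP stream (hypothetical; not the real KV-Engine class)."""

    def __init__(self, recheck_ready_queue):
        self.state = "backfilling"
        self.ready_q = deque()
        # What the front-end consults to decide whether to sleep.
        self.items_ready = False
        # True models the fixed behaviour, False models the bug.
        self.recheck_ready_queue = recheck_ready_queue

    def _pop(self):
        return self.ready_q.popleft() if self.ready_q else None

    def _in_memory_phase(self):
        # Assumption: the backfill already delivered every requested seqno,
        # so the only thing left to do is end the stream.
        self.ready_q.append("STREAM_END")
        self.state = "dead"

    def next(self):
        # 1. End-of-backfill check: nothing is ready yet.
        response = self._pop()
        if response is None and self.state == "backfilling":
            # 2. Advance to in-memory, which pushes STREAM_END.
            self.state = "in-memory"
            self._in_memory_phase()
            # 3. The fix: look at the ready queue again.
            if self.recheck_ready_queue:
                response = self._pop()
        self.items_ready = response is not None
        return response


buggy = ToyStream(recheck_ready_queue=False)
print(buggy.next(), list(buggy.ready_q))   # message is queued but not returned

fixed = ToyStream(recheck_ready_queue=True)
print(fixed.next(), list(fixed.ready_q))   # message is returned to the caller
```

In the buggy variant the caller receives nothing and items_ready stays False even though STREAM_END is sitting in the queue, which is the condition under which the front-end would sleep with no further notification to wake it.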
This bug is believed to affect all versions which support cursor dropping:
- From 4.5.1 when cursor dropping was introduced up to and including 4.6.4 (cursor dropping was disabled in 4.6.5 - see MB-29483).
- From 5.0.0 up to and including 5.1.0 (cursor dropping was disabled in 5.1.1 - see MB-29482).
- From 5.5.0 upwards.
At the time of writing, it has only been confirmed on 5.5.1.
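For anyone scripting a check against the ranges listed above, a minimal sketch follows. The helper names (parse_version, believed_affected) are hypothetical, the version strings are assumed to be plain "major.minor.patch", and the last range is left open-ended because no fixed release is named here. It also assumes releases after 4.6.5 and 5.1.1 on those lines keep cursor dropping disabled, which the text above does not state explicitly.

```python
def parse_version(text):
    """Turn '5.5.1' into (5, 5, 1) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


# (first affected, last affected) pairs taken from the list above.
# None means no upper bound is known.
AFFECTED_RANGES = [
    ((4, 5, 1), (4, 6, 4)),
    ((5, 0, 0), (5, 1, 0)),
    ((5, 5, 0), None),
]


def believed_affected(version_text):
    """Return True if the version falls in a range believed to be affected."""
    version = parse_version(version_text)
    for low, high in AFFECTED_RANGES:
        if version >= low and (high is None or version <= high):
            return True
    return False


for candidate in ["4.5.0", "4.6.4", "4.6.5", "5.1.0", "5.1.1", "5.5.1"]:
    print(candidate, believed_affected(candidate))
```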