Couchbase Server / MB-31570

[BP 5.5.3] - DCP stream may not send STREAM_END if non-infinity endSeqno and cursor dropping

Details

    • Type: Technical task
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: 5.5.3
    • Affects Version/s: 4.5.1, 4.6.0, 4.6.1, 4.6.2, 4.6.4, 5.0.0, 5.0.1, 5.1.0, 5.5.0, 5.5.1, 5.5.2
    • Component/s: couchbase-bucket, DCP

    Description

      (Backport of MB-31481 to 5.5.x)

      Summary

      KV-Engine may fail to send a STREAM_END after sending all the mutations for a given DCP stream request, if the DCP stream meets all of the following criteria:

      • The stream request has a specific endSeqno (i.e. not infinity).
      • KV-Engine encounters a high-memory situation and triggers cursor dropping on this stream while the stream is in the backfilling state.
      • The stream doesn't require streaming any more items (because the first backfill actually contained all required mutations).

      This can result in DCP clients hanging as they are waiting (forever) for the STREAM_END.

      Details

      The bug occurs in a specific scenario:

      1. The DCP producer is in the backfilling state.
      2. Cursor dropping is triggered - which causes the “reserved” cursor for that backfill to be discarded.
        1. Typically this means that when the current backfill completes, we’ll have to schedule a second backfill - as the CheckpointManager likely no longer contains the next seqno we need (given the cursor holding onto that item has been dropped).
        2. However in the bug scenario - say if no mutations are occurring, or if the backfill completes quickly - we successfully re-register the cursor and hence don’t need to do a second backfill.
      3. Finally, when we do re-register the cursor, if we find there are actually no more mutations requested (the backfill provided them all) then there’s no more work and the Stream can end.
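
The three outcomes in the scenario above can be sketched as a small decision function. This is a toy model written for illustration, not KV-Engine code: the names `NextStep`, `after_backfill`, `last_sent_seqno` and `oldest_seqno_in_memory` are hypothetical, and the order of the checks simply follows the order of the steps in this ticket.

```python
from enum import Enum


class NextStep(Enum):
    SECOND_BACKFILL = "second_backfill"   # next seqno is no longer in memory
    END_STREAM = "end_stream"             # nothing left to send: queue STREAM_END
    STREAM_FROM_MEMORY = "in_memory"      # cursor re-registered, keep streaming


def after_backfill(last_sent_seqno, end_seqno, oldest_seqno_in_memory):
    """Toy decision taken when a backfill completes after a cursor drop.

    All parameters are hypothetical names for this sketch:
      last_sent_seqno        - highest seqno the backfill delivered
      end_seqno              - endSeqno from the stream request
      oldest_seqno_in_memory - lowest seqno still held in checkpoints
    """
    next_needed = last_sent_seqno + 1
    # Step 2.1: the dropped cursor let memory move past the seqno we need.
    if next_needed < oldest_seqno_in_memory:
        return NextStep.SECOND_BACKFILL
    # Step 3: cursor re-registered, but the backfill already covered everything.
    if last_sent_seqno >= end_seqno:
        return NextStep.END_STREAM
    # Step 2.2: cursor re-registered and there is more to stream from memory.
    return NextStep.STREAM_FROM_MEMORY
```

The bug described in this ticket concerns only the `END_STREAM` outcome; the other two branches are included to show where it sits relative to the usual second-backfill path.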

      The bug is in how this scenario is handled. When the backfill completes, the cursor is re-registered and no more data turns out to be needed, we fail to notify the front-end of the final STREAM_END message. The message has been successfully generated, but it’s stuck on the readyQueue as the front-end doesn’t know to check for it.

      Investigation revealed the underlying cause of the bug:

      1. In the specific sequence above, we initially check for an item being ready (end of backfill phase) and don’t find one - so far so good.
      2. Then we advance to the next state (in-memory), identify that all requested seqnos have been found and push STREAM_END onto the readyQ.
      3. However, we don’t re-check the readyQ - as such the front-end incorrectly thinks there are no items ready yet, and hence goes to sleep (forever!).

      To address this we simply re-check for a response after attempting the second backfill.
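
The sequence and the fix can be illustrated with a minimal simulation. It is a sketch under stated assumptions, not the actual KV-Engine implementation: `ToyStream`, its fields and the `recheck_after_transition` flag are hypothetical names, and a return value of `None` stands in for "the front-end goes to sleep until it is notified".

```python
from collections import deque


class ToyStream:
    """Toy model of a DCP stream whose backfill has just drained."""

    def __init__(self, end_seqno, backfilled_up_to, recheck_after_transition):
        self.state = "backfilling"
        self.end_seqno = end_seqno
        self.last_read_seqno = backfilled_up_to
        self.ready_q = deque()
        # True models the fix, False models the buggy behaviour.
        self.recheck = recheck_after_transition

    def next(self):
        """Return the next message, or None if the front-end should sleep."""
        if self.state != "backfilling":
            return self.ready_q.popleft() if self.ready_q else None

        # 1. End of backfill phase: check for a ready item (none in the bug case).
        if self.ready_q:
            return self.ready_q.popleft()

        # 2. Advance to in-memory; everything requested was already sent,
        #    so STREAM_END is pushed onto the ready queue.
        self.state = "in-memory"
        if self.last_read_seqno >= self.end_seqno:
            self.ready_q.append("STREAM_END")

        # 3. The fix: re-check the queue before reporting "nothing ready".
        if self.recheck and self.ready_q:
            return self.ready_q.popleft()
        return None


buggy = ToyStream(end_seqno=100, backfilled_up_to=100, recheck_after_transition=False)
fixed = ToyStream(end_seqno=100, backfilled_up_to=100, recheck_after_transition=True)
buggy_result = buggy.next()   # None, with STREAM_END left on buggy.ready_q
fixed_result = fixed.next()   # "STREAM_END"
```

In this toy a second call to `buggy.next()` would return the queued message; in the scenario described above that second call never happens, because nothing wakes the front-end once it has been told there is nothing ready.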

      Vulnerable Versions

      This bug is believed[1] to affect all versions which support cursor dropping (MB-9897):

      • From 4.5.1 when cursor dropping was introduced up to and including 4.6.4 (cursor dropping was disabled in 4.6.5 - see MB-29483).
      • From 5.0.0 up to and including 5.1.0 (cursor dropping was disabled in 5.1.1 - see MB-29482).
      • From 5.5.0 upwards.

      [1] As of writing, only confirmed on 5.5.1.
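
For checking a deployment against the ranges listed above, a small helper along these lines may be useful. It is only a sketch: `in_listed_vulnerable_range` and `parse_version` are hypothetical names, and the 5.5.2 upper bound is an assumption taken from this ticket's 5.5.3 fix version. Release lines after 5.5.x are not covered by this ticket, so a `False` result means "not in a listed range", not "known to be safe".

```python
# (low, high) bounds, both inclusive, taken from the list above.
VULNERABLE_RANGES = [
    ((4, 5, 1), (4, 6, 4)),  # cursor dropping disabled in 4.6.5
    ((5, 0, 0), (5, 1, 0)),  # cursor dropping disabled in 5.1.1
    ((5, 5, 0), (5, 5, 2)),  # assumption: fixed in 5.5.3 by this backport
]


def parse_version(text):
    """Turn a dotted version string such as '5.5.1' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def in_listed_vulnerable_range(version):
    """Return True if the version falls inside one of the listed ranges."""
    parsed = parse_version(version)
    return any(low <= parsed <= high for low, high in VULNERABLE_RANGES)
```

For example, `in_listed_vulnerable_range("5.5.1")` is true, while `"4.6.5"` and `"5.1.1"` fall outside the listed ranges.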

            People

              christopher.farman christopher farman (Inactive)