Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-31481

DCP stream may not send STREAM_END if non-infinity endSeqno and cursor dropping

    XMLWordPrintable

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 4.5.1, 4.6.0, 4.6.2, 4.6.1, 4.6.4, 5.0.0, 5.1.0, 5.0.1, 5.5.0, 5.5.1
    • Fix Version/s: 6.0.0
    • Component/s: couchbase-bucket, DCP
    • Labels:
    • Triage:
      Untriaged
    • Is this a Regression?:
      No

      Description

      Summary

      KV-Enging may to send a STREAM_END after sending all the mutations for a given DCP stream request, if the DCP stream meets all of the following criteria:

      • The stream request has a specific endSeqno (i.e. not infinity).
      • KV-Engine encounters a high memory situation and triggers cursor dropping on this stream, when it is in the backfilling state.
      • The stream doesn't require streaming any more items (because the first backfill actually contained all required mutations).

      This can result in DCP clients hanging as they are waiting (forever) for the STREAM_END.

      Details

      The bug occurs in a specific scenario:

      1. DCP producer in backfilling
      2. Cursor dropping triggered - which causes the “reserved” cursor for that backfill to be discarded.
        1. Typically this means that when the current backfill completes, we’ll have to schedule a second backfill - as the CheckpointManager likely no longer contains the next seqno we needed (given the cursor holding onto that item has been dropped).
        2. However in the bug scenario - say if no mutations are occurring, or if backfill quickly completes - we correctly re-register the cursor and hence don’t need to so a second backfill.
      3. Finally, when we do re-register the cursor, if we find there’s actually no more mutations requested (backfill provided them all) then there’s no more work and the Stream can end.

      The bug is in how this scenario is handled. During the completion of the backfill; re-registering the cursor and then finding no more data needed, we fail to correctly inform the front-end of the final STREAM_END message. The message has been successfully generated, but it’s stuck on the readyQueue as the front-end doesn’t know to check for it.

      Investigation revealed the underlying cause of the bug:

      1. In the specific sequence above, we initially check for an item begin ready (end of backfill phase) and don’t find one - s far so good.
      2. Then we advance to the next state (in-memory), identify all requested seqnos have been found and push STREAM_END onto the readyQ.
      3. However, we _don’_t re-check the readyQ - as such the front-end incorrectly thinks there’s no items ready yet; and hence goes to sleep (forever!).

      To address this we simply re-check for a response after attempting the second backfill.

      Vulnerable Versions

      This bug is believed[1] to affect all versions which support cursor dropping (MB-9897):

      • From 4.5.1 when cursor dropping was introduced up to and including 4.6.4 (cursor dropping was disabled in 4.6.5 - see MB-29483).
      • From 5.0.0 up to and including 5.1.0 (cursor dropping was disabled in 5.1.1 - see MB-29482).
      • From 5.5.0 upwards.

      [1] As of writing, only confirmed on 5.5.1.

        Attachments

          Issue Links

          No reviews matched the request. Check your Options in the drop-down menu of this sections header.

            Activity

            Hide
            christopher.farman Christopher Farman added a comment - - edited

            Situation so far:

            From the logs, we could see that the lack of a stream closed message directly correlated to the vBuckets that were causing various DCP users to hang (Log example shown below)

            // DCP (Producer)
            (vb 730) Creating stream with start seqno 0 and end seqno 179632; requested end seqno was 179632
            (vb 730) Scheduling backfill from 1 to 179632, reschedule flag : False
            (vb 730) Handling slow stream; state_ : backfilling, lastReadSeqno : 0, lastSentSeqno : 0, vBucketHighSeqno : 179633, isBackfillTaskRunning : True
            (vb 730) Sending disk snapshot with start seqno 0 and end seqno 179634
            (vb 730) Backfill complete, 0 items read from disk, 144 from memory, last seqno read: 179634, pendingBackfill : True
            (vb 730) Did not schedule backfill with reschedule : True, tryBackfill : True; backfillStart : 179635, backfillEnd : 179634, flags_ : 32, start_seqno_ : 0, end_seqno_ : 179632, lastReadSeqno : 179634, lastSentSeqno : 179634, curChkSeqno : 179635, itemsReady : True
            ActiveStream::transitionState: (vb 730) Transitioning from backfilling to in-memory
            (vb 730) Stream closing, sent until seqno 179634 remaining items 0, reason: The stream ended due to all items being streamed
             
            // MISSING MESSAGE:
            (vb 730) Stream closed, X items sent from backfill phase, Y items sent from memory phase, Z was last seqno sent
            

            From this, we identified that the reason for the code not entering this state to log that message was due to the stream not being notified when the DCP stream end message is in the ready queue. This was due to the itemsReady flag blocking access to notifyStreamReady inside active_stream.cc from this sequence of events. This issue also seems related to handling a slow stream, as without receiving that message the problematic path of code is never reached. This is likely due to the pendingBackfill flag being set to true after a slow stream being acknowledged.

            A simple fix, to force the entry inside notifyStreamReady, has been proposed and seems to fix the immediate issue, but more importantly is to further understand the usage of itemsReady which is the root cause of this. The confusion with this is clarifying its difference to the readyQ being empty as it appears they are not identical. By clarifying this, and ensuring that itemsReady is used correctly throughout the code we can ensure that this and other potential related issues are solved.

            Show
            christopher.farman Christopher Farman added a comment - - edited Situation so far: From the logs, we could see that the lack of a stream closed message directly correlated to the vBuckets that were causing various DCP users to hang (Log example shown below) // DCP (Producer) (vb 730) Creating stream with start seqno 0 and end seqno 179632; requested end seqno was 179632 (vb 730) Scheduling backfill from 1 to 179632, reschedule flag : False (vb 730) Handling slow stream; state_ : backfilling, lastReadSeqno : 0, lastSentSeqno : 0, vBucketHighSeqno : 179633, isBackfillTaskRunning : True (vb 730) Sending disk snapshot with start seqno 0 and end seqno 179634 (vb 730) Backfill complete, 0 items read from disk, 144 from memory, last seqno read: 179634, pendingBackfill : True (vb 730) Did not schedule backfill with reschedule : True, tryBackfill : True; backfillStart : 179635, backfillEnd : 179634, flags_ : 32, start_seqno_ : 0, end_seqno_ : 179632, lastReadSeqno : 179634, lastSentSeqno : 179634, curChkSeqno : 179635, itemsReady : True ActiveStream::transitionState: (vb 730) Transitioning from backfilling to in-memory (vb 730) Stream closing, sent until seqno 179634 remaining items 0, reason: The stream ended due to all items being streamed   // MISSING MESSAGE: (vb 730) Stream closed, X items sent from backfill phase, Y items sent from memory phase, Z was last seqno sent From this, we identified that the reason for the code not entering this state to log that message was due to the stream not being notified when the DCP stream end message is in the ready queue. This was due to the itemsReady flag blocking access to notifyStreamReady inside active_stream.cc from this sequence of events. This issue also seems related to handling a slow stream, as without receiving that message the problematic path of code is never reached. This is likely due to the pendingBackfill flag being set to true after a slow stream being acknowledged. A simple fix, to force the entry inside notifyStreamReady, has been proposed and seems to fix the immediate issue, but more importantly is to further understand the usage of itemsReady which is the root cause of this. The confusion with this is clarifying its difference to the readyQ being empty as it appears they are not identical. By clarifying this, and ensuring that itemsReady is used correctly throughout the code we can ensure that this and other potential related issues are solved.
            Hide
            build-team Couchbase Build Team added a comment -

            Build couchbase-server-6.0.0-1686 contains kv_engine commit 66985a8 with commit message:
            MB-31481: Don't skip STREAM_END msg after cursor drop during backfill

            Show
            build-team Couchbase Build Team added a comment - Build couchbase-server-6.0.0-1686 contains kv_engine commit 66985a8 with commit message: MB-31481 : Don't skip STREAM_END msg after cursor drop during backfill
            Hide
            build-team Couchbase Build Team added a comment -

            Build couchbase-server-6.5.0-1405 contains kv_engine commit 81226d8 with commit message:
            Merge MB-31481 from branch 'alice'

            Show
            build-team Couchbase Build Team added a comment - Build couchbase-server-6.5.0-1405 contains kv_engine commit 81226d8 with commit message: Merge MB-31481 from branch 'alice'
            Hide
            build-team Couchbase Build Team added a comment -

            Build couchbase-server-6.5.0-1405 contains kv_engine commit 66985a8 with commit message:
            MB-31481: Don't skip STREAM_END msg after cursor drop during backfill

            Show
            build-team Couchbase Build Team added a comment - Build couchbase-server-6.5.0-1405 contains kv_engine commit 66985a8 with commit message: MB-31481 : Don't skip STREAM_END msg after cursor drop during backfill
            Hide
            build-team Couchbase Build Team added a comment -

            Build couchbase-server-5.5.3-4002 contains kv_engine commit 6c5303a with commit message:
            MB-31570: [BP] Don't skip STREAM_END msg after cursor drop during backfill

            Show
            build-team Couchbase Build Team added a comment - Build couchbase-server-5.5.3-4002 contains kv_engine commit 6c5303a with commit message: MB-31570 : [BP] Don't skip STREAM_END msg after cursor drop during backfill
            Hide
            thuan Thuan Nguyen added a comment -

            As Patrick Varley mention in comment in ticket MB-31432, the issue in this ticket is the cause of the hang issue in ticket MB-31432.
            I did verify the fix in ticket MB-31432. So is this ticket. I will close it. If any one hit the issue in this ticket again, please reopen it.

            Show
            thuan Thuan Nguyen added a comment - As Patrick Varley mention in comment in ticket MB-31432 , the issue in this ticket is the cause of the hang issue in ticket MB-31432 . I did verify the fix in ticket MB-31432 . So is this ticket. I will close it. If any one hit the issue in this ticket again, please reopen it.
            Hide
            build-team Couchbase Build Team added a comment -

            Build couchbase-server-6.0.0-1694 contains kv_engine commit 6c5303a with commit message:
            MB-31570: [BP] Don't skip STREAM_END msg after cursor drop during backfill

            Show
            build-team Couchbase Build Team added a comment - Build couchbase-server-6.0.0-1694 contains kv_engine commit 6c5303a with commit message: MB-31570 : [BP] Don't skip STREAM_END msg after cursor drop during backfill
            Hide
            build-team Couchbase Build Team added a comment -

            Build couchbase-server-6.5.0-1447 contains kv_engine commit 6c5303a with commit message:
            MB-31570: [BP] Don't skip STREAM_END msg after cursor drop during backfill

            Show
            build-team Couchbase Build Team added a comment - Build couchbase-server-6.5.0-1447 contains kv_engine commit 6c5303a with commit message: MB-31570 : [BP] Don't skip STREAM_END msg after cursor drop during backfill

              People

              • Assignee:
                christopher.farman Christopher Farman
                Reporter:
                christopher.farman Christopher Farman
              • Votes:
                0 Vote for this issue
                Watchers:
                12 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved:

                  Gerrit Reviews

                  There are no open Gerrit changes

                    PagerDuty

                    Error rendering 'com.pagerduty.jira-server-plugin:PagerDuty'. Please contact your Jira administrators.