Couchbase Server / MB-9897

Implement DCP cursor dropping


Details

    • Type: Task
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 3.0
    • Fix Version/s: 4.5.1
    • Component/s: couchbase-bucket
    • Security Level: Public
    • Labels: None
    • Story Points: 3
    • Sprint: Mar 9 - Mar 27, KV: May 30 - June 10

    Description

      When DCP streams are created, they first go through a backfill stage and then move into an in-memory streaming phase. At that point each DCP stream has a cursor pointing at its current position in the checkpoints.

      If a DCP client ends up lagging behind (due to network bandwidth limits or generally slow processing), its cursor essentially keeps open checkpoints which we would like to discard (assuming the persistence cursor has finished with them). The effect is that we can end up keeping large numbers of checkpoint items in memory, as we need to keep them around to stream to the lagging client. In the worst case this has resulted in the KV-engine running out of memory. See the linked MBs below.

      The proposal to address this is to allow "cursor dropping": if a client gets too far behind, we drop its cursor, allowing us to free any checkpoints held up by it.

      The initial thought was to drop the whole DCP stream - i.e. tell the client the stream had ended / been disconnected, forcing it to reconnect. However, this was deemed undesirable from the client's point of view. The follow-up / alternative proposal is to instead transition the stream back to the "backfilling" state - this allows the checkpoint cursor to be removed while the client stays connected; we essentially re-backfill from where the client had reached up to the new current high sequence number.

      Design spec: https://docs.google.com/document/d/15baNgCbG7K_EYWnvBhltFER0RBrVkKlDO-wMomlTq-Y/edit
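The interaction described above - checkpoints can only be freed once every cursor (including the persistence cursor) has moved past them, so one lagging DCP cursor pins memory until it is dropped - can be illustrated with a small model. This is a hedged sketch, not the ep-engine implementation; the class and method names are invented for illustration.

```python
class CheckpointManager:
    """Toy model of checkpoint ownership: cursors are indices into an
    ordered list of checkpoints; a checkpoint is freeable only once
    every cursor has advanced past it."""

    def __init__(self):
        self.checkpoints = []   # checkpoint ids, oldest first
        self.cursors = {}       # cursor name -> index into self.checkpoints

    def add_checkpoint(self, ckpt_id):
        self.checkpoints.append(ckpt_id)

    def register_cursor(self, name, pos=0):
        self.cursors[name] = pos

    def unreferenced_prefix(self):
        # Checkpoints older than every cursor can be freed.
        if not self.cursors:
            return len(self.checkpoints)
        return min(self.cursors.values())

    def drop_slowest_dcp_cursor(self):
        # Drop the most-lagging non-persistence cursor; the matching DCP
        # stream would transition back to backfilling from that position.
        dcp = {n: p for n, p in self.cursors.items() if n != "persistence"}
        if not dcp:
            return None
        slowest = min(dcp, key=dcp.get)
        return slowest, self.cursors.pop(slowest)

    def free_unreferenced(self):
        n = self.unreferenced_prefix()
        freed, self.checkpoints = self.checkpoints[:n], self.checkpoints[n:]
        # re-base remaining cursor positions after removing the prefix
        self.cursors = {k: v - n for k, v in self.cursors.items()}
        return freed


cm = CheckpointManager()
for i in range(5):
    cm.add_checkpoint(i)
cm.register_cursor("persistence", 5)   # persistence is done with all five
cm.register_cursor("slow_dcp", 1)      # lagging client pins checkpoints 1..4
print(cm.free_unreferenced())          # only checkpoint 0 is freeable
print(cm.drop_slowest_dcp_cursor())    # drop the lagging cursor
print(cm.free_unreferenced())          # now the rest can be freed
```

Note that the position returned by `drop_slowest_dcp_cursor` is in the re-based coordinates of the remaining checkpoint list; in the real engine the stream would re-backfill from the sequence number it had reached.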

      Attachments

        Issue Links


          Activity

            tommie Tommie McAfee added a comment -

            Hi Eric,
            If I'm understanding the situation correctly, functional testing is the way to go, as cluster sizing does not appear to be the determining factor:
            "checkpoint items are taking up all the available memory, and replication is continuously backing off because replication_threshold (99% of quota) has been reached." - MB-15082

            Hence MB-14591: a 100MB bucket gets backoffs loading only 20 items because checkpoints are taking up the memory, and the fix here for cursor dropping is to drop the checkpoints. Functionally it's easier to do a lot of the granular stat checks for replica items and num backoffs.

            ericcooper Eric Cooper (Inactive) added a comment -

            I am attempting to run a test scenario which demonstrates the problem in 4.5.0 and, in the same scenario, shows that 4.5.1 does not have the problem. I used the script included in MB-14591; it does reproduce the problem in 4.5.0, but the same problem also appears in 4.5.1. With the same test I increased the bucket size to 1G, and the problem then appears in neither 4.5.0 nor 4.5.1. I continue to research this.

            ericcooper Eric Cooper (Inactive) added a comment -

            For the big-set script from MB-14591 I was using 20 kvs and saw no improvement. I increased to 50 kvs: in 4.5.0 there is a no-memory error on the 36th key, while in 4.5.1 the no-memory error appears only after 46 keys, so things are better.

            manu Manu Dhundi (Inactive) added a comment -

            I don't know how the test script runs, but seeing the no-memory error at a later point could be an indication of improvement due to cursor dropping.

            Steps to reproduce the problem:
            1) Open a slow DCP client connection.
            2) Write a lot of items until we hit heavy DGM (say 100 items).
            3) DCP-stream 10 items (stream request start = 0, end = inf, but stream only up to 10).
            4) Pause the load for a while so that a new checkpoint is created.
            5) Writing more items should now cause OOM in 4.5.0, but not in 4.5.1.
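The OOM behaviour in step 5 depends on when the engine decides to drop cursors. A hedged sketch of a threshold-style trigger policy follows; the idea of upper/lower memory watermarks matches how ep-engine's cursor_dropping_upper_mark / cursor_dropping_lower_mark settings are commonly described, but the defaults used here (95% / 80% of the bucket quota) and the greedy largest-first drop order are assumptions, not the actual implementation.

```python
def select_cursors_to_drop(mem_used, quota, pinned, upper=0.95, lower=0.80):
    """Decide which DCP cursors to drop.

    mem_used : current memory usage in bytes
    quota    : bucket memory quota in bytes
    pinned   : cursor name -> bytes of checkpoint memory that only
               that cursor is keeping alive

    Nothing is dropped until usage exceeds the upper watermark; then
    cursors are dropped (largest pinner first) until the projected
    usage falls below the lower watermark.
    """
    if mem_used <= upper * quota:
        return []
    drop = []
    for name, freed in sorted(pinned.items(), key=lambda kv: -kv[1]):
        drop.append(name)
        mem_used -= freed
        if mem_used <= lower * quota:
            break
    return drop


# Below the upper watermark: no cursors are dropped.
print(select_cursors_to_drop(90, 100, {"a": 10}))
# Above it: dropping the biggest pinner ("b") is enough to get
# back under the lower watermark.
print(select_cursors_to_drop(96, 100, {"a": 10, "b": 30}))
```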

            ericcooper Eric Cooper (Inactive) added a comment -

            I used this command to introduce network delay:
            tc qdisc add dev eth0 root netem delay 100ms

            And then I ran pillowfight:
            /usr/bin/cbc-pillowfight -I 40000000 -m 1000 -M 1000 -U couchbase://172.23.106.25/default -t 50 -p load_1_ -r 100

            And I did see cursor dropping.
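One way to confirm "I did see cursor dropping" is to look at the engine stats. The sketch below parses `cbstats <host> all`-style "name: value" output; the stat name `ep_cursors_dropped` is an assumption based on ep-engine's usual naming (if it differs in your build, grep the stats output for "cursor"), and the sample text is invented for illustration.

```python
import re

def parse_stats(text):
    """Parse cbstats-style ' name: value' lines into a dict of strings."""
    stats = {}
    for line in text.splitlines():
        m = re.match(r"\s*(\S+)\s*:\s*(\S+)\s*$", line)
        if m:
            stats[m.group(1)] = m.group(2)
    return stats

# Hypothetical excerpt of `cbstats localhost:11210 all` output:
sample = """
 ep_cursor_dropping_lower_threshold: 826236108
 ep_cursor_dropping_upper_threshold: 981155328
 ep_cursors_dropped:                 3
"""

stats = parse_stats(sample)
# A non-zero ep_cursors_dropped indicates cursor dropping took place.
print(int(stats["ep_cursors_dropped"]))
```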

            People

              Assignee: ericcooper Eric Cooper (Inactive)
              Reporter: mikew Mike Wiederhold [X] (Inactive)
              Votes: 0
              Watchers: 12
