Couchbase Server / MB-37681

Handle disconnect during an initial complete DCP stream


Details

    • Type: Improvement
    • Status: Open
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Version/s: 6.5.0
    • Fix Version/s: Morpheus
    • Component/s: couchbase-bucket

    Description

      Problem

      Currently it is not possible to continue after a disconnect if the client is behind the purge seqno. This situation is common in flaky network environments, such as WAN links, and affects the feasibility of collecting a full backup from a large data source.

      This issue is expected to be hit when backing up to cloud.

      A possible solution is to keep the file handles open after a disconnect for a short / configurable period of time. Then, if the client re-connects with the same DCP stream and details (e.g. end snapshot), the stream can continue where it left off, using the old file handle.
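
      A minimal sketch of that resume check, assuming the mechanism described above (all types and names below are invented for illustration, not actual KV-Engine code): the retained handle may only be re-used when the same client asks for the same stream details within the grace period.

        // Sketch only: invented types illustrating the proposed resume
        // condition - same client, same end snapshot, and still within
        // the grace period after the disconnect.
        #include <chrono>
        #include <cstdint>
        #include <string>

        struct RetainedStream {
            std::string clientName; // DCP connection name at disconnect
            uint64_t snapEndSeqno;  // end snapshot the stream was sending
            std::chrono::steady_clock::time_point disconnectedAt;
        };

        bool canResume(const RetainedStream& retained,
                       const std::string& clientName,
                       uint64_t snapEndSeqno,
                       std::chrono::seconds gracePeriod) {
            return clientName == retained.clientName &&
                   snapEndSeqno == retained.snapEndSeqno &&
                   (std::chrono::steady_clock::now() -
                    retained.disconnectedAt) <= gracePeriod;
        }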


          Activity

            pvarley Patrick Varley added a comment -

            This also popped up with an XDCR case where the network connection is a bit flaky. It could be argued that XDCR is in a better position than backup to handle this, as it does not have to restart the DCP stream, but that is not the current design of XDCR. There are a number of times when it will restart the whole pipeline. Adding this logic into the consumer is doable, but does make each client more complex.

            drigby Dave Rigby added a comment -

            Discussed this with Patrick Varley and Daniel Owen. To summarise, I anticipate the impact of the reported backup scenario is significantly less for CB Server 6.6+, due to MB-37680 (Sequential Backfill support).

            For example, assume that the purge seqno is located ~90% of the way through each vBucket (in terms of items). Prior to MB-37680, if cbbackup got, say, 80% of the way through streaming data from KV-Engine via DCP and was disconnected, then, given all vBuckets are backfilled concurrently, no vBuckets would have been streamed up to at least the purge seqno, and all vBuckets would have to rollback on reconnect.

            After MB-37680, with vBuckets backfilled sequentially, ~80% of vBuckets would have been completely streamed (and hence past the purge seqno). On disconnect / reconnect, only ~20% of vBuckets would require rollback - which seems reasonable given that's proportional to how much data was not sent.

            drigby Dave Rigby added a comment - edited

            A little more detail on the proposed solution - if a stream is in the backfilling state (reading from disk), add a new DCP control message which the DCP client can negotiate:

            backfill_resume - Request that disk backfills which are in-progress when the DCP connection is closed (i.e. due to network disconnect) are not immediately closed by KV-Engine, but instead the associated resources (i.e. the couchstore file snapshot) are preserved for a limited grace period (e.g. 60s).

            If the same DCP client (same name presented) re-connects within the grace period, and presents the same snapshot start / end as it was previously in the middle of (see https://github.com/couchbase/kv_engine/blob/master/docs/dcp/documentation/building-a-simple-client.md#restarting-from-where-you-left-off), then the same couchstore file snapshot is re-used, and hence the client can resume from the same state without being subject to the purge seqno advancing.
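
            To make the negotiation concrete, a hypothetical client-side flow might look like the following. DcpConnection and its methods are invented stand-ins for a real DCP client library, not an existing API - only the "backfill_resume" control key and the stream details come from the proposal above.

              // Sketch only: hypothetical client flow for the proposed
              // backfill_resume control. DcpConnection is an invented
              // stand-in; its methods are stubs for illustration.
              #include <cstdint>
              #include <string>
              #include <utility>

              struct StreamDetails {
                  uint16_t vbid;
                  uint64_t startSeqno;     // last seqno fully processed
                  uint64_t endSeqno;
                  uint64_t snapStartSeqno; // snapshot the client was in
                  uint64_t snapEndSeqno;
                  uint64_t vbUuid;
              };

              class DcpConnection {
              public:
                  // The name must be stable across reconnects so the
                  // server can match the cached backfill to this client.
                  explicit DcpConnection(std::string clientName)
                      : name(std::move(clientName)) {
                  }

                  // Sends a DCP_CONTROL key/value; returns whether the
                  // server acknowledged it. Stubbed out here.
                  bool control(const std::string&, const std::string&) {
                      return true;
                  }

                  // Sends a DCP StreamRequest for one vBucket. Stubbed.
                  bool streamRequest(const StreamDetails&) {
                      return true;
                  }

              private:
                  std::string name;
              };

              bool resumeAfterDisconnect(const std::string& clientName,
                                         const StreamDetails& lastKnown) {
                  // Reconnect under the *same* connection name.
                  DcpConnection conn(clientName);

                  // Proposed control: ask KV-Engine to preserve any
                  // in-progress disk backfills for the grace period.
                  if (!conn.control("backfill_resume", "true")) {
                      return false; // server lacks the proposed feature
                  }

                  // Present the same snapshot start / end as before; if
                  // the server still holds the couchstore snapshot, the
                  // stream resumes unaffected by purge seqno advancing.
                  return conn.streamRequest(lastKnown);
              }

            The important detail is that the connection name and the snapshot start / end must be identical to the pre-disconnect values; otherwise the server falls back to a normal stream request, with possible rollback.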

            Implementation Sketch

            (From some discussions with Jim Walker a while back, assuming I remember correctly...)

            1. We introduce a backfill FileHandle cache - something like a map of:

              (DCP client name, Vbid) -> (timestamp, KVFileHandle, HighSeqno)
              

            2. On successful negotiation of the new backfill_resume flag, if a DCP Producer is closed due to disconnect and there are any backfills still in progress, then instead of destroying the ScanContext from the backfill, transfer its FileHandle to the new Backfill FileHandle cache.
            3. If the DCP client reconnects, and has negotiated backfill_resume, then on a StreamRequest check the requested vbid and snap_end. If the backfill cache contains an entry with the given DCP client name and Vbid, and snap_end == the cached HighSeqno, then use the cached KVFileHandle for the ScanContext. If not, then open a new one as normal.
              (Note: we cannot re-use the ScanContext (and internal KVStore iterator) as-is, because the client may not have received all the mutations KV-Engine transmitted - i.e. the iterator may be too far advanced. Instead we just re-use the FileHandle, i.e. the couchstore file snapshot.)
            4. Periodically (every minute?) a background task checks the timestamps of all items in the Backfill FileHandle cache. Any which are older than the grace period are removed (freeing up the underlying FileHandle / snapshot). A rough sketch of this cache is given below.
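
            A rough C++ sketch of the cache described in steps 1-4, assuming simplified stand-ins for KVFileHandle and Vbid - only the map shape, the grace period, and the snap_end == HighSeqno check come from the sketch above; everything else is invented for illustration.

              // Sketch only: a possible shape for the proposed
              // Backfill FileHandle cache.
              #include <chrono>
              #include <cstdint>
              #include <iterator>
              #include <map>
              #include <memory>
              #include <mutex>
              #include <string>
              #include <utility>

              struct KVFileHandle {}; // stand-in: couchstore file snapshot
              using Vbid = uint16_t;  // stand-in for the real Vbid type

              class BackfillFileHandleCache {
              public:
                  using Clock = std::chrono::steady_clock;
                  // (DCP client name, Vbid) ->
                  //     (timestamp, KVFileHandle, HighSeqno)
                  using Key = std::pair<std::string, Vbid>;

                  struct Entry {
                      Clock::time_point savedAt;
                      std::shared_ptr<KVFileHandle> fileHandle;
                      uint64_t highSeqno;
                  };

                  // Step 2: on disconnect of a producer which negotiated
                  // backfill_resume, keep the FileHandle from each
                  // in-progress backfill instead of destroying it with
                  // the ScanContext.
                  void save(Key key,
                            std::shared_ptr<KVFileHandle> fh,
                            uint64_t highSeqno) {
                      std::lock_guard<std::mutex> lh(lock);
                      cache[std::move(key)] = {
                              Clock::now(), std::move(fh), highSeqno};
                  }

                  // Step 3: on StreamRequest from a reconnecting client,
                  // hand back the cached handle iff the requested
                  // snap_end matches the snapshot's HighSeqno; otherwise
                  // the caller opens a new handle as normal.
                  std::shared_ptr<KVFileHandle> fetch(const Key& key,
                                                      uint64_t snapEnd) {
                      std::lock_guard<std::mutex> lh(lock);
                      auto it = cache.find(key);
                      if (it == cache.end() ||
                          it->second.highSeqno != snapEnd) {
                          return nullptr;
                      }
                      auto fh = std::move(it->second.fileHandle);
                      cache.erase(it);
                      return fh;
                  }

                  // Step 4: run periodically by a background task; drop
                  // entries older than the grace period, releasing the
                  // underlying snapshot.
                  void purgeExpired(std::chrono::seconds gracePeriod =
                                            std::chrono::seconds(60)) {
                      std::lock_guard<std::mutex> lh(lock);
                      const auto now = Clock::now();
                      for (auto it = cache.begin(); it != cache.end();) {
                          it = (now - it->second.savedAt > gracePeriod)
                                       ? cache.erase(it)
                                       : std::next(it);
                      }
                  }

              private:
                  std::mutex lock;
                  std::map<Key, Entry> cache;
              };

            Handing the FileHandle back via fetch() removes the entry, so a given retained snapshot can be resumed at most once; anything never claimed is reaped by purgeExpired().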

            People

              Assignee: drigby Dave Rigby
              Reporter: owend Daniel Owen
              Votes: 0
              Watchers: 7

              Dates

                Created:
                Updated:

                Gerrit Reviews

                  There are no open Gerrit changes
