Details
Type: Improvement
Status: Open
Priority: Critical
Resolution: Unresolved
Affects Version/s: 6.5.0
Description
Problem
Currently, it is not possible to continue a stream after a disconnect if the client is behind the purge seqno. This situation is common in flaky network environments, such as WAN links, and affects the feasibility of collecting a full backup from a large data source.
This issue is expected to be hit when backing up to the cloud.
A possible solution is to keep the file handles open for a short, configurable period of time after a disconnect. Then, if the client reconnects with the same DCP stream and details (e.g. snapshot end), the stream can continue where it left off, using the old file handle.
Activity
Field | Original Value | New Value |
---|---|---|
Labels | backup |
Issue Type | Task [ 3 ] | Improvement [ 4 ] |
Summary | Disconnect during an initial complete DCP stream | Handle disconnect during an initial complete DCP stream |
Labels | backup | 6.5.2-candidate backup |
Fix Version/s | Cheshire-Cat [ 15915 ] |
Epic Link | |
Affects Version/s | Mad-Hatter [ 15037 ] | |
Affects Version/s | 6.5.0 [ 16624 ] |
Fix Version/s | 6.5.2 [ 16735 ] | |
Fix Version/s | Cheshire-Cat [ 15915 ] |
Rank | Ranked higher |
Fix Version/s | 6.6.0 [ 16787 ] | |
Fix Version/s | 6.5.2 [ 16735 ] |
Link | This issue relates to MB-38724 [ MB-38724 ] |
Link | This issue blocks MB-38724 [ MB-38724 ] |
Link | This issue relates to MB-38724 [ MB-38724 ] |
Labels | 6.5.2-candidate backup | 6.5.2-candidate approved-for-6.6.0 backup |
Labels | 6.5.2-candidate approved-for-6.6.0 backup | backup |
Assignee | Daniel Owen [ owend ] | Dave Rigby [ drigby ] |
Labels | backup | approved-for-6.6.0 backup |
Priority | Major [ 3 ] | Critical [ 2 ] |
Fix Version/s | Cheshire-Cat [ 15915 ] | |
Fix Version/s | 6.6.0 [ 16787 ] |
Epic Link | |
Labels | approved-for-6.6.0 backup | backup |
Rank | Ranked higher |
Link | This issue blocks MB-38724 [ MB-38724 ] |
Fix Version/s | CheshireCat.Next [ 16908 ] | |
Fix Version/s | Cheshire-Cat [ 15915 ] |
Fix Version/s | Neo [ 17615 ] | |
Fix Version/s | CheshireCat.Next [ 16908 ] |
Rank | Ranked higher |
Rank | Ranked higher |
Fix Version/s | Morpheus [ 17651 ] | |
Fix Version/s | Neo [ 17615 ] |
Link | This issue relates to CBSE-11668 [ CBSE-11668 ] |
Labels | backup | backup xdcr |
A little more detail on the proposed solution: add a new DCP control message, which the DCP client can negotiate, that takes effect while a stream is in the backfilling state (reading from disk):
backfill_resume - Request that disk backfills which are in progress when the DCP connection is closed (e.g. due to a network disconnect) are not immediately closed by KV-Engine. Instead, the associated resources (i.e. the couchstore file snapshot) are preserved for a limited grace period (e.g. 60s).
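A minimal C++ sketch of how the server side might accept the proposed control key. It assumes the key is named `backfill_resume` as above and that its value is the string `"true"` or `"false"`, in the style of existing DCP control keys. `DcpConnectionOptions` and `handleDcpControl` are hypothetical names, not KV-Engine APIs.

```cpp
#include <cassert>
#include <chrono>
#include <string>

// Hypothetical per-connection options; field names are illustrative only.
struct DcpConnectionOptions {
    bool backfillResume = false;
    // Grace period for preserved backfill resources. 60s is the example
    // value from the proposal, not a decided default.
    std::chrono::seconds backfillResumeGrace{60};
};

// Sketch of handling a DCP_CONTROL key/value pair for the proposed
// "backfill_resume" key. Returns false for keys or values this sketch
// does not recognise, so the caller can reject the control request.
bool handleDcpControl(const std::string& key,
                      const std::string& value,
                      DcpConnectionOptions& options) {
    if (key != "backfill_resume") {
        return false;
    }
    if (value == "true") {
        options.backfillResume = true;
        return true;
    }
    if (value == "false") {
        options.backfillResume = false;
        return true;
    }
    return false;  // unrecognised value; options left unchanged
}
```

Only clients that explicitly negotiate the key get the new behaviour, so existing DCP consumers keep the current close-immediately semantics.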
If the same DCP client (presenting the same connection name) reconnects within the grace period and requests the stream with the same snapshot start / end that it was previously in the middle of (see https://github.com/couchbase/kv_engine/blob/master/docs/dcp/documentation/building-a-simple-client.md#restarting-from-where-you-left-off), then the same couchstore file snapshot is reused. Hence the client can resume from the same state without being affected by the purge seqno advancing.
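The reconnect check described above could be sketched as a pure predicate. `ParkedBackfill`, `StreamRequest` and `canResumeBackfill` are hypothetical names; the sketch assumes the grace period is measured from the moment the connection closed, on a monotonic clock.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <string>

using Clock = std::chrono::steady_clock;

// What KV-Engine would remember about an interrupted backfill
// (hypothetical structure for illustration).
struct ParkedBackfill {
    std::string clientName;
    uint16_t vbid;
    uint64_t snapStartSeqno;
    uint64_t snapEndSeqno;
    Clock::time_point disconnectedAt;
};

// The fields of the new stream request that matter for resuming.
struct StreamRequest {
    std::string clientName;
    uint16_t vbid;
    uint64_t snapStartSeqno;
    uint64_t snapEndSeqno;
};

// True if the preserved couchstore snapshot may be reused: same client
// name, same vBucket, same snapshot bounds, and still within the grace period.
bool canResumeBackfill(const ParkedBackfill& parked,
                       const StreamRequest& req,
                       Clock::time_point now,
                       std::chrono::seconds grace) {
    return parked.clientName == req.clientName && parked.vbid == req.vbid &&
           parked.snapStartSeqno == req.snapStartSeqno &&
           parked.snapEndSeqno == req.snapEndSeqno &&
           now - parked.disconnectedAt <= grace;
}
```

If any field differs, the request is treated as a normal stream request and is subject to the usual purge seqno checks.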
Implementation Sketch
(From some discussions with Jim Walker a while back, assuming I remember correctly...)
Keep a map: (DCP client name, Vbid) -> (timestamp, KVFileHandle, HighSeqno)
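One way to hold that map, sketched in C++ under the assumption that dropping the last reference to the file handle closes the couchstore snapshot (modelled here with `std::shared_ptr`). `KVFileHandle` below is a stand-in struct, not KV-Engine's real class, and `BackfillResumeRegistry`, `park`, `take` and `expire` are hypothetical names.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <map>
#include <memory>
#include <optional>
#include <string>
#include <utility>

using Clock = std::chrono::steady_clock;

// Stand-in for KV-Engine's KVFileHandle (an open couchstore snapshot).
struct KVFileHandle {
    std::string name;
};

struct ParkedEntry {
    Clock::time_point timestamp;           // when the connection closed
    std::shared_ptr<KVFileHandle> handle;  // keeps the file snapshot open
    uint64_t highSeqno;                    // end of the original backfill
};

class BackfillResumeRegistry {
public:
    using Key = std::pair<std::string, uint16_t>;  // (client name, vbid)

    explicit BackfillResumeRegistry(std::chrono::seconds grace)
        : gracePeriod(grace) {}

    // Called when a DCP connection closes mid-backfill.
    void park(const std::string& client, uint16_t vbid, ParkedEntry entry) {
        entries.insert_or_assign(Key{client, vbid}, std::move(entry));
    }

    // Called on reconnect: removes the entry and returns it if still in grace.
    std::optional<ParkedEntry> take(const std::string& client, uint16_t vbid,
                                    Clock::time_point now) {
        auto it = entries.find(Key{client, vbid});
        if (it == entries.end()) {
            return std::nullopt;
        }
        ParkedEntry entry = std::move(it->second);
        entries.erase(it);
        if (now - entry.timestamp > gracePeriod) {
            return std::nullopt;  // too late; handle released with `entry`
        }
        return entry;
    }

    // Periodic cleanup: erasing an entry drops its file-handle reference.
    void expire(Clock::time_point now) {
        for (auto it = entries.begin(); it != entries.end();) {
            if (now - it->second.timestamp > gracePeriod) {
                it = entries.erase(it);
            } else {
                ++it;
            }
        }
    }

    std::size_t size() const { return entries.size(); }

private:
    std::chrono::seconds gracePeriod;
    std::map<Key, ParkedEntry> entries;
};
```

Removing the entry in `take()` even when it has expired means a late reconnect simply falls back to a normal, non-resumed backfill.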
(Note: We cannot re-use the ScanContext (and its internal KVStore iterator) as-is, because the client may not have received all the mutations KV-Engine transmitted, i.e. the iterator may be too far advanced. Instead, we just re-use the FileHandle, i.e. the couchstore file snapshot.)
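To illustrate why a fresh scan is needed, the sketch below derives the seqno range to re-read from the preserved snapshot. It assumes, following the DCP restart procedure linked above, that the reconnecting client's stream-request start seqno is the last seqno it actually received, and that the saved HighSeqno bounds the original backfill. `ScanRange` and `resumeScanRange` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Seqno range to re-scan from the preserved file snapshot (hypothetical type).
struct ScanRange {
    uint64_t startSeqno;  // first seqno to send (inclusive)
    uint64_t endSeqno;    // last seqno to send (inclusive)
};

// The old ScanContext may already have read past what the client received,
// so its position is ignored. A fresh scan over the same snapshot starts
// just after the client's start seqno and ends at the saved high seqno.
std::optional<ScanRange> resumeScanRange(uint64_t clientStartSeqno,
                                         uint64_t snapshotHighSeqno) {
    if (clientStartSeqno >= snapshotHighSeqno) {
        return std::nullopt;  // nothing left to backfill from this snapshot
    }
    return ScanRange{clientStartSeqno + 1, snapshotHighSeqno};
}
```

For example, if KV-Engine had transmitted up to seqno 180 but the client only received up to 150 before the disconnect, the resumed scan covers 151 to the saved HighSeqno, re-sending 151 to 180 from the same file snapshot.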