Couchbase Server / MB-48935

XDCR - BackfillPipeline resuming from checkpoint always gets the latest checkpoint

Details

    • Bug
    • Resolution: Fixed
    • Major
    • 7.1.0
    • 7.0.0, 7.0.1, 7.0.2, 7.1.0
    • None
    • None
    • Untriaged
    • 1
    • No

    Description

      From MB-48919, we will follow VB 69 on node 120.170 (randomly picked) to trace the path.

      Originally, the backfill task for VB 69 was to end at seqno 61237:

      2021-10-14T04:28:53.472-07:00 INFO GOXDCR.DcpNozzle: dcp_backfill_b67b789bcfd95e6b97c8af1e8fa5c7cb/GleamBookUsers0/GleamBookUsers0_172.23.120.170:11210_1 Received backfill tasks for vb to endSeqno: map[59:62585 60:37124 61:36424 62:35784 63:38586 64:61891 65:36226 66:61918 67:35924 68:35746 69:61237 71:38950 72:35684 73:35304 74:62209 75:38123 76:35864 77:36524 78:35584 79:36504 80:37173 81:35939 82:36824 85:36132 86:61776 87:36504 88:61169 89:35644 90:35464 91:60840 92:36004 93:61871 94:61366 95:37345 99:60488 100:66702 101:58732 102:36604 103:38788 104:61658 105:36916 106:64251 107:35964 108:35644 109:61356 127:23716 128:23636 241:35824 355:24436 468:23456 469:27692]
      

      However, Checkpoint Manager found a checkpoint doc and returned a starting sequence number of 106624, which is greater than the requested end seqno (61237).

      2021-10-14T04:28:53.810-07:00 INFO GOXDCR.CheckpointMgr: BackfillPipeline backfill_b67b789bcfd95e6b97c8af1e8fa5c7cb/GleamBookUsers0/GleamBookUsers0 Found checkpoint doc for vb=69
       
      … (metakv) …
       
      cbcollect_info_ns_1@172.23.120.170_20211014-141914/ns_server.metakv.log:[metakv:debug,2021-10-14T04:28:53.351-07:00,ns_1@172.23.120.170:<0.27360.18>:simple_store:iterate_matching:73]Returning Key <<"/ckpt/backfill_b67b789bcfd95e6b97c8af1e8fa5c7cb/GleamBookUsers0/GleamBookUsers0/69">>.
       
      2021-10-14T04:28:53.814-07:00 INFO GOXDCR.CheckpointMgr: BackfillPipeline backfill_b67b789bcfd95e6b97c8af1e8fa5c7cb/GleamBookUsers0/GleamBookUsers0 Set VBTimestamp: vb=69, ts.Seqno=106624, ts.SourceManifestId=10 ts.TargetManifestId=1
      

      XDCR has a check in place to prevent a start seqno beyond the end seqno, so it sets the start to end-1 (61236). That is still not correct, because this backfill task should really start at 0.

      2021-10-14T04:28:53.891133-07:00 INFO 530: (GleamBookUsers0) DCP (Producer) eq_dcpq:xdcr:dcp_backfill_b67b789bcfd95e6b97c8af1e8fa5c7cb/GleamBookUsers0/GleamBookUsers0_172.23.120.170:11210_1:d38tSYiYCPKUTYjSQE_ofg== - (vb:69) Creating stream with start seqno 61236 and end seqno 61237; requested end seqno was 61237, collections-filter-size:10 sid:none
       
      2021-10-14T04:28:53.891-07:00 [INFO] UPR_STREAMREQ for vb 69 successful
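
      The clamping behaviour visible in the log above can be sketched as follows. This is a minimal illustration, not goxdcr's actual code: the function name clampStartSeqno and the exact comparison are assumptions, and only the seqno values come from this ticket.

```go
package main

import "fmt"

// clampStartSeqno is a hypothetical helper (not taken from goxdcr).
// It mirrors what the logs show: when the checkpointed seqno lies beyond
// the backfill task's end seqno, the start is pulled back to end-1.
func clampStartSeqno(ckptSeqno, endSeqno uint64) uint64 {
	// Guard endSeqno > 0 so that endSeqno-1 cannot underflow.
	if endSeqno > 0 && ckptSeqno > endSeqno {
		return endSeqno - 1
	}
	return ckptSeqno
}

func main() {
	// Values for vb=69 from the logs in this ticket.
	start := clampStartSeqno(106624, 61237)
	fmt.Printf("stream vb=69 start=%d end=%d\n", start, 61237)
	// prints: stream vb=69 start=61236 end=61237
}
```

      With these inputs the stream covers a single seqno instead of the range starting at 0, which matches the DCP stream request shown above.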
      

      This means that the backfill was not complete, and it could potentially cause missed data.

      In this instance, the backfill completed “successfully” without really transferring any data.

      2021-10-14T04:29:49.691-07:00 INFO GOXDCR.DcpNozzle: dcp_backfill_b67b789bcfd95e6b97c8af1e8fa5c7cb/GleamBookUsers0/GleamBookUsers0_172.23.120.170:11210_1: seqno: 107497 dcp_backfill_b67b789bcfd95e6b97c8af1e8fa5c7cb/GleamBookUsers0/GleamBookUsers0_172.23.120.170:11210_1 stream for vb=69 is closed by producer
      

      The reason is that Checkpoint Manager, when determining the starting seqno, passes maxUint64 as the seqno bound to check against, so the latest checkpoint is always accepted. This is problematic if the checkpoint wasn’t deleted cleanly (another MB will be filed and linked) or if peerToPeer pulls in any other old checkpoints.

      http://src.couchbase.org/source/xref/cheshire-cat/goproj/src/github.com/couchbase/goxdcr/pipeline_svc/checkpoint_manager.go#1033

      This MB should ensure that:

      1. A checkpoint whose seqno is greater than the endSeqno of a backfill task should not be used.
      2. If a full backfill is needed, this instance of XDCR will not use a checkpoint, even if it is one pulled from a peer.
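
      The two requirements above could look roughly like the sketch below. It is not the actual fix: the checkpoint struct, the fullBackfill flag, and the function backfillStartSeqno are hypothetical names chosen for illustration.

```go
package main

import "fmt"

// checkpoint is a hypothetical, simplified record; goxdcr's real
// checkpoint documents carry more fields.
type checkpoint struct {
	Seqno    uint64
	FromPeer bool // true if the checkpoint was pulled from a peer node
}

// backfillStartSeqno picks the start seqno for one VB's backfill stream.
func backfillStartSeqno(ckpts []checkpoint, endSeqno uint64, fullBackfill bool) uint64 {
	// Requirement 2: a full backfill ignores all checkpoints,
	// including ones pulled from a peer.
	if fullBackfill {
		return 0
	}
	var start uint64
	for _, c := range ckpts {
		// Requirement 1: skip checkpoints beyond the task's endSeqno.
		if c.Seqno > endSeqno {
			continue
		}
		if c.Seqno > start {
			start = c.Seqno
		}
	}
	return start
}

func main() {
	stale := []checkpoint{{Seqno: 106624}}
	fmt.Println(backfillStartSeqno(stale, 61237, false)) // 0

	peer := []checkpoint{{Seqno: 50000, FromPeer: true}}
	fmt.Println(backfillStartSeqno(peer, 61237, true))  // 0
	fmt.Println(backfillStartSeqno(peer, 61237, false)) // 50000
}
```

      In the vb=69 case from this ticket, the stale checkpoint at 106624 is skipped and the stream starts at 0 rather than at 61236.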

            People

              neil.huang Neil Huang