  1. Couchbase Server
  2. MB-48938

XDCR - Backfill Pipeline non-Zero Starting Timestamp is set with snapStart+snapEnd of 0

Details

    • Bug
    • Resolution: Fixed
    • Major
    • 7.1.0
    • 7.0.0, 7.0.1, 7.0.2, 7.1.0
    • None
    • Untriaged
    • 1
    • No

    Description

      Note that the symptoms of this issue and of MB-48855 are identical at first glance.

      Using MB-48919 as the example, we will follow VB 69 on node 172.23.120.170 (picked at random) to trace the path.
      This issue arises when:

      1. The backfill task starts at a non-zero seqno
      2. There is no backfill checkpoint from which to resume

      This leads to issues such as MB-48919, even though the bug signature is identical to that of MB-48855.

      In the logs, we can see the checkpoint docs being deleted for the backfill task:

      2021-10-14T04:28:45.319-07:00 INFO GOXDCR.CheckpointSvc: DelCheckpointsDocs is done for backfill_b67b789bcfd95e6b97c8af1e8fa5c7cb/GleamBookUsers0/GleamBookUsers0
      

      The metakv log confirms the deletion:

      cbcollect_info_ns_1@172.23.120.170_20211014-141914/ns_server.metakv.log:[metakv:debug,2021-10-14T04:28:41.668-07:00,ns_1@172.23.120.170:simple_store_xdcr_ckpt_data<0.490.0>:simple_store:delete_from_store:128]Deleting key <<"/ckpt/backfill_b67b789bcfd95e6b97c8af1e8fa5c7cb/GleamBookUsers0/GleamBookUsers0/69">> in table xdcr_ckpt_data.
      cbcollect_info_ns_1@172.23.120.170_20211014-141914/ns_server.metakv.log:[metakv:debug,2021-10-14T04:28:41.668-07:00,ns_1@172.23.120.170:<0.10544.17>:menelaus_metakv:handle_mutate:84]Updated <<"/ckpt/backfill_b67b789bcfd95e6b97c8af1e8fa5c7cb/GleamBookUsers0/GleamBookUsers0/69">>. Elapsed time:0 ms.
      

      When the backfill pipeline starts back up, we see that it has only 1 checkpoint doc (for some reason, perhaps due to a peerToPeer merge). This is not the issue, though.

      2021-10-14T04:30:50.826-07:00 INFO GOXDCR.CheckpointMgr: Found 1 checkpoint documents for BackfillPipeline replication backfill_b67b789bcfd95e6b97c8af1e8fa5c7cb/GleamBookUsers0/GleamBookUsers0
      2021-10-14T04:30:50.826-07:00 INFO GOXDCR.CheckpointMgr: BackfillPipeline backfill_b67b789bcfd95e6b97c8af1e8fa5c7cb/GleamBookUsers0/GleamBookUsers0 StartSeqnoGetter 1 is started to do _pre_prelicate for vbs [9 38 64 469 241 4 59 25 109 355 14]
      2021-10-14T04:30:50.826-07:00 INFO GOXDCR.CheckpointMgr: BackfillPipeline backfill_b67b789bcfd95e6b97c8af1e8fa5c7cb/GleamBookUsers0/GleamBookUsers0 StartSeqnoGetter 0 is started to do _pre_prelicate for vbs [50 11 17 32 30 1 8 91 104 467 61 39 36 26 7 87 20 41 468 89 31 80 77 66 88 53 23 67 82 19 0 37 78 51 102 127 62 107 60 10 28 95 44 5 42 47 74 45 93 86 581 105 16 69 58 2 99 48 15 94 27 18 100 65 40 108 106 85 92 24 79 43 582 29 71 101 90 46 33 63 52 103 21 13 81 76 126 49 22 72 6 128 35 57 3 12 68 73 34 75]
      

      The issue is that when there is no checkpoint for a VB, the checkpoint manager sets the starting timestamp's seqno to the backfill task's start seqno, but leaves snapStart and snapEnd set to 0.

      2021-10-14T04:28:45.753-07:00 INFO GOXDCR.CheckpointMgr: Checkpointing for BackfillPipeline replication backfill_b67b789bcfd95e6b97c8af1e8fa5c7cb/GleamBookUsers0/GleamBookUsers0, vb_list=[91 127 581 93 95 86 81 99 74 24 45 355 69 14 13 18 76 109 52 106 61 107], time_to_wait=0s, interval_btwn_vb=0 sec
      cbcollect_info_ns_1@172.23.120.170_20211014-141914/ns_server.goxdcr.log:2021-10-14T04:30:51.035-07:00 INFO GOXDCR.CheckpointMgr: BackfillPipeline backfill_b67b789bcfd95e6b97c8af1e8fa5c7cb/GleamBookUsers0/GleamBookUsers0 Set VBTimestamp: vb=69, ts.Seqno=61237, ts.SourceManifestId=0 ts.TargetManifestId=0
      

      This is because the backfill task's bookkeeping at this point largely does not deal with snapStart and snapEnd:

      http://src.couchbase.org/source/xref/cheshire-cat/goproj/src/github.com/couchbase/goxdcr/metadata/backfill_replication.go#1484
      In the code linked above, when we take a segment's end timestamp (seqno=X, snapStart=0, snapEnd=0) and convert it to the beginning of a new segment, we need to set snapStart and snapEnd to X so as not to violate the StreamReq protocol.
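
The conversion described above can be illustrated with a minimal Go sketch. It assumes the DCP stream request rule that snapStart <= seqno <= snapEnd must hold for the requested start point. The type `vbTimestamp` and the functions `validForStreamReq` and `startFromSegmentEnd` are hypothetical names chosen for illustration; they are not the actual goxdcr types or the actual fix.

```go
package main

import "fmt"

// vbTimestamp is a hypothetical, simplified stand-in for a per-VB
// timestamp; it is not the real goxdcr structure.
type vbTimestamp struct {
	Vbno      uint16
	Seqno     uint64
	SnapStart uint64
	SnapEnd   uint64
}

// validForStreamReq reports whether the timestamp satisfies the
// assumed stream request constraint: snapStart <= seqno <= snapEnd.
func (ts vbTimestamp) validForStreamReq() bool {
	return ts.SnapStart <= ts.Seqno && ts.Seqno <= ts.SnapEnd
}

// startFromSegmentEnd turns the end timestamp of one backfill segment
// into the start timestamp of the next one. Because the bookkeeping
// does not track snapshot bounds, both are pinned to the seqno.
func startFromSegmentEnd(end vbTimestamp) vbTimestamp {
	start := end
	start.SnapStart = end.Seqno
	start.SnapEnd = end.Seqno
	return start
}

func main() {
	// Values taken from the log line above: vb=69, ts.Seqno=61237,
	// with snapStart and snapEnd left at 0.
	end := vbTimestamp{Vbno: 69, Seqno: 61237}
	fmt.Println(end.validForStreamReq()) // false: 61237 > snapEnd (0)

	start := startFromSegmentEnd(end)
	fmt.Println(start.validForStreamReq(), start.SnapStart, start.SnapEnd) // true 61237 61237
}
```

With a start seqno of 0 the zero-valued snapshot bounds happen to be valid, which is consistent with the issue only appearing when the backfill task starts at a non-zero seqno.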

            People

              neil.huang Neil Huang