Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-48855

XDCR - incorrect checkpointing leading to resuming issue

    XMLWordPrintable

Details

    • Bug
    • Resolution: Fixed
    • Critical
    • 7.1.0
    • 6.5.1, 6.6.0, 6.6.1, 6.6.2, 6.5.2, 6.5.0, 6.6.3, 6.6.5, 7.0.0, 7.0.1, 7.0.2, 7.1.0
    • XDCR
    • Untriaged
    • 1
    • No

    Description

      Originally found via diagnosing MB-48771

      Things are timing out because some of the VBs streams are not starting while another set of them finished.

      The things not finished are something like:

      172.23.97.241_20211006-004056/memcached.log

      2021-10-05T13:47:35.061641-07:00 WARNING 7415: (bucket4) DCP (Producer) eq_dcpq:xdcr:dcp_backfill_5f4b32de7a13b85221a2eb842ed656ea/bucket4/bucket4_172.23.97.241:11210_1:65uI3Eurpo6MIy5dWToplw== - (vb:650) Stream request failed because the snap start seqno (0) <= start seqno (26003) <= snap end seqno (0) is required
      2021-10-05T13:47:35.061709-07:00 WARNING 7415: (bucket4) DCP (Producer) eq_dcpq:xdcr:dcp_backfill_5f4b32de7a13b85221a2eb842ed656ea/bucket4/bucket4_172.23.97.241:11210_1:65uI3Eurpo6MIy5dWToplw== - (vb:768) Stream request failed because the snap start seqno (0) <= start seqno (9884) <= snap end seqno (0) is required
      2021-10-05T13:47:35.061761-07:00 WARNING 7415: (bucket4) DCP (Producer) eq_dcpq:xdcr:dcp_backfill_5f4b32de7a13b85221a2eb842ed656ea/bucket4/bucket4_172.23.97.241:11210_1:65uI3Eurpo6MIy5dWToplw== - (vb:943) Stream request failed because the snap start seqno (0) <= start seqno (19933) <= snap end seqno (0) is required
      2021-10-05T13:47:35.061813-07:00 WARNING 7415: (bucket4) DCP (Producer) eq_dcpq:xdcr:dcp_backfill_5f4b32de7a13b85221a2eb842ed656ea/bucket4/bucket4_172.23.97.241:11210_1:65uI3Eurpo6MIy5dWToplw== - (vb:1008) Stream request failed because the snap start seqno (0) <= start seqno (15134) <= snap end seqno (0) is required
      2021-10-05T13:47:35.061898-07:00 WARNING 7415: (bucket4) DCP (Producer) eq_dcpq:xdcr:dcp_backfill_5f4b32de7a13b85221a2eb842ed656ea/bucket4/bucket4_172.23.97.241:11210_1:65uI3Eurpo6MIy5dWToplw== - (vb:649) Stream request failed because the snap start seqno (0) <= start seqno (29109) <= snap end seqno (0) is required
      2021-10-05T13:47:35.061965-07:00 WARNING 7415: (bucket4) DCP (Producer) eq_dcpq:xdcr:dcp_backfill_5f4b32de7a13b85221a2eb842ed656ea/bucket4/bucket4_172.23.97.241:11210_1:65uI3Eurpo6MIy5dWToplw== - (vb:767) Stream request failed because the snap start seqno (0) <= start seqno (15032) <= snap end seqno (0) is required
      2021-10-05T13:49:34.511248-07:00 WARNING 197: (bucket4) DCP (Producer) eq_dcpq:xdcr:dcp_backfill_5f4b32de7a13b85221a2eb842ed656ea/bucket4/bucket4_172.23.97.241:11210_0:TCT3izVc6n059_1LX2WuxA== - (vb:483) Stream request failed because the snap start seqno (0) <= start seqno (19276) <= snap end seqno (0) is required
      2021-10-05T13:49:34.511315-07:00 WARNING 197: (bucket4) DCP (Producer) eq_dcpq:xdcr:dcp_backfill_5f4b32de7a13b85221a2eb842ed656ea/bucket4/bucket4_172.23.97.241:11210_0:TCT3izVc6n059_1LX2WuxA== - (vb:768) Stream request failed because the snap start seqno (0) <= start seqno (9884) <= snap end seqno (0) is required
      

      Seems that the range is incorrect:

      2021-10-05T13:49:34.511-07:00 [ERROR] UPR_STREAMREQ for vbucket 483 erro unexpected status ERANGE for response with opaque 3801571
      2021-10-05T13:49:34.511-07:00 [ERROR] UPR_STREAMREQ for vbucket 768 erro unexpected status ERANGE for response with opaque 3801856
      2021-10-05T13:49:34.511-07:00 [ERROR] UPR_STREAMREQ for vbucket 243 erro unexpected status ERANGE for response with opaque 3801331
      2021-10-05T13:49:34.511-07:00 [ERROR] UPR_STREAMREQ for vbucket 767 erro unexpected status ERANGE for response with opaque 3801855
      2021-10-05T13:49:34.511-07:00 [ERROR] UPR_STREAMREQ for vbucket 161 erro unexpected status ERANGE for response with opaque 3801249
      

      Some of these VBs that never started are the same ones that timed out

      2021-10-05T13:50:35.095-07:00 ERRO GOXDCR.ThrSeqTrackSvc: ThroughSeqno backfill_5f4b32de7a13b85221a2eb842ed656ea/bucket4/bucket4_ThroughSeqnoTracker background waiting backfill streams to all be replicated timed out. Expected count: 85 actual count 69. VBs waiting on: [304 483 243 161 943 165 167 168 650 1008 244 169 768 207 649 767]
      

      The error from memcached shows that the snapstartSeqno is 0 and endSnapSqno is 0, while the actual requested seqno is not 0.
      This can happen if checkpoint manager can’t find a snapshot:

      zipgrepCbCollectXdcr "Failed to find snapshot for" *241.zip | grep bucket4 | grep 650
      cbcollect_info_ns_1@172.23.97.241_20211006-004056/ns_server.goxdcr.log:2021-10-05T13:31:37.725-07:00 WARN GOXDCR.CheckpointMgr: 5f4b32de7a13b85221a2eb842ed656ea/bucket4/bucket4 Failed to find snapshot for vb=650, seqno=37622
      cbcollect_info_ns_1@172.23.97.241_20211006-004056/ns_server.goxdcr.log:2021-10-05T14:04:57.285-07:00 WARN GOXDCR.CheckpointMgr: 5f4b32de7a13b85221a2eb842ed656ea/bucket4/bucket4 Failed to find snapshot for vb=650, seqno=40040
      cbcollect_info_ns_1@172.23.97.241_20211006-004056/ns_server.goxdcr.log:2021-10-05T14:24:45.073-07:00 WARN GOXDCR.CheckpointMgr: 5f4b32de7a13b85221a2eb842ed656ea/bucket4/bucket4 Failed to find snapshot for vb=650, seqno=40278
      cbcollect_info_ns_1@172.23.97.241_20211006-004056/ns_server.goxdcr.log:2021-10-05T15:29:20.254-07:00 WARN GOXDCR.CheckpointMgr: 5f4b32de7a13b85221a2eb842ed656ea/bucket4/bucket4 Failed to find snapshot for vb=650, seqno=42655
      

      Checkpoint manager contains a max history of 200 snapshots histories. If the system is busy and the throughSeqno falls behind (theoretically if DCP keeps sending down a bunch of small snapshots that accumulate in the history), then it is possible that by the time checkpoint happens, the snapshot has already rolled over and it can't find snapshot for a throughSeqno.

      I can only look at the DCP data chan length.
      From the graph, I can see that snapshot not found messages usually show up after a long dcpCh event:

      First instances of each cluster

      cbcollect_info_ns_1@172.23.97.241_20211006-004056/ns_server.goxdcr.log:2021-10-05T13:27:30.546-07:00 WARN GOXDCR.CheckpointMgr: 5f4b32de7a13b85221a2eb842ed656ea/bucket4/bucket4 Failed to find snapshot for vb=167, seqno=99805
      ...
      cbcollect_info_ns_1@172.23.97.241_20211006-004056/ns_server.goxdcr.log:2021-10-05T14:04:57.209-07:00 WARN GOXDCR.CheckpointMgr: 5f4b32de7a13b85221a2eb842ed656ea/bucket4/bucket4 Failed to find snapshot for vb=750, seqno=80443
      ...
      cbcollect_info_ns_1@172.23.97.241_20211006-004056/ns_server.goxdcr.log:2021-10-05T14:09:27.274-07:00 WARN GOXDCR.CheckpointMgr: backfill_5f4b32de7a13b85221a2eb842ed656ea/bucket4/bucket4 Failed to find snapshot for vb=602, seqno=66762
      

      When XDCR checkpoint manager can't find the snapshot, it doesn't return anything returns 0 instead. It is then checkpointing with the snapStart is 0 and snapEnd is 0, which is invalid for resume.
      http://src.couchbase.org/source/xref/6.6.1/goproj/src/github.com/couchbase/goxdcr/pipeline_svc/checkpoint_manager.go#1432

      According to KV documentation, https://github.com/couchbase/kv_engine/blob/master/docs/dcp/documentation/commands/stream-request.md

      PROTOCOL_BINARY_RESPONSE_ERANGE (0x22)

      This error code may be returned for one of the following reason. Check the server logs for more information on which case occurred.

      If the start and end sequence numbers are specified incorrectly. For example the start sequence number cannot be bigger than the current high sequence number in the VBucket. The start sequence number must also be less than the end sequence number.
      If the snapshot start or snapshot end seqno are not specified correctly. For all stream requests the snapshot start seqno must be less than or equal to the start seqno and the start seqno must be less than or equal to the snapshot end seqno.

      In the logs, I'm seeing that both main pipeline and backfill pipeline are seeing these "Failed to find snapshot" error... however, because backfill pipeline gets stopped and restarted at a more frequent interval than main pipeline, it becomes the pipeline that runs into the DCP error... since main pipeline keeps going.

      XDCR's OSO + throughSeqno mechanism has coordinated effort on raising snapshot markers and returning the correct throughSeqno such that it won't violate snapshot boundaries. The problem (at this point anyway) doesn't seem to be the OSO nor throughSeqno, rather just the incorrect checkpointing mechanism

      Attachments

        Issue Links

          For Gerrit Dashboard: MB-48855
          # Subject Branch Project Status CR V

          Activity

            People

              neil.huang Neil Huang
              neil.huang Neil Huang
              Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Gerrit Reviews

                  There are no open Gerrit changes

                  PagerDuty