Loading...

Details

Type: Bug
Resolution: Fixed
Priority: Critical
Fix Version/s: 7.1.0
Affects Version/s: 6.5.1, 6.6.0, 6.6.1, 6.6.2, 6.5.2, 6.5.0, 6.6.3, 6.6.5, 7.0.0, 7.0.1, 7.0.2, 7.1.0
Component/s: XDCR
Labels:
- request-dev-verify

Triage:
Untriaged
Story Points:
1
Is this a Regression?:
No

Description

Originally found via diagnosing ~~MB-48771~~

Things are timing out because some of the VBs streams are not starting while another set of them finished.

The things not finished are something like:

172.23.97.241_20211006-004056/memcached.log
2021-10-05T13:47:35.061641-07:00 WARNING 7415: (bucket4) DCP (Producer) eq_dcpq:xdcr:dcp_backfill_5f4b32de7a13b85221a2eb842ed656ea/bucket4/bucket4_172.23.97.241:11210_1:65uI3Eurpo6MIy5dWToplw== - (vb:650) Stream request failed because the snap start seqno (0) <= start seqno (26003) <= snap end seqno (0) is required
2021-10-05T13:47:35.061709-07:00 WARNING 7415: (bucket4) DCP (Producer) eq_dcpq:xdcr:dcp_backfill_5f4b32de7a13b85221a2eb842ed656ea/bucket4/bucket4_172.23.97.241:11210_1:65uI3Eurpo6MIy5dWToplw== - (vb:768) Stream request failed because the snap start seqno (0) <= start seqno (9884) <= snap end seqno (0) is required
2021-10-05T13:47:35.061761-07:00 WARNING 7415: (bucket4) DCP (Producer) eq_dcpq:xdcr:dcp_backfill_5f4b32de7a13b85221a2eb842ed656ea/bucket4/bucket4_172.23.97.241:11210_1:65uI3Eurpo6MIy5dWToplw== - (vb:943) Stream request failed because the snap start seqno (0) <= start seqno (19933) <= snap end seqno (0) is required
2021-10-05T13:47:35.061813-07:00 WARNING 7415: (bucket4) DCP (Producer) eq_dcpq:xdcr:dcp_backfill_5f4b32de7a13b85221a2eb842ed656ea/bucket4/bucket4_172.23.97.241:11210_1:65uI3Eurpo6MIy5dWToplw== - (vb:1008) Stream request failed because the snap start seqno (0) <= start seqno (15134) <= snap end seqno (0) is required
2021-10-05T13:47:35.061898-07:00 WARNING 7415: (bucket4) DCP (Producer) eq_dcpq:xdcr:dcp_backfill_5f4b32de7a13b85221a2eb842ed656ea/bucket4/bucket4_172.23.97.241:11210_1:65uI3Eurpo6MIy5dWToplw== - (vb:649) Stream request failed because the snap start seqno (0) <= start seqno (29109) <= snap end seqno (0) is required
2021-10-05T13:47:35.061965-07:00 WARNING 7415: (bucket4) DCP (Producer) eq_dcpq:xdcr:dcp_backfill_5f4b32de7a13b85221a2eb842ed656ea/bucket4/bucket4_172.23.97.241:11210_1:65uI3Eurpo6MIy5dWToplw== - (vb:767) Stream request failed because the snap start seqno (0) <= start seqno (15032) <= snap end seqno (0) is required
…
2021-10-05T13:49:34.511248-07:00 WARNING 197: (bucket4) DCP (Producer) eq_dcpq:xdcr:dcp_backfill_5f4b32de7a13b85221a2eb842ed656ea/bucket4/bucket4_172.23.97.241:11210_0:TCT3izVc6n059_1LX2WuxA== - (vb:483) Stream request failed because the snap start seqno (0) <= start seqno (19276) <= snap end seqno (0) is required
2021-10-05T13:49:34.511315-07:00 WARNING 197: (bucket4) DCP (Producer) eq_dcpq:xdcr:dcp_backfill_5f4b32de7a13b85221a2eb842ed656ea/bucket4/bucket4_172.23.97.241:11210_0:TCT3izVc6n059_1LX2WuxA== - (vb:768) Stream request failed because the snap start seqno (0) <= start seqno (9884) <= snap end seqno (0) is required

Seems that the range is incorrect:

2021-10-05T13:49:34.511-07:00 [ERROR] UPR_STREAMREQ for vbucket 483 erro unexpected status ERANGE for response with opaque 3801571

2021-10-05T13:49:34.511-07:00 [ERROR] UPR_STREAMREQ for vbucket 768 erro unexpected status ERANGE for response with opaque 3801856

2021-10-05T13:49:34.511-07:00 [ERROR] UPR_STREAMREQ for vbucket 243 erro unexpected status ERANGE for response with opaque 3801331

2021-10-05T13:49:34.511-07:00 [ERROR] UPR_STREAMREQ for vbucket 767 erro unexpected status ERANGE for response with opaque 3801855

2021-10-05T13:49:34.511-07:00 [ERROR] UPR_STREAMREQ for vbucket 161 erro unexpected status ERANGE for response with opaque 3801249

Some of these VBs that never started are the same ones that timed out

2021-10-05T13:50:35.095-07:00 ERRO GOXDCR.ThrSeqTrackSvc: ThroughSeqno backfill_5f4b32de7a13b85221a2eb842ed656ea/bucket4/bucket4_ThroughSeqnoTracker background waiting backfill streams to all be replicated timed out. Expected count: 85 actual count 69. VBs waiting on: [304 483 243 161 943 165 167 168 650 1008 244 169 768 207 649 767]

The error from memcached shows that the snapstartSeqno is 0 and endSnapSqno is 0, while the actual requested seqno is not 0.
This can happen if checkpoint manager can’t find a snapshot:

zipgrepCbCollectXdcr "Failed to find snapshot for" *241.zip | grep bucket4 | grep 650

cbcollect_info_ns_1@172.23.97.241_20211006-004056/ns_server.goxdcr.log:2021-10-05T13:31:37.725-07:00 WARN GOXDCR.CheckpointMgr: 5f4b32de7a13b85221a2eb842ed656ea/bucket4/bucket4 Failed to find snapshot for vb=650, seqno=37622

cbcollect_info_ns_1@172.23.97.241_20211006-004056/ns_server.goxdcr.log:2021-10-05T14:04:57.285-07:00 WARN GOXDCR.CheckpointMgr: 5f4b32de7a13b85221a2eb842ed656ea/bucket4/bucket4 Failed to find snapshot for vb=650, seqno=40040

cbcollect_info_ns_1@172.23.97.241_20211006-004056/ns_server.goxdcr.log:2021-10-05T14:24:45.073-07:00 WARN GOXDCR.CheckpointMgr: 5f4b32de7a13b85221a2eb842ed656ea/bucket4/bucket4 Failed to find snapshot for vb=650, seqno=40278

cbcollect_info_ns_1@172.23.97.241_20211006-004056/ns_server.goxdcr.log:2021-10-05T15:29:20.254-07:00 WARN GOXDCR.CheckpointMgr: 5f4b32de7a13b85221a2eb842ed656ea/bucket4/bucket4 Failed to find snapshot for vb=650, seqno=42655

Checkpoint manager contains a max history of 200 snapshots histories. If the system is busy and the throughSeqno falls behind (theoretically if DCP keeps sending down a bunch of small snapshots that accumulate in the history), then it is possible that by the time checkpoint happens, the snapshot has already rolled over and it can't find snapshot for a throughSeqno.

I can only look at the DCP data chan length.
From the graph, I can see that snapshot not found messages usually show up after a long dcpCh event:

First instances of each cluster
cbcollect_info_ns_1@172.23.97.241_20211006-004056/ns_server.goxdcr.log:2021-10-05T13:27:30.546-07:00 WARN GOXDCR.CheckpointMgr: 5f4b32de7a13b85221a2eb842ed656ea/bucket4/bucket4 Failed to find snapshot for vb=167, seqno=99805
...
cbcollect_info_ns_1@172.23.97.241_20211006-004056/ns_server.goxdcr.log:2021-10-05T14:04:57.209-07:00 WARN GOXDCR.CheckpointMgr: 5f4b32de7a13b85221a2eb842ed656ea/bucket4/bucket4 Failed to find snapshot for vb=750, seqno=80443
...
cbcollect_info_ns_1@172.23.97.241_20211006-004056/ns_server.goxdcr.log:2021-10-05T14:09:27.274-07:00 WARN GOXDCR.CheckpointMgr: backfill_5f4b32de7a13b85221a2eb842ed656ea/bucket4/bucket4 Failed to find snapshot for vb=602, seqno=66762

When XDCR checkpoint manager can't find the snapshot, it doesn't return anything returns 0 instead. It is then checkpointing with the snapStart is 0 and snapEnd is 0, which is invalid for resume.
http://src.couchbase.org/source/xref/6.6.1/goproj/src/github.com/couchbase/goxdcr/pipeline_svc/checkpoint_manager.go#1432

According to KV documentation, https://github.com/couchbase/kv_engine/blob/master/docs/dcp/documentation/commands/stream-request.md

PROTOCOL_BINARY_RESPONSE_ERANGE (0x22)

This error code may be returned for one of the following reason. Check the server logs for more information on which case occurred.

If the start and end sequence numbers are specified incorrectly. For example the start sequence number cannot be bigger than the current high sequence number in the VBucket. The start sequence number must also be less than the end sequence number.
If the snapshot start or snapshot end seqno are not specified correctly. For all stream requests the snapshot start seqno must be less than or equal to the start seqno and the start seqno must be less than or equal to the snapshot end seqno.

In the logs, I'm seeing that both main pipeline and backfill pipeline are seeing these "Failed to find snapshot" error... however, because backfill pipeline gets stopped and restarted at a more frequent interval than main pipeline, it becomes the pipeline that runs into the DCP error... since main pipeline keeps going.

XDCR's OSO + throughSeqno mechanism has coordinated effort on raising snapshot markers and returning the correct throughSeqno such that it won't violate snapshot boundaries. The problem (at this point anyway) doesn't seem to be the OSO nor throughSeqno, rather just the incorrect checkpointing mechanism

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending
- Thumbnails
- List
- Download All

Screen Shot 2021-10-12 at 10.03.53 AM.png
417 kB
12/Oct/21 10:15 AM

Issue Links

blocks

MB-48771 [System Test][XDCR] ThroughSeqnoTracker background waiting backfill streams to all be replicated timed out. Expected count: xx actual count xx - observed in longevity

Closed

is caused by

MB-39437 Remove DCP Notifier connection

Closed

Gerrit Reviews

- Issue Only
- Show All Reviews
- Show Open Reviews
- Show All Issues
- Show Open Issues

For Gerrit Dashboard: MB-48855
#	Subject	Branch	Project	Status	CR	V
163491,5	MB-48855 - if snapshot cannot be found, then use seqno as start and end seqno	master	goxdcr	Status: MERGED	+2	+1

XDCR - incorrect checkpointing leading to resuming issue

Details

Description

Attachments

Attachments

Issue Links

Gerrit Reviews

Activity

People

Dates

Gerrit Reviews

PagerDuty