Details
- Bug
- Resolution: Done
- Major
- None
- 3.0
- Security Level: Public
- None
- Untriaged
- No
Description
[This is more of a design question than a bug]
The dev spec (https://docs.google.com/document/d/17leftKE01b2EKt6AoO-YeNrK3LT7dSwl38F9Glt-Pbs) on checkpointing states:
"xdcr replicator may from _time to time_ perform POST to /_pre_replicate with vb, bucket and vbopaque. To verify that xdcr replicator is still talking to the same vbucket, xdcr replicator started replicating into. 200 response indicates success as usual. And 4xx indicates that remote vbucket might have lost some previously replicated mutations. And thus xdcr replication needs to be restarted from past checkpoint or from the beginning."
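For illustration, the handshake described in the quoted spec could be sketched as follows. This is a minimal Python sketch, not the actual ns_server implementation (which is Erlang); the endpoint path and status-code semantics come from the spec text above, while the function name, the `post` callable, and the return values are hypothetical.

```python
# Hypothetical sketch of the /_pre_replicate handshake from the spec.
# Per the quoted text: 200 = still talking to the same vbucket,
# 4xx = destination may have lost previously replicated mutations.

def check_pre_replicate(post, vb, bucket, vbopaque):
    """POST to /_pre_replicate and decide how replication should proceed.

    `post` is any callable returning an HTTP status code; in the real
    system this would be an HTTP client talking to the destination node.
    """
    status = post("/_pre_replicate",
                  {"vb": vb, "bucket": bucket, "vbopaque": vbopaque})
    if status == 200:
        return "continue"  # destination vbucket unchanged; keep replicating
    if 400 <= status < 500:
        # Destination may have lost mutations: restart replication from a
        # past checkpoint, or from the beginning if no checkpoint is usable.
        return "restart_from_checkpoint"
    raise RuntimeError("unexpected status %d" % status)
```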
So my understanding was that there would be more _pre_replicate calls than commit_for_checkpoint calls, and that _pre_replicate calls would happen between commit_for_checkpoint calls, so data loss at the destination could be detected without having to wait until it's time to checkpoint.
However, the checkpoint code (https://github.com/membase/ns_server/blob/12aa7bdf45434e334e2eade6e9e0c84228f0adeb/src/xdc_vbucket_rep_ckpt.erl) and tcpdumps from the destination reveal that we perform_pre_replicate only when:
1. No checkpoint is found <== this is when the replicator starts as a result of a vbucket receiving its first mutation.
2. While parsing an existing checkpoint file <== only in cases of a source node restart, a vbucket move, or a failure in commit_for_checkpoint.
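The behaviour being suggested could be sketched as interleaving extra _pre_replicate probes between checkpoint commits, rather than probing only at the two points above. This is a hypothetical Python sketch of the policy, not ns_server code; the function name and the interval parameters are illustrative assumptions.

```python
# Hypothetical sketch of the suggested policy: probe /_pre_replicate
# periodically between commit_for_checkpoint calls, instead of only at
# replicator startup or while parsing an existing checkpoint file.

def replication_events(num_batches, probe_every, checkpoint_every):
    """Yield (batch, action) for each replicated batch of mutations.

    A probe between checkpoints detects destination data loss without
    waiting for the next checkpoint interval.
    """
    for batch in range(1, num_batches + 1):
        if batch % checkpoint_every == 0:
            yield (batch, "commit_for_checkpoint")
        elif batch % probe_every == 0:
            yield (batch, "pre_replicate")  # the extra probe being proposed
        else:
            yield (batch, "replicate_only")
```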
If we pre_replicated more often, would we not detect destination data loss sooner?
Please feel free to correct me and close this issue if my understanding/reading of the Erlang code is incorrect.
Attachments
Issue Links
- is triggered by MB-10179: XDCR checkpointing: data loss at destination in cases of destination bucket delete-recreate/flush/failover may be undetected for long periods of time (Closed)