Details
- Bug
- Resolution: Duplicate
- Critical
- 3.0
- Security Level: Public
- None
- Untriaged
- Unknown
Description
From Alk,
I spotted this:
MB-11085: Always create a new failover entry on unclean shutdowns
In the past we wouldn't generate a new failover entry if the high
seqno number on disk was the same after a crash. This is incorrect
because it is possible that the server did receive mutations and
replicated them without persisting them before the crash. If this
happens the consumers of upr streams will not roll back their data
properly because the failover entry will not change on the server.
Change-Id: I8c6bab504f0be3298e1e888dbe6f3fac9c3fa905
Reviewed-on: http://review.couchbase.org/37670
Reviewed-by: Chiyoung Seo <chiyoung@couchbase.com>
Tested-by: Michael Wiederhold <mike@couchbase.com>
And tried its behavior in practice. It looks like it has reverted to the old behavior, where it silently overwrites the uuid of the last failover-history entry if the last seqno equals that entry's seqno.
Thinking about this more I believe it might be fine. But it has interesting consequences.
If I understand the failover-history entry's seqno as "the seqno just before the start of a new failover 'era'", then it appears perfectly fine to do that.
However, I think I'll need to change my code to accommodate that, and some other upr consumers might have to as well. This is because my checkpointing code assumes that the latest seqno always "belongs" to the latest failover-history entry, which is clearly not the case when last seqno = seqno-of-last-failover-history-entry. In the latter case the seqno actually belongs to the previous entry.
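To make the corner case concrete, here is a minimal Python sketch of how a consumer could decide which failover-history entry a seqno belongs to under the "seqno just before the start of a new era" reading. The names `FailoverEntry` and `owning_entry` are made up for illustration and are not part of ep-engine or upr. The log is assumed to be ordered oldest first with non-decreasing seqnos.

```python
from typing import List, NamedTuple


class FailoverEntry(NamedTuple):
    """Hypothetical model of one failover-history entry."""
    uuid: int
    seqno: int  # last seqno *before* this entry's era began


def owning_entry(log: List[FailoverEntry], seqno: int) -> FailoverEntry:
    """Return the entry whose era contains `seqno`.

    Assumes `log` is ordered oldest first with non-decreasing seqnos.
    A seqno equal to an entry's seqno belongs to the previous era,
    so the comparison is strict.
    """
    if not log:
        raise ValueError("failover log must not be empty")
    owner = log[0]
    for entry in log[1:]:
        if entry.seqno < seqno:
            owner = entry
        else:
            break
    return owner


log = [FailoverEntry(uuid=0xA, seqno=0), FailoverEntry(uuid=0xB, seqno=100)]
boundary_owner = owning_entry(log, 100)   # belongs to the previous entry (0xA)
later_owner = owning_entry(log, 101)      # belongs to the latest entry (0xB)
```

A checkpoint taken at seqno 100 would therefore need to record uuid 0xA rather than the uuid of the latest entry.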
I can adapt my code. Or we can add a simple tweak to upr where it creates an empty "bubble" seqno when it starts a new failover-history entry. In that case you will never have a restart where your last seqno = last-failover-history-entry-seqno, and there's no problem.
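The "bubble" idea can be sketched with a toy model like the one below. This is only an illustration of the proposal, not how ep-engine works; `VBucketModel` and `new_failover_entry` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class VBucketModel:
    """Toy model of a vbucket's seqno and failover log (hypothetical)."""
    high_seqno: int = 0
    # (uuid, seqno) pairs, oldest first
    failover_log: List[Tuple[int, int]] = field(default_factory=list)

    def new_failover_entry(self, new_uuid: int, bubble: bool = True) -> None:
        # Record the seqno just before the new era starts.
        self.failover_log.append((new_uuid, self.high_seqno))
        if bubble:
            # Consume one seqno without emitting a mutation, so the high
            # seqno is always strictly greater than the new entry's seqno.
            self.high_seqno += 1


vb = VBucketModel(high_seqno=100)
vb.new_failover_entry(0xB)
# vb.failover_log[-1] is (0xB, 100) and vb.high_seqno is 101
```

With the bubble in place, the latest seqno always falls inside the latest entry's era, so a consumer that pairs "latest seqno" with "latest uuid" stays correct.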
I'm pretty sure that this corner case affects not just XDCR. And I'm willing to bet that nobody handles it right yet. So we need to resolve this case ASAP.
Issue Links
- is triggered by: MB-11085 XDCR checkpointing: ep-engine does not generate new failover log after remote node crash (Closed)