Details
- Bug
- Resolution: Fixed
- Critical
- 6.5.0
- Triaged
- Unknown
- KV Sprint 2020-March
Description
A bug in EPBucket can cause the RangeInfo of a Disk-Snapshot to be lost when the snapshot is persisted on a Replica.
This can happen when CouchKVStore::commit fails and the flush then succeeds on a retry.
- Assume no item on disk.
- Replica receives Snap {RangeInfo, {P1, M2}}
- The flusher runs
- It gets the RangeInfo + items to persist from the CheckpointManager
- It tries CouchKVStore::commit, which fails
- At that point in the flusher we add the failed-to-commit items into the VBucket::rejectQueue for deferred processing. Note that the rejectQueue contains only items, no RangeInfo.
- The flusher runs again
- This time it gets the items to persist from the VBucket::rejectQueue
- The flush succeeds
- But the RangeInfo has not been persisted to disk.
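The sequence above can be reproduced with a small toy model. This is only a sketch under stated assumptions: `ToyDisk`, `ToyVBucket`, `runScenario` and all field names are hypothetical simplifications and not the real kv_engine classes, and the seqnos 1 and 2 for P1 and M2 are assumed for illustration. It only shows how a retry path that re-queues items without the RangeInfo ends up persisting the items but not the range.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <string>
#include <vector>

// Toy model only: every type and name here is hypothetical.
struct RangeInfo {
    uint64_t snapStart;
    uint64_t snapEnd;
};

struct Item {
    std::string key;
    uint64_t seqno;
};

struct ToyDisk {
    std::vector<Item> items;
    bool hasRange = false;
    RangeInfo range{0, 0};
    int failuresLeft = 0; // how many upcoming commits will fail

    // Persists the batch and, if provided, the range. Fails atomically.
    bool commit(const std::vector<Item>& batch, const RangeInfo* r) {
        if (failuresLeft > 0) {
            --failuresLeft;
            return false;
        }
        items.insert(items.end(), batch.begin(), batch.end());
        if (r != nullptr) {
            range = *r;
            hasRange = true;
        }
        return true;
    }
};

struct ToyVBucket {
    std::deque<Item> checkpoint;  // stands in for the CheckpointManager
    bool checkpointHasRange = false;
    RangeInfo checkpointRange{0, 0};
    std::deque<Item> rejectQueue; // items only, no RangeInfo

    bool flush(ToyDisk& disk) {
        std::vector<Item> batch;
        bool haveRange = false;
        RangeInfo range{0, 0};
        if (!rejectQueue.empty()) {
            // Retry path: only items are available here.
            batch.assign(rejectQueue.begin(), rejectQueue.end());
            rejectQueue.clear();
        } else {
            // Normal path: items and RangeInfo come from the checkpoint.
            batch.assign(checkpoint.begin(), checkpoint.end());
            checkpoint.clear();
            haveRange = checkpointHasRange;
            range = checkpointRange;
            checkpointHasRange = false;
        }
        if (!disk.commit(batch, haveRange ? &range : nullptr)) {
            // The items are re-queued, but the RangeInfo is dropped here.
            rejectQueue.assign(batch.begin(), batch.end());
            return false;
        }
        return true;
    }
};

// Replays the steps: first commit fails, second flush succeeds.
ToyDisk runScenario() {
    ToyDisk disk;
    disk.failuresLeft = 1;
    ToyVBucket vb;
    vb.checkpoint.push_back(Item{"P1", 1});
    vb.checkpoint.push_back(Item{"M2", 2});
    vb.checkpointRange = RangeInfo{1, 2};
    vb.checkpointHasRange = true;
    bool first = vb.flush(disk);
    bool second = vb.flush(disk);
    assert(!first && second);
    return disk;
}
```

In this model, after `runScenario()` the disk holds both items but `hasRange` is still false, which mirrors the last step above. One way to avoid that in the toy model would be to keep the RangeInfo alongside the rejected items so the retry can persist it; whether that matches the actual fix is not stated here.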
RangeInfo contains a number of things:
- Snap start/end - Losing these could lead to a number of issues, e.g. they are used to determine the failover branch point at active->replica promotion (MB-35003)
- HCS - Failing to persist the HCS could lead to the same error scenario already seen in MB-36971
- max-deletion-rev-seqno - Failing to persist it may break what we wanted to prevent by fixing MB-31450