Couchbase Server / MB-37546

Flusher may fail to persist Range information when re-attempting after couchstore failure

Details

    • Triaged
    • Unknown
    • KV Sprint 2020-March

    Description

      A bug in EPBucket can cause the RangeInfo of a Disk-Snapshot to be lost when a Replica persists it.
      This can happen when CouchKVStore::commit fails and the flush then succeeds on a retry.

      1. Assume there are no items on disk.
      2. Replica receives Snap {RangeInfo, {P1, M2}}
      3. The flusher runs
        1. It gets the RangeInfo + items to persist from the CheckpointManager
        2. It tries CouchKVStore::commit, which fails
      4. At that point the flusher adds the failed-to-commit items to the VBucket::rejectQueue for deferred processing. Note that the rejectQueue contains only items, not the RangeInfo.
      5. The flusher runs again
        1. This time it gets the items to persist from the VBucket::rejectQueue
        2. The flush succeeds
        3. But the RangeInfo has not been persisted to disk.

       

      RangeInfo contains a number of things:

      • Snap start/end - If these are missing, a number of issues can follow; e.g. they are used to determine the failover branch point at active->replica promotion (MB-35003)
      • HCS - Failing to persist the HCS could lead to the same error scenario already seen in MB-36971
      • max-deletion-rev-seqno - Failing to persist it may reintroduce the problem that the fix for MB-31450 was meant to prevent
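
      The failure sequence in the steps above can be illustrated with a small toy model. This is only a sketch under stated assumptions: ToyVBucket, Disk, Item, flushOnce and reproduceLostRange are hypothetical stand-ins and not the real kv_engine classes or functions, the RangeInfo field names are invented from the list above, the seqno values are arbitrary, and the commit outcome is passed in as a flag rather than coming from CouchKVStore::commit. It assumes a C++17 compiler (for std::optional).

```cpp
// Toy model only: hypothetical stand-ins, not the real kv_engine types.
#include <cassert>
#include <cstdint>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Field names are assumptions derived from the RangeInfo contents listed
// in this issue (snap start/end, HCS, max-deletion-rev-seqno).
struct RangeInfo {
    uint64_t snapStart;
    uint64_t snapEnd;
    uint64_t hcs;
    uint64_t maxDeletedRevSeqno;
};

struct Item {
    std::string key;
    uint64_t seqno;
};

// Stand-in for what a successful commit leaves on disk.
struct Disk {
    std::vector<Item> items;
    std::optional<RangeInfo> range;
};

// Stand-in for a vbucket: a pending snapshot plus a reject queue that
// holds items only (no RangeInfo), as described in step 4.
struct ToyVBucket {
    std::optional<RangeInfo> pendingRange;
    std::vector<Item> pendingItems;
    std::vector<Item> rejectQueue;
};

// One flusher run. commitSucceeds simulates the outcome of the commit.
// Returns true if the batch reached disk.
bool flushOnce(ToyVBucket& vb, Disk& disk, bool commitSucceeds) {
    std::vector<Item> batch;
    std::optional<RangeInfo> range;

    if (!vb.rejectQueue.empty()) {
        // Retry path: items come from the reject queue, which has no range.
        batch.swap(vb.rejectQueue);
    } else {
        // Normal path: items and range are taken together.
        batch.swap(vb.pendingItems);
        range = vb.pendingRange;
        vb.pendingRange.reset();
    }

    if (!commitSucceeds) {
        // Only the items are kept for the retry; the local range is dropped.
        vb.rejectQueue = std::move(batch);
        return false;
    }

    disk.items.insert(disk.items.end(), batch.begin(), batch.end());
    if (range) {
        disk.range = range;
    }
    return true;
}

// Replays steps 1-5: empty disk, one snapshot, failed flush, successful retry.
Disk reproduceLostRange() {
    ToyVBucket vb;
    vb.pendingRange = RangeInfo{1, 2, 0, 0}; // arbitrary illustrative values
    vb.pendingItems.push_back(Item{"P1", 1});
    vb.pendingItems.push_back(Item{"M2", 2});

    Disk disk;
    flushOnce(vb, disk, /*commitSucceeds=*/false);
    flushOnce(vb, disk, /*commitSucceeds=*/true);
    return disk;
}
```

      In this model, calling reproduceLostRange() returns a Disk that holds both items but no RangeInfo, which mirrors step 5.3. The model says nothing about how the real flusher should be fixed; it only makes visible that the range is dropped because the retry path reads from a queue that never stored it.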

          People

            paolo.cocchi Paolo Cocchi