  1. Couchbase Server
  2. MB-42610

Compaction driven expiration of SyncWrite can break HashTable constraint post warmup of incomplete disk snapshot

Details

    • Triaged
    • 1
    • No
    • KV-Engine Sprint 2020-Dec, KV-Engine 2021-Jan

    Description

      The normal compaction expiry path is broken if we warm up an incomplete disk snapshot that contains a prepare. In this case, expiry can replace the prepared item (which has the same CAS as the committed item) with a newly deleted item. This is incorrect: if the committed item was also warmed up, the HashTable ends up holding two committed items for the same key.

      We likely haven't seen this yet, as it would require us to fail over enough nodes to flip a partially up-to-date replica to active.
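
      To make the failure mode concrete, here is a toy Python model. It is not the kv_engine code: the names `Item`, `ToyHashTable`, `expire_buggy` and `expire_guarded` are hypothetical, and the assumption that the expiry lookup matches on key and CAS alone (and happens to reach the prepare first) is made purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class Item:
    key: str
    cas: int
    committed: bool  # False means a pending prepare (SyncWrite)
    deleted: bool = False


class ToyHashTable:
    """Toy table; the intended constraint is one committed item per key."""

    def __init__(self):
        self.items = []

    def committed_count(self, key):
        return sum(1 for i in self.items if i.key == key and i.committed)

    def find(self, key, committed):
        for i in self.items:
            if i.key == key and i.committed == committed:
                return i
        return None


def expire_buggy(ht, key, cas):
    """Expire the first item matching key and CAS, whatever its state."""
    for idx, item in enumerate(ht.items):
        if item.key == key and item.cas == cas:
            # A prepare with a matching CAS becomes a committed tombstone.
            ht.items[idx] = Item(key, cas + 1, committed=True, deleted=True)
            return True
    return False


def expire_guarded(ht, key, cas):
    """Assumed guard: only ever expire the committed item, never a prepare."""
    item = ht.find(key, committed=True)
    if item is None or item.cas != cas:
        return False
    item.deleted = True
    item.cas = cas + 1
    return True


def warmed_up_table():
    """State after warming up a partial snapshot: prepare plus committed item."""
    ht = ToyHashTable()
    ht.items.append(Item("k", cas=100, committed=False))  # prepare
    ht.items.append(Item("k", cas=100, committed=True))   # committed, same CAS
    return ht


buggy = warmed_up_table()
expire_buggy(buggy, "k", 100)
print(buggy.committed_count("k"))  # 2: constraint broken

fixed = warmed_up_table()
expire_guarded(fixed, "k", 100)
print(fixed.committed_count("k"))  # 1: prepare left untouched
```

      The guarded variant is only one possible reading of "do not expire prepares when committed items exist"; the real change may handle the prepare differently.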

        Activity

          build-team Couchbase Build Team added a comment:
          Build couchbase-server-6.6.2-9420 contains kv_engine commit 43c3197 with commit message:
          [BP] MB-42610: Do not expire prepares when committed items exist

          ben.huddleston Ben Huddleston added a comment:
          One merge is outstanding (MH->master), but this should be fixed in all relevant branches now.

          ben.huddleston Ben Huddleston added a comment:
          Ashwin Govindarajulu repro steps:

          1. Set up a cluster without replicas
          2. Write a load of data
          3. Write a load of durable writes to the same keys, with durability level PersistToMajority and with expiry values set
          4. Add a replica
          5. Before the replica build completes (but after it starts persisting the prepares of the durable writes), kill the original node and don't bring it back
          6. Kill the replica and let it warm up as the active
          7. Wait for the expiries to process (trigger with gets, the pager, or compaction)

          After the fix we shouldn't crash during the expiries.

          build-team Couchbase Build Team added a comment:
          Build couchbase-server-7.0.0-4257 contains kv_engine commit 43c3197 with commit message:
          [BP] MB-42610: Do not expire prepares when committed items exist

          ben.huddleston Ben Huddleston added a comment:
          Description for release notes:

          Summary: Known Issue. If a replica vBucket is promoted to active having received only a partial backfill (a data loss scenario), a subsequent expiration of an item could expire a pending durable write if it has the same CAS. Any future lookup of, or write to, that key would then cause memcached to crash.

          Workaround: Avoid use of expiry with durable writes.

          People

            ben.huddleston Ben Huddleston
            ben.huddleston Ben Huddleston