  1. Couchbase Server
  2. MB-44558

Not all HashTable stats cleared on bucket flush

Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 5.0.1, 5.1.3, Cheshire-Cat, 5.5.6, 6.5.1, 6.0.5, 6.6.1, 6.5.2
    • Fix Version/s: 6.6.2, 7.0.0
    • Component/s: couchbase-bucket
    • Triaged
    • 1
    • Yes
    • KV-Engine 2021-Feb

    Description

      HashTable::clear(), as used during Bucket flush to remove all items from the HashTable, does not reset all statistics correctly. The following statistics retain their old values:

      • numDeletedItems - used to calculate curr_items stat amongst others.
      • numSystemItems - used to calculate curr_items stat amongst others.
      • numPreparedSyncWrites - used to calculate curr_items stat amongst others.
      • metaDataMemory - used by ItemPager to calculate pageable memory.

      (Identified during investigation of MB-44452).

      This issue dates back to 5.0.0, when numDeletedItems was added to HashTable, but wasn't reset - see http://review.couchbase.org/c/ep-engine/+/74130. When subsequent similar counters were added (numSystemItems, numPreparedSyncWrites) the same pattern was repeated.
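      One way to avoid repeating this pattern is to keep all counters in a single struct, so that clearing the table resets them together. The following is a minimal sketch only, not the kv_engine implementation: TableStats, SimpleTable, addPrepare() and addDeleted() are hypothetical names, and only the counter names are taken from this issue.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical, simplified stand-in for the HashTable statistics.
// Only the field names come from the issue; the layout is an assumption.
struct TableStats {
    std::size_t numItems = 0;
    std::size_t numDeletedItems = 0;
    std::size_t numSystemItems = 0;
    std::size_t numPreparedSyncWrites = 0;
    std::size_t metaDataMemory = 0;

    // Assigning a default-constructed value resets every field at once,
    // including any counter added to the struct later.
    void reset() {
        *this = TableStats{};
    }
};

// Hypothetical table that only tracks statistics (no real storage).
class SimpleTable {
public:
    void addPrepare(std::size_t metaBytes) {
        ++stats.numItems;
        ++stats.numPreparedSyncWrites;
        stats.metaDataMemory += metaBytes;
    }

    void addDeleted(std::size_t metaBytes) {
        ++stats.numItems;
        ++stats.numDeletedItems;
        stats.metaDataMemory += metaBytes;
    }

    // Equivalent of a flush: drop everything and reset all statistics.
    void clear() {
        stats.reset();
    }

    const TableStats& getStats() const {
        return stats;
    }

private:
    TableStats stats;
};
```

      With this layout, a newly added counter cannot be missed by clear(), because the reset does not list fields individually.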

      Impact

      If a bucket is flushed when any of the above numXXXItems counts is non-zero, then the value of curr_items after the flush operation will not start at zero. This will result in the item counts for that Bucket being biased by the cleared amount, essentially forever. To encounter this, one must:

      1. Have at least one of the above counters be non-zero
      2. Issue a Flush.

      For (1), a Persistent Bucket in a quiesced state should have 0 Deleted items, 0 System items and 0 prepared SyncWrites, so the issue shouldn't occur. However, if the bucket wasn't quiesced (SyncWrites in progress, deleted items being BG fetched) then they could be non-zero and the issue could be hit.
      For Ephemeral buckets the likelihood is greater, as all three item types are kept in memory for extended periods of time.
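      The persistence of the bias can be illustrated with a small model. This is a sketch under a stated assumption: curr_items is modelled here as the total item count minus the prepared SyncWrite count, which is a simplification and not confirmed by this issue. Counters, currItems(), buggyClear(), fixedClear() and currItemsAfterFlushAndReload() are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of two of the counters named in the issue.
struct Counters {
    std::int64_t numItems = 0;
    std::int64_t numPreparedSyncWrites = 0;
};

// ASSUMPTION: curr_items is derived by subtracting the prepare count
// from the total; the real calculation involves more counters.
std::int64_t currItems(const Counters& c) {
    return c.numItems - c.numPreparedSyncWrites;
}

// Models the pre-fix behaviour: only the total is reset.
void buggyClear(Counters& c) {
    c.numItems = 0;
}

// Models the fixed behaviour: every counter is reset.
void fixedClear(Counters& c) {
    c = Counters{};
}

// Flush with some prepares in flight, then load newDocs fresh documents,
// and report the resulting modelled curr_items.
std::int64_t currItemsAfterFlushAndReload(bool fixed,
                                          std::int64_t inFlightPrepares,
                                          std::int64_t newDocs) {
    Counters c;
    c.numItems = inFlightPrepares;
    c.numPreparedSyncWrites = inFlightPrepares;

    if (fixed) {
        fixedClear(c);
    } else {
        buggyClear(c);
    }

    c.numItems += newDocs;
    return currItems(c);
}
```

      Under this model, flushing with 3 prepares in flight and then loading 10 documents reports 7 items with the buggy clear and 10 with the fixed one. The offset equals the stale counter value and does not shrink as more documents are added.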

      As such, increasing the severity to Critical, given many things in the system rely on accurate item counts.

        Activity

          build-team Couchbase Build Team added a comment - Build couchbase-server-7.0.0-4543 contains kv_engine commit 84dbad1 with commit message:
          MB-44558: HashTable: Reset all item counts on clear()

          build-team Couchbase Build Team added a comment - Build couchbase-server-6.6.2-9547 contains kv_engine commit bca68e4 with commit message:
          MB-44558: HashTable: Reset all item counts on clear()

          build-team Couchbase Build Team added a comment - Build couchbase-server-7.0.0-4569 contains kv_engine commit bca68e4 with commit message:
          MB-44558: HashTable: Reset all item counts on clear()

          build-team Couchbase Build Team added a comment - Build couchbase-server-7.0.0-4603 contains kv_engine commit bca68e4 with commit message:
          MB-44558: HashTable: Reset all item counts on clear()

          Balakumaran.Gopal Balakumaran Gopal added a comment - Dave Rigby - Could we have repro steps?

          drigby Dave Rigby added a comment (edited)

          Balakumaran Gopal I only found this in unit-tests, but you should be able to do something like:

          1. Perform SyncWrites against a bucket such that there are some prepares in flight - you can increase the chances of this by, say, slowing the disk down, or suspending (SIGSTOP) memcached on the replica node.
          2. Perform a Flush
          3. Expected result - curr_items should be zero
          4. Actual result - curr_items is non-zero.

          ashwin.govindarajulu Ashwin Govindarajulu added a comment - Closing this based on unit_tests run.

          People

            ashwin.govindarajulu Ashwin Govindarajulu
            drigby Dave Rigby