  1. Couchbase Server
  2. MB-23503

Insufficient Item removal from HashTable when Rollback to a point in an unpersisted snapshot


Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: 4.6.2, 5.0.0
    • Affects Version/s: 4.0.0, 4.5.0, 4.6.1
    • Component/s: couchbase-bucket
    • None

    Description

      Summary

      When a rollback occurs, stale data that should have been discarded can remain in the HashTable. Clients that access the affected sequence numbers will read data that is not part of the current timeline; in effect, the data has come back from the dead.

      Details

      When a rollback request targets a point inside an unpersisted snapshot, we remove hash table items only down to the requested rollback point, rather than removing all unpersisted items (http://src.couchbase.org/source/xref/watson/ep-engine/src/ep.cc#3994).

      Example (numbers below are sequence numbers):

      • Node A (Active): 1 to 60;
      • Node B (Replica 1): gets 1 to 55;
      • Node C (Replica 2): gets 1 to 60 and persists till 50.

      Now, if Node A (active) goes down and Node B is promoted to active, Node C will get a request to roll back to 55, but will actually roll back to its last disk snapshot at 50 (http://src.couchbase.org/source/xref/watson/ep-engine/src/ep.cc#3996). However, only hash table items with sequence numbers > 55 are removed (http://src.couchbase.org/source/xref/watson/ep-engine/src/ep.cc#3953). (This was done as part of the fix for MB-21568.)

      So items 51 to 55 remain in the HashTable on Node C when they should have been removed. If Node C is subsequently promoted to active, clients can access this "from the dead" data.
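
      The scenario above can be modelled in a few lines. This is only an illustrative sketch, not ep-engine code: the `rollback` function and its `purge_from` parameter are hypothetical names, and it assumes that the on-disk state simply ends at the lower of the requested and persisted sequence numbers.

```python
def rollback(hash_table, requested_seqno, persisted_seqno, purge_from):
    """Toy model of a rollback on a replica (hypothetical, not ep-engine).

    hash_table      -- set of sequence numbers currently held in memory
    requested_seqno -- rollback point requested by the new active node
    persisted_seqno -- highest sequence number in the last disk snapshot
    purge_from      -- in-memory items with seqno > purge_from are dropped
    Returns (seqno the disk state ends at, surviving in-memory seqnos).
    """
    # Assumption: disk can only go back to what was persisted.
    disk_seqno = min(requested_seqno, persisted_seqno)
    survivors = {s for s in hash_table if s <= purge_from}
    return disk_seqno, survivors


# Node C from the example: holds 1..60 in memory, persisted up to 50,
# and is asked to roll back to 55.
node_c = set(range(1, 61))
requested, persisted = 55, 50

# Behaviour described in this report: purge only items above the request.
disk, mem = rollback(node_c, requested, persisted, purge_from=requested)
stale = sorted(s for s in mem if s > disk)

# Expected behaviour: purge everything above the point disk rolled back to.
disk_fixed, mem_fixed = rollback(
    node_c, requested, persisted, purge_from=min(requested, persisted)
)
stale_fixed = sorted(s for s in mem_fixed if s > disk_fixed)

print(disk, stale)              # 50 [51, 52, 53, 54, 55]
print(disk_fixed, stale_fixed)  # 50 []
```

      In this model, the stale items are exactly those whose sequence number lies between the disk rollback point and the requested rollback point.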

      This is not an immediate data loss, but the HashTable exposes items that must not be there until they are evicted or overwritten by a new value. It can also eventually cause inconsistencies or data loss when item metadata from the HashTable is used, for example for conflict resolution or CAS checks.
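
      As a rough illustration of how stale metadata could skew a later decision, the sketch below assumes a simplified "higher revision wins, then higher CAS" rule. The `Meta` type, the `incoming_wins` function, and all revision and CAS values are hypothetical and are not taken from ep-engine.

```python
from typing import NamedTuple


class Meta(NamedTuple):
    """Hypothetical per-item metadata used for conflict resolution."""
    rev_seqno: int
    cas: int


def incoming_wins(local: Meta, incoming: Meta) -> bool:
    # Simplified rule (an assumption): compare rev_seqno first, then cas.
    # NamedTuple instances compare field by field, like plain tuples.
    return incoming > local


# Metadata the node should hold after rolling back to its disk snapshot.
on_disk = Meta(rev_seqno=3, cas=1000)
# Metadata left behind in the hash table by the incomplete rollback.
stale = Meta(rev_seqno=5, cas=1200)
# A mutation arriving from the current timeline.
incoming = Meta(rev_seqno=4, cas=1100)

accepted_with_correct_meta = incoming_wins(on_disk, incoming)
accepted_with_stale_meta = incoming_wins(stale, incoming)

print(accepted_with_correct_meta)  # True
print(accepted_with_stale_meta)    # False
```

      Under these assumed values, the incoming mutation would be accepted against the correct metadata but rejected against the stale entry, which is one way an inconsistency could surface later.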

      Likelihood
      This can happen quite easily whenever a data node rolls back (during a hard failover), although rollbacks of data nodes are rare.
      It could also go unnoticed when it happens and only later manifest as a data inconsistency or data loss.

       

            People

              Assignee: Manu Dhundi (Inactive)
              Reporter: Manu Dhundi (Inactive)