Details
-
Bug
-
Resolution: Fixed
-
Critical
-
5.5.6, 6.0.3, 6.5.1, 6.5.0
-
None
-
Untriaged
-
1
-
Unknown
-
KV Sprint 2020-June
Description
Summary
There's a bug / limitation in how we handle replicating an item's deletion time over DCP, if the item was deleted before destination node started. The effect of this is that the deletion time of the item on the replica is set to memcached uptime + 1, and hence won't be subject to metadata purging until the entire purge interval has passed, even if the actual item was deleted a long time ago.
Workaround
a) The system should "fix" itself once the new node has been running for the Ephemeral metadata purge interval (currently the default 1 day).
b) Alternatively reducing the metadata purge interval to a lower value (e.g. 1 hour) will cause the newly created tombstones to be purged earlier - although that still requires the destination node has been up for at least 1 hour before anything is purged.
Details
Consider the following example:
- ep_ephemeral_metadata_purge_age is set to 1 day (86400 seconds).
- The active cluster node was started at 16th June 2010 09:00.
- A document is deleted (either explicitly or via expiration) at 09:10.
- Internally this results in the ephemeral OSV having a delete_time of 600 - i.e. 600 seconds since the node started (09:10 - 09:00).
Next, a new node is introduced (say as part of a swap-rebalance):
- New node is added as replica the next day - 17th June 2020 09:00 - or in Unix time 1592384400.
- Active sends the above deletion via DCP as a DCP_DELETION with delete_time represented as a time_t (seconds since Unix epoch). This will have a value of 1592298600 - 2020-06-16T09:10.
- Replica recieves the DCP_DELETION, time is kept as time_t during DCP message handling.
- DCP_DELETION flows through system until it reaches VBucket::processSoftDelete where the delete_time (stored in ItemMetaData::exptime) is transferred into the (Ordered)StoredValue which is about to be deleted. It is still in time_t format so far.
- Eventually we end up in EphemeralVBucket::softDeleteStoredValue which completes updating the in-memory structures. At this point the "exptime" is used to set the document's deleted time - which is used to determine when it can be purged:
// Replica/DelWithMeta can dictate the tombstone time, check for it.
if (queueItmCtx.generateDeleteTime == GenerateDeleteTime::No &&
newSv->isDeleted() && newSv->getExptime()) {
// The deleted time is relative and the replicated tombstone time is
// absolute and held in the expiry field, convert the abs to rel
// using ep_reltime
newSv->toOrderedStoredValue()->setDeletedTime(
ep_reltime(newSv->getExptime(), {}));
}
- The last two lines convert the absolute time_t to a server-relative time, via ep_reltime() which is a function pointer to mc_time_convert_to_real_time(). However, in this situation the deletion time (1592298600) is less than the server start time (1592384400), and the following case is hit:
mc_time_convert_to_real_time()
/* if item expiration is at/before the server started, give it an
expiration time of 1 second after the server started.
(because 0 means don't expire). without this, we'd
underflow and wrap around to some large value way in the
future, effectively making items expiring in the past
really expiring never */
if (t <= epoch) {
rv = (rel_time_t)1;
} else {
rv = (rel_time_t)(t - epoch);
}
- As a result, the Document's deletion time is set to 1 second after server startup - or 17th June 2020 09:00:01 - when it should be 16th June 2010 09:10.
As a consequence, instead of the tombstone being expired at 17th June 2020 09:10 (i.e. 10 minutes after it arrived), it isn't expired until 24hours-1second later.
Impact
Tombstones on newly-built replicas are purged (removed from memory) much later than they should be. Given that it is now allowed to remove tombstones before the metadata purge interval has passed, this can result in significantly more memory being consumed by tombstones than should be.
- For LRU ephemeral buckets, this can result in actual "live" documents being deleted to make space for all the extra tombstones. In extreme situations this could end up with all live items being deleted, just to hold the tombstones. Even then that might result in memory over the high watermark and replication stopping.
- For Fail-new-data buckets there is no ability to free memory automatically (by deleting items), so the memory usage will likely increase quicker.
Steps to Reproduce
- Start a 2 node cluster and create an ephemeral bucket with 1 replica. Reduce bucket's purge interval to a lower number (say 5 minutes).
- Store a document in the bucket with a 10s TTL - e.g.:
$ cbc-create -U couchbase://node1/default -u Administrator -P asdasd DJR --expiry 10 -V test_value
DJR Stored. CAS=0x1619e6a2e37c0000
- Wait 10s for the document to expire - verify that it has indeed expired with cbc-cat:
$ cbc-cat -U couchbase://node1/default -u Administrator -P asdasd DJR
DJR LCB_KEY_ENOENT (0x0D)
- Verify the document is marked as deleted on both active and replica nodes, and has the same del_time - 80 in this case. Note if you use a different key you'll have to update the vbucket in "_hash-dump" from 801.
$ cbstats node-1 -u Administrator -p asdasd raw "_hash-dump 801"
vb:801: HashTable[0x10da89e20] with numItems:1 numInMemory:1 numDeleted:1 numNonResident:0 numTemp:0 numSystemItems:0 numPreparedSW:0 values:
OSV @0x10dc78960 ... WDR.Cm.TTL temp: seq:2 rev:2 cas:1592557541450579968 key:"cid:0x0:DJR, size:4" del_time:80 age:0 nru:0 fc:0 vallen:0 prepareSeqno: 0
$ cbstats node-2 -u Administrator -p asdasd raw "_hash-dump 801"
vb:801: HashTable[0x10da89e20] with numItems:1 numInMemory:1 numDeleted:1 numNonResident:0 numTemp:0 numSystemItems:0 numPreparedSW:0 values:
OSV @0x10dc78960 ... WDR.Cm.TTL temp: seq:2 rev:2 cas:1592557541450579968 key:"cid:0x0:DJR, size:4" del_time:80 age:0 nru:0 fc:0 vallen:0 prepareSeqno: 0
- Wait 2 minutes.
- Kill the memcached process on the node which is the replica for vb:801. cbc-hash can be used to determine this:
$ cbc-hash -U couchbase://node-1/default -u Administrator -P asdasd DJR
DJR: [vBucket=801, Index=1] Server: node-2:12002, CouchAPI: http://localhost:9501/default
Replica #0: Index=0, Host=node-1:12000
In this instance the replica is node-1.
- Wait a further 3 minutes - such that the metadata purge interval of the original active node's delete has passed. Verify the active node's tombstone has been purged:
$ cbstats node-2:12000 -u Administrator -p asdasd raw "_hash-dump 801"
vb:801: HashTable[0x106525e20] with numItems:0 numInMemory:0 numDeleted:0 numNonResident:0 numTemp:0 numSystemItems:0 numPreparedSW:0 values:
- Check the tombstone status on the replica node.
Expected Behaviour
The replica tombstone should also be purged:
$ cbstats node-1:12000 -u Administrator -p asdasd raw "_hash-dump 801"
|
vb:801: HashTable[0x106525e20] with numItems:0 numInMemory:0 numDeleted:0 numNonResident:0 numTemp:0 numSystemItems:0 numPreparedSW:0 values:
|
Actual Behaviour
The replica tombstone still exists:
$ cbstats node-1:12000 -u Administrator -p asdasd raw "_hash-dump 801"
|
vb:801: HashTable[0x10aeeb420] with numItems:1 numInMemory:1 numDeleted:1 numNonResident:0 numTemp:0 numSystemItems:0 numPreparedSW:0 values:
|
OSV @0x10a422240 ... WDR.Cm.TTL temp: seq:2 rev:2 cas:1592557541450579968 key:"cid:0x0:DJR, size:4" del_time:1 age:0 nru:0 fc:0 vallen:0 prepareSeqno: 0
|
Note: With the time values above you have just 5 minutes to complete the entire test, and only 3 minutes to check the replica. Adjust the metadata purge interval / wait time as desired depending on how quick you are in running the steps.
Attachments
Issue Links
- relates to
-
MB-34262 Ephemeral tombstone's delete time can be incorrect
- Closed