Loading...

XML

Word

Printable

Details

Type: Bug
Resolution: Fixed
Priority: Critical
Fix Version/s: 6.6.0
Affects Version/s: 5.5.6, 6.0.3, 6.5.1, 6.5.0
Component/s: couchbase-bucket
Labels:
None

Triage:
Untriaged
Story Points:
1
Is this a Regression?:
Unknown
Sprint:
KV Sprint 2020-June

Description

Summary

There's a bug / limitation in how we handle replicating an item's deletion time over DCP, if the item was deleted before destination node started. The effect of this is that the deletion time of the item on the replica is set to memcached uptime + 1, and hence won't be subject to metadata purging until the entire purge interval has passed, even if the actual item was deleted a long time ago.

Workaround
a) The system should "fix" itself once the new node has been running for the Ephemeral metadata purge interval (currently the default 1 day).
b) Alternatively reducing the metadata purge interval to a lower value (e.g. 1 hour) will cause the newly created tombstones to be purged earlier - although that still requires the destination node has been up for at least 1 hour before anything is purged.

Details

Consider the following example:

ep_ephemeral_metadata_purge_age is set to 1 day (86400 seconds).
The active cluster node was started at 16th June 2010 09:00.
A document is deleted (either explicitly or via expiration) at 09:10.
- Internally this results in the ephemeral OSV having a delete_time of 600 - i.e. 600 seconds since the node started (09:10 - 09:00).

Next, a new node is introduced (say as part of a swap-rebalance):

New node is added as replica the next day - 17th June 2020 09:00 - or in Unix time 1592384400.
Active sends the above deletion via DCP as a DCP_DELETION with delete_time represented as a time_t (seconds since Unix epoch). This will have a value of 1592298600 - 2020-06-16T09:10.
Replica recieves the DCP_DELETION, time is kept as time_t during DCP message handling.
DCP_DELETION flows through system until it reaches VBucket::processSoftDelete where the delete_time (stored in ItemMetaData::exptime) is transferred into the (Ordered)StoredValue which is about to be deleted. It is still in time_t format so far.

Eventually we end up in EphemeralVBucket::softDeleteStoredValue which completes updating the in-memory structures. At this point the "exptime" is used to set the document's deleted time - which is used to determine when it can be purged:

        // Replica/DelWithMeta can dictate the tombstone time, check for it.

        if (queueItmCtx.generateDeleteTime == GenerateDeleteTime::No &&

            newSv->isDeleted() && newSv->getExptime()) {

            // The deleted time is relative and the replicated tombstone time is

            // absolute and held in the expiry field, convert the abs to rel

            // using ep_reltime

            newSv->toOrderedStoredValue()->setDeletedTime(

                    ep_reltime(newSv->getExptime(), {}));

The last two lines convert the absolute time_t to a server-relative time, via ep_reltime() which is a function pointer to mc_time_convert_to_real_time(). However, in this situation the deletion time (1592298600) is less than the server start time (1592384400), and the following case is hit:

mc_time_convert_to_real_time()
/* if item expiration is at/before the server started, give it an
expiration time of 1 second after the server started.
(because 0 means don't expire). without this, we'd
underflow and wrap around to some large value way in the
future, effectively making items expiring in the past
really expiring never */
if (t <= epoch) {
rv = (rel_time_t)1;
} else {
rv = (rel_time_t)(t - epoch);
}

As a result, the Document's deletion time is set to 1 second after server startup - or 17th June 2020 09:00:01 - when it should be 16th June 2010 09:10.

As a consequence, instead of the tombstone being expired at 17th June 2020 09:10 (i.e. 10 minutes after it arrived), it isn't expired until 24hours-1second later.

Impact

Tombstones on newly-built replicas are purged (removed from memory) much later than they should be. Given that it is now allowed to remove tombstones before the metadata purge interval has passed, this can result in significantly more memory being consumed by tombstones than should be.

For LRU ephemeral buckets, this can result in actual "live" documents being deleted to make space for all the extra tombstones. In extreme situations this could end up with all live items being deleted, just to hold the tombstones. Even then that might result in memory over the high watermark and replication stopping.
For Fail-new-data buckets there is no ability to free memory automatically (by deleting items), so the memory usage will likely increase quicker.

Steps to Reproduce

Start a 2 node cluster and create an ephemeral bucket with 1 replica. Reduce bucket's purge interval to a lower number (say 5 minutes).

Store a document in the bucket with a 10s TTL - e.g.:

$ cbc-create -U couchbase://node1/default -u Administrator -P asdasd  DJR --expiry 10  -V test_value

DJR                 Stored. CAS=0x1619e6a2e37c0000

Wait 10s for the document to expire - verify that it has indeed expired with cbc-cat:

$ cbc-cat -U couchbase://node1/default -u Administrator -P asdasd  DJR

DJR                  LCB_KEY_ENOENT (0x0D)

Verify the document is marked as deleted on both active and replica nodes, and has the same del_time - 80 in this case. Note if you use a different key you'll have to update the vbucket in "_hash-dump" from 801.

$ cbstats node-1  -u Administrator -p asdasd raw "_hash-dump 801"

 vb:801: HashTable[0x10da89e20] with numItems:1 numInMemory:1 numDeleted:1 numNonResident:0 numTemp:0 numSystemItems:0 numPreparedSW:0 values:

    OSV @0x10dc78960 ... WDR.Cm.TTL temp:    seq:2 rev:2 cas:1592557541450579968 key:"cid:0x0:DJR, size:4" del_time:80 age:0 nru:0 fc:0 vallen:0 prepareSeqno: 0

$ cbstats node-2  -u Administrator -p asdasd raw "_hash-dump 801"

 vb:801: HashTable[0x10da89e20] with numItems:1 numInMemory:1 numDeleted:1 numNonResident:0 numTemp:0 numSystemItems:0 numPreparedSW:0 values:

    OSV @0x10dc78960 ... WDR.Cm.TTL temp:    seq:2 rev:2 cas:1592557541450579968 key:"cid:0x0:DJR, size:4" del_time:80 age:0 nru:0 fc:0 vallen:0 prepareSeqno: 0

Wait 2 minutes.

Kill the memcached process on the node which is the replica for vb:801. cbc-hash can be used to determine this:

$ cbc-hash -U couchbase://node-1/default -u Administrator -P asdasd DJR

DJR: [vBucket=801, Index=1] Server: node-2:12002, CouchAPI: http://localhost:9501/default

Replica #0: Index=0, Host=node-1:12000

In this instance the replica is node-1.

Wait a further 3 minutes - such that the metadata purge interval of the original active node's delete has passed. Verify the active node's tombstone has been purged:

$ cbstats node-2:12000  -u Administrator -p asdasd raw "_hash-dump 801"

 vb:801: HashTable[0x106525e20] with numItems:0 numInMemory:0 numDeleted:0 numNonResident:0 numTemp:0 numSystemItems:0 numPreparedSW:0 values:

Check the tombstone status on the replica node.

Expected Behaviour
The replica tombstone should also be purged:

$ cbstats node-1:12000  -u Administrator -p asdasd raw "_hash-dump 801"

 vb:801: HashTable[0x106525e20] with numItems:0 numInMemory:0 numDeleted:0 numNonResident:0 numTemp:0 numSystemItems:0 numPreparedSW:0 values:

Actual Behaviour
The replica tombstone still exists:

$ cbstats node-1:12000  -u Administrator -p asdasd raw "_hash-dump 801"

 vb:801: HashTable[0x10aeeb420] with numItems:1 numInMemory:1 numDeleted:1 numNonResident:0 numTemp:0 numSystemItems:0 numPreparedSW:0 values:

    OSV @0x10a422240 ... WDR.Cm.TTL temp:    seq:2 rev:2 cas:1592557541450579968 key:"cid:0x0:DJR, size:4" del_time:1 age:0 nru:0 fc:0 vallen:0 prepareSeqno: 0

Note: With the time values above you have just 5 minutes to complete the entire test, and only 3 minutes to check the replica. Adjust the metadata purge interval / wait time as desired depending on how quick you are in running the steps.

Attachments

Issue Links

relates to

MB-34262 Ephemeral tombstone's delete time can be incorrect

Closed

Gerrit Reviews

- Issue Only
- Show All Reviews
- Show Open Reviews
- Show All Issues
- Show Open Issues

No reviews matched the request. Check your Options in the drop-down menu of this sections header.

Activity

People

Assignee:: Ashwin Govindarajulu

Reporter:: Dave Rigby (Inactive)

Votes:: 0 Vote for this issue

Watchers:: 7 Start watching this issue

Dates

Due:: 23/Jun/20

Created:: 17/Jun/20 2:18 AM

Updated:: 22/Jun/20 11:34 PM

Resolved:: 22/Jun/20 3:56 AM

Gerrit Reviews

There are no open Gerrit changes

Show There are 4 closed Gerrit changes

Hide There are 4 closed Gerrit changes

MB-39993: [Ephemeral] Handle delete_time < server startup time: Gerrit Review:

MB-39993: [Ephemeral] Handle delete_time < server startup time: Gerrit Review:

MB-39993: chrono-based implementation (for master?): Gerrit Review:

Merge branch 'mad-hatter': Gerrit Review:

[Ephemeral] DCP delete time is not handled correctly if < server start time

Details

Description

Attachments

Issue Links

Gerrit Reviews

Activity

People

Dates

Gerrit Reviews

PagerDuty