Details
Bug
Resolution: Fixed
Critical
7.0.2, 7.1.0
Triaged
1
Yes
KV 2022-Jan
Description
I believe this was introduced by MB-47267.
Warmup skips deleted items when scanning disk: https://github.com/couchbase/kv_engine/blob/a6acea19e938412df114fe77dfa6a408c2d92424/engines/ep/src/warmup.cc#L517-L524
The crux of the problem is that ScanContext::lastReadSeqno does not move when the scan sees deleted items in this case. CouchKVStore passes the filter down to couchstore, so LoadStorageKVPairCallback is not invoked until a non-deleted item is found. MagmaKVStore filters out the deletes itself and moves on to the next item. For both KVStores, a resumed scan starts from lastReadSeqno + 1 if lastReadSeqno != 0. During warmup we pause a scan once it has been running for more than a fixed amount of time. For Backfill tasks that time is set to 10 milliseconds.
https://github.com/couchbase/kv_engine/blob/a6acea19e938412df114fe77dfa6a408c2d92424/engines/ep/src/warmup.cc#L969-L974
If we have an on disk structure as follows:
[1:alive, 2:deleted, 3:deleted, ..., n:deleted, n+1:alive]
Then we can end up in a scenario where lastReadSeqno is set to 1 when the first item is read, and that item is warmed up. If the scan of items 2 to n takes more than 10 milliseconds, Warmup decides to pause the scan when it reaches the item at n+1. Because lastReadSeqno is not updated while scanning 2 to n, the scan restarts from 2 rather than n+1. If disk is consistently slow, warmup can hang indefinitely as each scan repeats over the same range of deleted items.
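To make the loop concrete, here is a minimal, self-contained model of the resume logic described above. It is a sketch under stated assumptions, not the kv_engine code: Item, ScanStatus, scan(), passesToFinish() and makeDisk() are hypothetical names, this ScanContext is a stripped-down stand-in for the real one, and wall-clock time is replaced by an abstract per-item read cost. The advanceOnDeleted flag models one possible remedy (moving lastReadSeqno past skipped deletes); the ticket does not say that this is how the fix was implemented.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical, simplified stand-ins for the real kv_engine types.
struct Item {
    uint64_t seqno;
    bool deleted;
    int readCost; // simulated time units needed to read this item from disk
};

struct ScanContext {
    uint64_t lastReadSeqno = 0;
};

enum class ScanStatus { Success, Again };

// Runs one scan pass. A pass resumes from lastReadSeqno + 1 when
// lastReadSeqno != 0. Deleted items are skipped; advanceOnDeleted controls
// whether skipping a delete also moves lastReadSeqno. The pass pauses
// (returns Again) when it reaches a live item after exceeding the budget.
ScanStatus scan(const std::vector<Item>& disk,
                ScanContext& ctx,
                int budget,
                bool advanceOnDeleted,
                std::vector<uint64_t>& loaded) {
    assert(budget > 0);
    const uint64_t start = ctx.lastReadSeqno != 0 ? ctx.lastReadSeqno + 1 : 0;
    int elapsed = 0;
    for (const auto& item : disk) {
        if (item.seqno < start) {
            continue; // before the resume point
        }
        elapsed += item.readCost;
        if (item.deleted) {
            if (advanceOnDeleted) {
                ctx.lastReadSeqno = item.seqno;
            }
            continue; // the callback is never invoked for deletes
        }
        if (elapsed > budget) {
            return ScanStatus::Again; // pause without loading this item
        }
        ctx.lastReadSeqno = item.seqno;
        loaded.push_back(item.seqno);
    }
    return ScanStatus::Success;
}

// Returns the number of passes needed to finish, or -1 if the scan has not
// finished after maxPasses.
int passesToFinish(const std::vector<Item>& disk,
                   int budget,
                   bool advanceOnDeleted,
                   int maxPasses) {
    ScanContext ctx;
    std::vector<uint64_t> loaded;
    for (int pass = 1; pass <= maxPasses; ++pass) {
        if (scan(disk, ctx, budget, advanceOnDeleted, loaded) ==
            ScanStatus::Success) {
            return pass;
        }
    }
    return -1;
}

// Builds [1:alive, 2:deleted, ..., n:deleted, n+1:alive], each costing 1 unit.
std::vector<Item> makeDisk(uint64_t n) {
    std::vector<Item> disk;
    disk.push_back({1, false, 1});
    for (uint64_t s = 2; s <= n; ++s) {
        disk.push_back({s, true, 1});
    }
    disk.push_back({n + 1, false, 1});
    return disk;
}
```

In this model, passesToFinish(makeDisk(20), 10, false, 100) never completes: every pass restarts at seqno 2 and uses up the whole budget on the deleted range before it reaches seqno 21. With advanceOnDeleted set to true, the second pass resumes at seqno 21 and finishes.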