Details

Type: Bug
Resolution: Fixed
Priority: Critical
Fix Version/s: 7.2.1, 7.1.5
Affects Version/s: 7.1.4, 7.2.0
Component/s: couchbase-bucket
Labels:

Triage: Untriaged
Story Points: 0
Is this a Regression?: Yes

Description

Note: Only affects:

7.1.x - 7.1.4-MP2 and upwards
7.2.0
(There's no "7.1.4-MP2" version in Jira, only 7.1.4, so cannot express this directly.)

The kv_ep_data_read_failed stat is used by ns_server to detect sustained disk errors and fail over the node.
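To illustrate why spurious increments of this stat matter, here is a minimal sketch of how a monitor could turn a monotonically increasing failure counter into a "sustained failure" decision. It is not ns_server's actual logic: the class name `SustainedFailureDetector`, the sliding window, and the threshold are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>

// Hypothetical monitor (not ns_server code): samples a monotonically
// increasing failure counter and flags sustained failures when enough
// of the most recent samples showed an increase.
class SustainedFailureDetector {
public:
    SustainedFailureDetector(std::size_t windowSizeIn, std::size_t thresholdIn)
        : windowSize(windowSizeIn), threshold(thresholdIn) {
        assert(threshold > 0 && threshold <= windowSize);
    }

    // Record the latest counter value. Returns true when at least
    // `threshold` of the last `windowSize` intervals saw the counter grow.
    bool sample(std::uint64_t counterValue) {
        if (hasPrevious) {
            window.push_back(counterValue > previous);
            if (window.size() > windowSize) {
                window.pop_front();
            }
        }
        previous = counterValue;
        hasPrevious = true;

        std::size_t increases = 0;
        for (bool grew : window) {
            increases += grew ? 1 : 0;
        }
        return increases >= threshold;
    }

private:
    std::size_t windowSize;
    std::size_t threshold;
    std::deque<bool> window;
    std::uint64_t previous = 0;
    bool hasPrevious = false;
};
```

With a detector of this shape, any code path that bumps the counter repeatedly looks the same as a failing disk, whether or not an I/O error actually occurred.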

We increment the stat when a read fails but also when we ask the storage engine for a document which does not exist.

https://github.com/couchbase/kv_engine/blob/5e46b22da75d32691cfe5ab3c4dd8a6bafed3184/engines/ep/src/kvstore/magma-kvstore/magma-kvstore.cc#LL956C2-L956C2

We only do that for single document reads, and not for batched reads (queued BGFetches).

While this behaviour has existed since 7.0.0, before 7.1.5/7.2.0 BGFetches for expired documents were queued and used a different code path, so even if the document was not found we did not increment the stat.

However, as of MB-53898, with compaction_expiry_fetch_inline=true (the default), we now use the single-document read code path, which does report no_such_key as a read failure.
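The difference between the two paths can be sketched as follows. This is a simplified model, not kv_engine code: `ReadStatus`, `ReadStats`, `singleReadPreFix`, and `expiryFetchOfMissingDoc` are hypothetical names, and only the missing-document case is modelled. `numGetFailure` is the counter named in the Gerrit changes for this issue.

```cpp
#include <cstdint>

// Hypothetical status codes for a storage-engine read.
enum class ReadStatus { Success, NoSuchKey, IoError };

struct ReadStats {
    std::uint64_t numGetFailure = 0;
};

// Model of the single-document read path before the fix: every
// non-success status, including NoSuchKey, counts as a read failure.
ReadStatus singleReadPreFix(ReadStatus storageResult, ReadStats& stats) {
    if (storageResult != ReadStatus::Success) {
        ++stats.numGetFailure;
    }
    return storageResult;
}

// Model of compaction fetching a document for expiry when the document
// no longer exists. `fetchInline` stands in for the
// compaction_expiry_fetch_inline setting.
ReadStatus expiryFetchOfMissingDoc(bool fetchInline, ReadStats& stats) {
    if (fetchInline) {
        // Inline fetch goes through the single-document read path.
        return singleReadPreFix(ReadStatus::NoSuchKey, stats);
    }
    // Queued BGFetch path: not-found leaves the stat untouched.
    return ReadStatus::NoSuchKey;
}
```

In this model the same missing document leaves `numGetFailure` unchanged on the queued path but increments it on the inline path, which is the regression described here.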

Under Magma specifically, compaction can see a version of a document in one SST that has already been expired and had its tombstone purged, when the document's tombstone was seen in another SST. The lookup then results in no_such_key.

Given the stat is used to trigger failover, we want to only increment it when an actual IO failure occurs.
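A minimal sketch of that intent, assuming a simplified status enum: only a genuine I/O error increments the counter, while a missing document is returned to the caller without touching it. `ReadStatus`, `ReadStats`, `isReadFailure`, and `recordRead` are hypothetical names for illustration and are not taken from the merged change; `numGetFailure` is the counter named in the Gerrit subject.

```cpp
#include <cstdint>

// Hypothetical status codes for a storage-engine read.
enum class ReadStatus { Success, NoSuchKey, IoError };

struct ReadStats {
    std::uint64_t numGetFailure = 0;
};

// Only genuine I/O errors count as read failures; a missing document
// is a valid outcome of a lookup, not a disk problem.
constexpr bool isReadFailure(ReadStatus status) {
    return status == ReadStatus::IoError;
}

// Update the failure counter for one read and pass the status through.
ReadStatus recordRead(ReadStatus status, ReadStats& stats) {
    if (isReadFailure(status)) {
        ++stats.numGetFailure;
    }
    return status;
}
```

Keeping the classification in one small predicate means every read path that reports into the stat shares the same definition of "failure".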

This issue has only been observed when using the Magma storage engine.

Issue	Resolution
A spurious auto-failover could happen when Magma compaction visited a TTL'd document that had already been deleted.	A document-not-found result no longer increments the number of read failures.

Attachments

Issue Links

is triggered by: MB-53898 When compaction performs expiry of documents it can timeout front end reads (Closed)

Gerrit Reviews

For Gerrit Dashboard: MB-57609
#	Subject	Branch	Project	Status	CR	V
193173,3	MB-57609: Do not increment numGetFailure for document not found	7.1.x	kv_engine	MERGED	+2	+1
193523,1	[BP] MB-57609: Do not increment numGetFailure for document not found	7.1.4	kv_engine	ABANDONED	+2	-1
193525,1	MB-57609: Merge remote-tracking branch 'couchbase/7.1.x' into neo	neo	kv_engine	MERGED	+2	+1
196473,1	Merge commit '85aaebe5a' into 'couchbase/master'	master	kv_engine	MERGED	+2	+1

Activity

People

Assignee: Ashwin Govindarajulu

Reporter: Vesko Karaganev

Votes: 0

Watchers: 8

Dates

Due: 28/Jun/23

Created: 26/Jun/23 4:11 AM

Updated: 18/Sep/23 7:03 AM

Resolved: 04/Aug/23 9:00 AM

Spurious auto-failover possible if Magma compaction visits a TTL'd document which has already been deleted
