Spurious auto-failover possible if Magma compaction visits a TTL'd document which has already been deleted

Description

Note: Only affects:

7.1.x - 7.1.4-MP2 and upwards
7.2.0
(There's no "7.1.4-MP2" version in Jira, only 7.1.4, so cannot express this directly.)

The kv_ep_data_read_failed stat is used by ns_server to detect sustained disk errors and fail over the node.

We increment the stat when a read fails but also when we ask the storage engine for a document which does not exist.

https://github.com/couchbase/kv_engine/blob/5e46b22da75d32691cfe5ab3c4dd8a6bafed3184/engines/ep/src/kvstore/magma-kvstore/magma-kvstore.cc#LL956C2-L956C2

We only do that for single document reads, and not for batched reads (queued BGFetches).
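The accounting described above can be modelled roughly as follows. This is a minimal Python sketch, not the kv_engine code (which is C++). The `Status` enum, the `ToyKVStore` class and its method names are hypothetical. The counter is named after `numGetFailure` from the fix commit and is assumed here to be what backs `kv_ep_data_read_failed`. The batched path is modelled as never touching the counter, which is a simplification.

```python
from enum import Enum, auto


class Status(Enum):
    SUCCESS = auto()
    NO_SUCH_KEY = auto()   # document does not exist
    IO_ERROR = auto()      # genuine read failure


class ToyKVStore:
    """Hypothetical stand-in for a storage-engine wrapper (pre-fix behaviour)."""

    def __init__(self, docs, broken_keys=()):
        self.docs = dict(docs)
        self.broken_keys = set(broken_keys)  # keys whose reads hit an IO error
        self.num_get_failure = 0             # models numGetFailure

    def _read(self, key):
        if key in self.broken_keys:
            return Status.IO_ERROR, None
        if key not in self.docs:
            return Status.NO_SUCH_KEY, None
        return Status.SUCCESS, self.docs[key]

    def get(self, key):
        """Single-document read: any non-success status bumps the counter."""
        status, value = self._read(key)
        if status is not Status.SUCCESS:
            self.num_get_failure += 1
        return status, value

    def get_multi(self, keys):
        """Batched read (queued BGFetches): the counter is left untouched."""
        return {key: self._read(key) for key in keys}


store = ToyKVStore({"a": b"1"})
store.get_multi(["a", "missing"])
batched_failures = store.num_get_failure   # batched miss is not counted
store.get("missing")
single_failures = store.num_get_failure    # single miss is counted, with no IO error
```

In this model a missing key read through `get` is indistinguishable, from the counter's point of view, from a disk error.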

While this behaviour has existed since 7.0.0, before 7.1.5/7.2.0 BGFetches for expired documents were queued and used a different code path, so even if the document was not found, we would not increment the stat.

However, with compaction_expiry_fetch_inline=true (the default), we now use the single-document read code path, which does report no_such_key as a read failure.

Under Magma specifically, compaction can see a version of a document in one SST which has already been expired and had its tombstone purged away, because the document's tombstone was seen in another SST. The inline fetch for that document then results in no_such_key.
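The effect of this state on the stat can be illustrated with a toy model. This sketch deliberately does not model how Magma arrives at the state, only the outcome: the compactor still visits stale versions that a point lookup can no longer resolve. All names and values are hypothetical apart from `compaction_expiry_fetch_inline`, `kv_ep_data_read_failed` and `no_such_key`, which come from this ticket.

```python
NOW = 1_000  # hypothetical current time, in seconds

# Entries the compactor visits in one SST: key -> expiry time.
# All three versions are stale: their TTL has passed.
sst_under_compaction = {"doc1": 900, "doc2": 950, "doc3": 990}

# What a point lookup can still resolve. doc1 and doc2 were already expired and
# their tombstones purged (seen via another SST), so they are simply absent.
readable_docs = {"doc3": b"value"}

read_failures = 0  # models the counter behind kv_ep_data_read_failed (pre-fix)


def fetch_inline(key):
    """Single-document read used when compaction_expiry_fetch_inline=true."""
    global read_failures
    if key not in readable_docs:
        read_failures += 1        # pre-fix: no_such_key counted as a failure
        return None
    return readable_docs[key]


expired_now = []
for key, expiry in sst_under_compaction.items():
    if expiry <= NOW:
        if fetch_inline(key) is not None:
            expired_now.append(key)   # only still-readable documents get expired
```

Here two of the three visits raise the failure counter even though every read completed without an IO error, which is the pattern that could be mistaken for sustained disk errors.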

Given the stat is used to trigger failover, we want to only increment it when an actual IO failure occurs.
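The intended rule can be sketched as a small classification step. This is an illustrative Python sketch of the idea in the fix commit title ("Do not increment numGetFailure for document not found"), not the actual change. `ReadStatus` and `counts_as_read_failure` are hypothetical names.

```python
from enum import Enum, auto


class ReadStatus(Enum):
    SUCCESS = auto()
    NO_SUCH_KEY = auto()
    IO_ERROR = auto()


def counts_as_read_failure(status):
    """Post-fix rule: only genuine failures count; not-found does not."""
    return status not in (ReadStatus.SUCCESS, ReadStatus.NO_SUCH_KEY)


# Two misses and one real error: only the real error should be counted.
statuses = [ReadStatus.NO_SUCH_KEY, ReadStatus.NO_SUCH_KEY, ReadStatus.IO_ERROR]
num_get_failure = sum(counts_as_read_failure(s) for s in statuses)
```

With this rule, a compaction run that hits many already-deleted documents leaves the counter unchanged, while an actual disk error still contributes to failover detection.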

This issue has only been observed when using the Magma storage engine.

Issue	Resolution
A spurious auto-failover could happen when Magma compaction visited a TTL’d document that was already deleted.	A document-not-found result no longer increments the number of read failures.

Labels

Environment

None

Link to Log File, atop/blg, CBCollectInfo, Core dump

None

Release Notes Description

None

Linked issues

is triggered by

MB-53898

When compaction performs expiry of documents it can timeout front end reads

Activity

CB robot September 2, 2023 at 8:30 AM

Build capella-analytics-1.0.0-1008 contains kv_engine commit 6bcf6a9 with commit message:
: Do not increment numGetFailure for document not found

CB robot September 2, 2023 at 8:30 AM

Build capella-analytics-1.0.0-1008 contains kv_engine commit cff1ebb with commit message:
: Merge remote-tracking branch 'couchbase/7.1.x' into neo

CB robot September 2, 2023 at 8:30 AM

Build capella-analytics-1.0.0-1008 contains kv_engine commit 85aaebe with commit message:
Merge "MB-57609: Merge remote-tracking branch 'couchbase/7.1.x' into neo" into neo

CB robot September 1, 2023 at 1:29 PM

Build couchbase-server-8.0.0-1392 contains kv_engine commit 6bcf6a9 with commit message:
: Do not increment numGetFailure for document not found

CB robot September 1, 2023 at 1:29 PM

Build couchbase-server-8.0.0-1392 contains kv_engine commit cff1ebb with commit message:
: Merge remote-tracking branch 'couchbase/7.1.x' into neo

Fixed

Details
Assignee: Ashwin Govindarajulu
Reporter: Vesko Karaganev
Is this a Regression? Yes
Triage: Untriaged
Issue Impact: external
Due date: Jun 28, 2023
Story Points: 0
Priority: Critical
Created June 26, 2023 at 11:11 AM

Updated March 21, 2025 at 2:50 AM

Resolved August 4, 2023 at 4:00 PM
