Details

Type: Improvement
Resolution: Unresolved
Priority: Major
Fix Version/s: Morpheus
Affects Version/s: master
Component/s: couchbase-bucket, ns_server, storage-engine
Labels:
- need-doc
- pm-7.6.x

Sprint: March-June 24

Description

In 5.5.x we added the ability to auto-failover based on disk write failures, which is useful when an underlying drive breaks.

However, there are certain disk failure modes in which no errors are propagated to the application performing the write.
Instead, the write never completes and appears to hang indefinitely.

On the Couchbase Server side, this results in the disk write queue filling up and disk fetches potentially not completing, leaving the node in an unhealthy state overall.

It would be useful if auto-failover could be augmented to react in this case too.

Unfortunately, I understand that there are no really good metrics for what 'slow' means, or what constitutes a 'high' disk write queue for a given workload.
Perhaps some kind of timeout could be added to couchstore writes within KV, even if it is very high (a few minutes), after which KV reports a write failure.
This would allow the existing failover mechanism to 'just work', although obviously there are other reasons we may choose not to do this.
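
To make the timeout idea concrete, here is a minimal sketch in Python of a watchdog that tracks in-flight writes and flags any that have been outstanding longer than a threshold. It is illustrative only: `WriteWatchdog`, its methods, and the 180-second threshold are hypothetical and do not correspond to any existing couchstore or KV interface. A write that is truly hung cannot be cancelled from the calling side, so the sketch only observes and reports. The caller would then treat a stalled write as a write failure.

```python
import threading
import time
from typing import Callable, Dict, List


class WriteWatchdog:
    """Tracks in-flight disk writes and flags those that exceed a deadline.

    Hypothetical helper for illustration; not part of couchstore or KV.
    """

    def __init__(self, timeout_s: float,
                 clock: Callable[[], float] = time.monotonic):
        self._timeout_s = timeout_s
        self._clock = clock  # injectable so the logic can be tested
        self._lock = threading.Lock()
        self._in_flight: Dict[int, float] = {}  # write id -> start time
        self._next_id = 0

    def begin(self) -> int:
        """Record the start of a write and return its id."""
        with self._lock:
            write_id = self._next_id
            self._next_id += 1
            self._in_flight[write_id] = self._clock()
            return write_id

    def end(self, write_id: int) -> None:
        """Record that a write returned, whether it succeeded or failed."""
        with self._lock:
            self._in_flight.pop(write_id, None)

    def stalled(self) -> List[int]:
        """Return ids of writes in flight for longer than the timeout."""
        now = self._clock()
        with self._lock:
            return sorted(
                wid for wid, started in self._in_flight.items()
                if now - started > self._timeout_s
            )
```

A writer thread would call `begin()` before issuing the write and `end()` when it returns, while a separate monitor thread polls `stalled()` periodically and reports a write failure for any id it returns. Setting the threshold to a few minutes, as suggested above, avoids having to define what 'slow' means for a given workload.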

Attachments

Issue Links

is parent task of

DOC-12041 Doc for auto-failover for slow/hanging disks

Open

DOC-12073 Doc: Support auto-failover for exceptionally slow/hanging disks

Open

relates to

MB-51064 Detect & Alert users on non-homogenous-disk-performance across nodes in the cluster

Open

blocks: AV-77005

links to

PRD

Activity

People

Assignee: Vesko Karaganev

Reporter: Matt Carabine (Inactive)

Votes: 2

Watchers: 21

Dates

Created: 13/May/19 9:33 AM

Updated: 17/Apr/24 4:31 PM

Gerrit Reviews

There are no open Gerrit changes

Support Auto-failover for exceptionally slow/hanging disks
