Details

Type: Improvement
Resolution: Unresolved
Priority: Major
Fix Version/s: Morpheus
Affects Version/s: 6.6.3, 7.1.4, 7.2.0
Component/s: ns_server
Labels:
None

Epic Link:
Failover Improvements
Story Points:
0

Description

From a user's perspective, it is not straightforward to estimate how long it should take from the point of a disk failure until an auto-failover is triggered, assuming auto-failover on disk issues is enabled in the auto-failover configuration. This is because the disk read/write auto-failover timeout:

a) is disjoint from the "normal" auto-failover timeout
b) has a 60% "disk issue threshold", which effectively uses 60% of the configured timeout value. This threshold is configurable only via diag/eval

Point b is explained in documentation - https://docs.couchbase.com/server/current/learn/clusters-and-availability/automatic-failover.html#configuring-auto-failover. Point a is ambiguous at best.

A couple of worked examples:

1) auto-failover timeout = 5s, disk auto-failover timeout = 10s => ~11s overall timeout. This can be observed in the showfast test below, which takes 14-15s. The additional context is that it currently takes ~3s to pass statuses from a failing monitor up to the auto-failover module in the first instance (plus the time to observe the failure in memcached).
http://showfast.sc.couchbase.com/#/timeline/Linux/reb/failover/all#reb_failover_100M_dgm_kv_disk_hestia
2) auto-failover timeout = 1s, disk auto-failover timeout = 5s => ~4s overall timeout
3) auto-failover timeout = 60s, disk auto-failover timeout = 5s => ~63s overall timeout
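
The three figures above are consistent with simply adding 60% of the disk timeout to the main auto-failover timeout. A minimal sketch of that arithmetic follows; the function and constant names are hypothetical and this is not ns_server code, only an illustration of how the estimates in this ticket appear to be derived. It excludes the ~3s status propagation delay and the time to observe the failure in memcached.

```python
# Hypothetical illustration only; names are not taken from ns_server.
DISK_ISSUE_THRESHOLD_PCT = 60  # the "disk issue threshold" described above


def estimated_current_failover_s(auto_failover_timeout_s, disk_timeout_s,
                                 threshold_pct=DISK_ISSUE_THRESHOLD_PCT):
    """Estimate seconds from disk failure to auto-failover today.

    Assumes the two timeouts are additive, which matches the worked
    examples in this ticket. Propagation overhead is not included.
    """
    disk_window_s = disk_timeout_s * threshold_pct / 100
    return auto_failover_timeout_s + disk_window_s


if __name__ == "__main__":
    for auto_s, disk_s in [(5, 10), (1, 5), (60, 5)]:
        total = estimated_current_failover_s(auto_s, disk_s)
        print(f"auto={auto_s}s disk={disk_s}s => ~{total:g}s")
```

Adding the ~3s propagation delay to the first case gives roughly 14s, in line with the 14-15s seen in the showfast test.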

It would perhaps be preferable if the main auto-failover timeout took an estimated failure time into consideration, such that the auto-failover timeout could be effectively ignored. Our examples would then be as follows:

1) auto-failover timeout = 5s, disk auto-failover timeout = 10s => 60% of 10s = 6s
2) auto-failover timeout = 1s, disk auto-failover timeout = 5s => 60% of 5s = 3s
3) auto-failover timeout = 60s, disk auto-failover timeout = 5s => 60% of 5s = 3s
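
The proposed behaviour can be sketched the same way. This is an assumption about how the suggestion might be expressed, not a design for ns_server; the function name and parameters are hypothetical. The main auto-failover timeout is accepted but deliberately unused, to show that it no longer contributes to the result.

```python
# Hypothetical illustration only; names are not taken from ns_server.
def proposed_failover_s(auto_failover_timeout_s, disk_timeout_s,
                        threshold_pct=60):
    """Estimate seconds from disk failure to auto-failover as proposed.

    auto_failover_timeout_s is intentionally ignored: under the proposal
    only the disk timeout and its 60% threshold determine the delay.
    """
    return disk_timeout_s * threshold_pct / 100


if __name__ == "__main__":
    for auto_s, disk_s in [(5, 10), (1, 5), (60, 5)]:
        total = proposed_failover_s(auto_s, disk_s)
        print(f"auto={auto_s}s disk={disk_s}s => {total:g}s")
```

With this model, examples 2 and 3 give the same result even though their main auto-failover timeouts differ by a factor of 60.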

This also relates to MB-48412. Such a solution would likely solve the asynchronicity problem in the health monitors in a relatively simple way.

Issue Links

relates to

MB-48412 shorten the time between observing the missing heartbeat to the autofailover

Open

People

Assignee: Ben Huddleston

Reporter: Ben Huddleston

Votes: 0

Watchers: 2

Dates

Created: 14/Aug/23 3:05 AM

Updated: 30/Aug/23 3:29 PM

Addition of disk read/write failure timeout to auto-failover timeout is unintuitive