Details
- Improvement
- Resolution: Unresolved
- Major
- 6.6.3, 7.1.4, 7.2.0
- None
Description
From a user's perspective, it is not straightforward to estimate how long it should take from the point of a disk failure until an auto-failover is triggered, when failover on disk issues is enabled in the auto-failover configuration. This is because the disk read/write auto-failover timeout:
a) is disjoint from the "normal" auto-failover timeout
b) has a 60% "disk issue threshold", which means that effectively only 60% of the configured timeout value is used. This threshold is configurable only via diag/eval
Point b is explained in the documentation - https://docs.couchbase.com/server/current/learn/clusters-and-availability/automatic-failover.html#configuring-auto-failover. Point a is ambiguous at best.
A couple of worked examples:
1) auto-failover timeout = 5s, disk auto-failover timeout = 10s => ~11s overall timeout. This can be observed in the following showfast test, which takes 14-15s: on top of the ~11s, it currently takes ~3s to pass statuses from a failing monitor up to the auto-failover module (plus the time for the failure to be observed in memcached).
http://showfast.sc.couchbase.com/#/timeline/Linux/reb/failover/all#reb_failover_100M_dgm_kv_disk_hestia
2) auto-failover timeout = 1s, disk auto-failover timeout = 5s => ~4s overall timeout
3) auto-failover timeout = 60s, disk auto-failover timeout = 5s => ~63s overall timeout
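The arithmetic behind the three examples above can be sketched as follows. This is a rough model inferred from the examples only, not the actual ns_server logic: the function name and parameters are hypothetical, and the ~3s status propagation delay seen in the showfast test is deliberately left out.

```python
def estimated_disk_failover_seconds(failover_timeout_s, disk_timeout_s,
                                    disk_threshold_percent=60):
    """Estimate the time from disk failure to auto-failover (hypothetical model).

    Assumes the two timeouts are additive and that only the threshold share
    of the disk timeout has to elapse before the disk issue is reported.
    """
    disk_detection_s = disk_timeout_s * disk_threshold_percent / 100
    return failover_timeout_s + disk_detection_s


# The three worked examples: (auto-failover timeout, disk auto-failover timeout)
for failover_s, disk_s in [(5, 10), (1, 5), (60, 5)]:
    total = estimated_disk_failover_seconds(failover_s, disk_s)
    print(f"failover={failover_s}s disk={disk_s}s -> ~{total:g}s")
```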
It would perhaps be preferable if the main auto-failover timeout took an estimated failure time into consideration, such that the auto-failover timeout could effectively be ignored. Our examples would then be as follows:
1) auto-failover timeout = 5s, disk auto-failover timeout = 10s => 60% of 10s = 6s
2) auto-failover timeout = 1s, disk auto-failover timeout = 5s => 60% of 5s = 3s
3) auto-failover timeout = 60s, disk auto-failover timeout = 5s => 60% of 5s = 3s
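Under the proposed behaviour, the estimate would depend only on the disk timeout and its threshold. A minimal sketch of that calculation, with a hypothetical function name and parameters (this is not an existing Couchbase API), is:

```python
def proposed_disk_failover_seconds(failover_timeout_s, disk_timeout_s,
                                   disk_threshold_percent=60):
    """Proposed estimate (hypothetical): the main timeout no longer contributes.

    failover_timeout_s is accepted only to make explicit that it is ignored
    for disk failures under this proposal.
    """
    return disk_timeout_s * disk_threshold_percent / 100


# Same three configurations as in the examples above.
for failover_s, disk_s in [(5, 10), (1, 5), (60, 5)]:
    total = proposed_disk_failover_seconds(failover_s, disk_s)
    print(f"failover={failover_s}s disk={disk_s}s -> {total:g}s")
```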
This also relates to MB-48412. Such a solution would likely solve the asynchronicity problem in the health monitors in a relatively simple way.
Issue Links
- relates to MB-48412: shorten the time between observing the missing heartbeat to the autofailover (Open)