Details
- Improvement
- Resolution: Unresolved
- Major
- master
- March-June 24
Description
In 5.5.x we added the ability to auto-failover based on disk write failures, which is useful when an underlying drive breaks.
However, there are certain disk failure modes in which no errors are propagated to the application performing the write.
Instead, the write never completes and appears to hang indefinitely.
On the Couchbase Server side, the disk write queue fills up, disk fetches may never complete, and the node is left in an unhealthy state overall.
It would be useful if auto-failover could be extended to react to this case as well.
Unfortunately, I understand that there are no really good metrics for what 'slow' means, or for what constitutes a 'high' disk write queue for a given workload.
Perhaps some kind of timeout could be added to couchstore writes within KV, even if it is very high (a few minutes), after which KV reports a write failure.
This would allow the existing failover mechanism to 'just work', although there are obviously other reasons we might choose not to do this.
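To make the timeout idea concrete, here is a minimal sketch in Python of turning a hung write into a reported failure. It is illustrative only: it does not use the couchstore API and is not how KV is implemented, and the names `write_with_timeout` and `WriteTimeoutError` are hypothetical. The write runs in a worker thread, and if it has not returned before the deadline the caller raises an error that a failure counter could treat like any other disk write failure.

```python
import threading


class WriteTimeoutError(IOError):
    """Hypothetical error type: the write did not finish before the deadline."""


def write_with_timeout(write_fn, timeout_s):
    """Run write_fn in a worker thread and wait at most timeout_s seconds.

    Returns write_fn's result, re-raises any exception it raised, or raises
    WriteTimeoutError if it has not returned in time. The hung write is not
    cancelled; it is only abandoned and reported as a failure.
    """
    result = {}
    done = threading.Event()

    def worker():
        try:
            result["value"] = write_fn()
        except Exception as exc:  # an ordinary write error, passed to the caller
            result["error"] = exc
        finally:
            done.set()

    # Daemon thread, so a write that never returns cannot block process exit.
    threading.Thread(target=worker, daemon=True).start()

    if not done.wait(timeout_s):
        raise WriteTimeoutError(f"write did not complete within {timeout_s}s")
    if "error" in result:
        raise result["error"]
    return result["value"]
```

A caller would wrap each flush in `write_with_timeout(flush_fn, timeout_s)` with a deliberately generous timeout (the ticket suggests a few minutes) and count `WriteTimeoutError` alongside ordinary write errors. The sketch leaves open what to do with the still-hung write itself, which is one of the reasons this approach may not be suitable as-is.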
Attachments
Issue Links
- is parent task of
  - DOC-12041 Doc for auto-failover for slow/hanging disks (Open)
  - DOC-12073 Doc: Support auto-failover for exceptionally slow/hanging disks (Open)
- relates to
  - MB-51064 Detect & Alert users on non-homogenous-disk-performance across nodes in the cluster (Open)
- blocks
  - AV-77005
- links to