Couchbase Server / MB-34155

Support Auto-failover for exceptionally slow/hanging disks

Details

    • March-June 24

    Description

      In 5.5.x we added the ability to auto-failover based on disk write failures, which is useful when an underlying drive breaks.

      However, there are certain disk failure modes in which no errors are propagated to the application performing the write.
      Instead, the write never completes and appears to hang indefinitely.

      On the Couchbase Server side, this results in the disk write queue filling up and disk fetches potentially not completing, leaving the node in an unhealthy state overall.

      It would be useful if auto-failover could be augmented to react to this case as well.

      Unfortunately, I understand that there are no really good metrics for what 'slow' means, or for what constitutes a 'high' disk write queue for a given workload.
      Perhaps some kind of timeout could be added to couchstore writes within KV, even if it is very high (a few minutes), after which KV reports a write failure.
      This would allow the existing failover mechanism to 'just work', although obviously there may be other reasons we choose not to do this.
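
      To illustrate the timeout idea, here is a minimal sketch in Python. It is not couchstore or KV-engine code: write_with_timeout, WriteTimeoutError and the write_fn callback are hypothetical names made up for this example. It only shows the general shape of turning a hung write into a reported write failure once a deadline passes.

```python
import threading
import time


class WriteTimeoutError(OSError):
    """Hypothetical error: the write did not finish before its deadline."""


def write_with_timeout(write_fn, timeout_s):
    """Run write_fn() on a worker thread and report a failure if it hangs.

    write_fn stands in for a blocking disk write (an assumption for this
    sketch). The worker is a daemon thread, so a write that never returns
    does not stop the process from exiting.
    """
    outcome = {}

    def worker():
        try:
            outcome["value"] = write_fn()
        except Exception as exc:  # propagate ordinary write errors as-is
            outcome["error"] = exc

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout_s)

    if thread.is_alive():
        # The write is still blocked: surface it as a write failure.
        raise WriteTimeoutError(
            f"write did not complete within {timeout_s} seconds"
        )
    if "error" in outcome:
        raise outcome["error"]
    return outcome["value"]


if __name__ == "__main__":
    # A fast write succeeds and returns its result.
    print(write_with_timeout(lambda: "ok", timeout_s=1.0))

    # A slow write (simulated with sleep) is reported as a failure.
    try:
        write_with_timeout(lambda: time.sleep(0.5), timeout_s=0.05)
    except WriteTimeoutError as exc:
        print("write failure reported:", exc)
```

      Note that the hung write itself is not cancelled here; the worker stays blocked and only the caller is unblocked. A real implementation would also have to decide what to do with the stuck I/O, which may be one of the reasons not to take this approach.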

            People

              vesko.karaganev Vesko Karaganev
              matt.carabine Matt Carabine (Inactive)
              Votes: 2
              Watchers: 21