Couchbase Server / MB-58264

Addition of disk read/write failure timeout to auto-failover timeout is unintuitive

Details

    • Improvement
    • Resolution: Unresolved
    • Major
    • Morpheus
    • 6.6.3, 7.1.4, 7.2.0
    • ns_server
    • None

    Description

      From a user's perspective, it is not straightforward to estimate how long it should take from the point of a disk failure until an auto-failover is triggered, assuming auto-failover on disk issues is enabled in the auto-failover configuration. This is because the disk read/write auto-failover timeout:

      a) is disjoint from the "normal" auto-failover timeout
      b) has a 60% "disk issue threshold", which effectively means that only 60% of the configured timeout value is used. This threshold is configurable only via diag/eval

      Point (b) is explained in the documentation: https://docs.couchbase.com/server/current/learn/clusters-and-availability/automatic-failover.html#configuring-auto-failover. Point (a) is ambiguous at best.

      A couple of worked examples:

      1) auto-failover timeout = 5s, disk auto-failover timeout = 10s => ~11s overall timeout. This can be observed in the showfast test linked below, which takes 14-15s. The additional context is that it currently takes ~3s to pass statuses from one failing monitor up to the auto-failover module in the first instance (plus the time for the failure to be observed in memcached).
      http://showfast.sc.couchbase.com/#/timeline/Linux/reb/failover/all#reb_failover_100M_dgm_kv_disk_hestia
      2) auto-failover timeout = 1s, disk auto-failover timeout = 5s => ~4s overall timeout
      3) auto-failover timeout = 60s, disk auto-failover timeout = 5s => ~63s overall timeout
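The arithmetic behind the three examples above can be sketched as follows. This is a minimal model inferred only from the numbers in this ticket (overall timeout ≈ auto-failover timeout + 60% of the disk timeout); it is not taken from the ns_server code, it ignores the ~3s status propagation delay mentioned in example 1, and the function and parameter names are hypothetical.

```python
def current_disk_failover_estimate_s(auto_failover_timeout_s,
                                     disk_timeout_s,
                                     disk_issue_threshold_pct=60):
    """Rough estimate of time from disk failure to auto-failover.

    Assumption (inferred from the ticket's examples, not from ns_server):
    the disk issue must persist for threshold% of the disk timeout, and
    the normal auto-failover timeout is then added on top.
    """
    disk_part_s = disk_timeout_s * disk_issue_threshold_pct / 100
    return auto_failover_timeout_s + disk_part_s


# (auto-failover timeout, disk auto-failover timeout) from the ticket
examples = [(5, 10), (1, 5), (60, 5)]

for auto_s, disk_s in examples:
    total = current_disk_failover_estimate_s(auto_s, disk_s)
    print(f"auto={auto_s}s disk={disk_s}s => ~{total:g}s overall")
```

Under these assumptions the three configurations give roughly 11s, 4s and 63s, matching the figures quoted above.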

      It would perhaps be preferable if the main auto-failover timeout took an estimated failure time into consideration, such that the auto-failover timeout could effectively be ignored. Our examples would then be as follows:

      1) auto-failover timeout = 5s, disk auto-failover timeout = 10s => 60% of 10s = 6s
      2) auto-failover timeout = 1s, disk auto-failover timeout = 5s => 60% of 5s = 3s
      3) auto-failover timeout = 60s, disk auto-failover timeout = 5s => 60% of 5s = 3s
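A minimal sketch of the proposed behaviour, under the assumption that the overall time depends only on the disk timeout and the 60% threshold. The names are hypothetical and this is an illustration of the arithmetic in the list above, not an implementation proposal for ns_server.

```python
def proposed_disk_failover_estimate_s(auto_failover_timeout_s,
                                      disk_timeout_s,
                                      disk_issue_threshold_pct=60):
    """Proposed estimate: only threshold% of the disk timeout counts.

    auto_failover_timeout_s is accepted but deliberately unused, to make
    explicit that under the proposal it no longer affects the result.
    """
    return disk_timeout_s * disk_issue_threshold_pct / 100


# Same three configurations as in the ticket
for auto_s, disk_s in [(5, 10), (1, 5), (60, 5)]:
    total = proposed_disk_failover_estimate_s(auto_s, disk_s)
    print(f"auto={auto_s}s disk={disk_s}s => {total:g}s")
```

Note that examples 2 and 3 give the same result here, since they differ only in the auto-failover timeout.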

      This also relates to MB-48412. Such a solution would likely solve the asynchronicity problem in the health monitors in a relatively simple way.

            People

              ben.huddleston Ben Huddleston