Couchbase Server / MB-48412

Shorten the time between observing a missing heartbeat and auto-failover

Details

    Description

      As seen in CBSE-10704

      We promise a 5-second auto-failover timeout, but it takes 9 seconds for a shut-down node to be failed over.

      A node is considered down if its latest heartbeat was received at least 2 seconds ago. So if the node went down immediately after sending a heartbeat, there is a 2-second lag before it is detected as down.

      Then, unfortunately, we have a series of internal monitors that each refresh their status once a second by requesting status from other monitors, which also refresh once a second. An unfortunate alignment of these refreshes can account for another 2 seconds of delay.
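The effect of chained once-a-second monitors can be illustrated with a small sketch. It is not taken from ns_server: the function names, the two-layer chain, and the rule that a monitor only sees an upstream update on its next tick strictly after that update are all assumptions made for illustration.

```python
import math


def next_tick(after, phase, period=1.0):
    """Earliest tick strictly later than `after` for a monitor that
    ticks at phase, phase + period, phase + 2 * period, ..."""
    k = math.floor((after - phase) / period) + 1
    return phase + k * period


def propagation_delay(event_time, phases, period=1.0):
    """Time for a status change to travel through a chain of monitors.

    Each monitor (hypothetically) picks up the change on its first tick
    strictly after the previous monitor in the chain picked it up.
    """
    t = event_time
    for phase in phases:
        t = next_tick(t, phase, period)
    return t - event_time


# Two monitors ticking in lockstep: each one just misses the update
# and has to wait a full period.
aligned = propagation_delay(0.0, [0.0, 0.0])

# Two monitors whose ticks happen to be staggered favourably.
staggered = propagation_delay(0.0, [0.5, 0.75])
```

Under these assumptions the lockstep case costs a full second per layer, which matches the roughly 2 seconds of extra delay described above, while a favourable alignment costs well under a second in total.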

      Only after the top-level monitor discovers that the node is down do we begin the countdown for the configured auto-failover timeout.
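The 9 seconds observed can be reconstructed as a simple worst-case budget. The sketch below only restates the arithmetic from the description; the parameter names and the assumption of exactly two monitor layers are hypothetical and chosen so that the propagation lag equals the 2 seconds mentioned above.

```python
def worst_case_failover_latency(configured_timeout_s=5.0,
                                heartbeat_stale_after_s=2.0,
                                monitor_refresh_s=1.0,
                                monitor_layers=2):
    """Worst-case seconds from node death to auto-failover.

    All defaults come from the figures in the description, except
    `monitor_layers`, which is an assumed value.
    """
    # The node may die right after a heartbeat, so it takes the full
    # staleness threshold before it can be noticed at all.
    detection_lag = heartbeat_stale_after_s
    # Each monitor layer may add up to one refresh period.
    propagation_lag = monitor_refresh_s * monitor_layers
    # The configured countdown only starts after both lags.
    return detection_lag + propagation_lag + configured_timeout_s


total = worst_case_failover_latency()
```

With the defaults, the total is 2 + 2 + 5 seconds, which is the gap between the promised 5 seconds and the 9 seconds seen in practice.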

      What can be improved:
      1. Reduce asynchronicity in the health monitors: gather all the information in one tick instead of running independent ticks on each monitor.
      2. Account for the 2 seconds that have already passed since the last heartbeat when we start the countdown to auto-failover.
      3. Consider using ns_node_disco info as a backup way to detect that another node is down.
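One possible reading of item 2 is to anchor the countdown at the last heartbeat rather than at the moment the top-level monitor notices the failure. The sketch below is an illustration only: the function name and the use of plain float timestamps are assumptions, and it credits all time elapsed since the last heartbeat, which is slightly more than the fixed 2 seconds the item mentions.

```python
def remaining_countdown(now, last_heartbeat_ts, configured_timeout_s=5.0):
    """Seconds left before failover, counting from the last heartbeat.

    Time that has already passed since the last heartbeat is subtracted
    from the configured timeout; the result never goes below zero.
    """
    elapsed = now - last_heartbeat_ts
    return max(0.0, configured_timeout_s - elapsed)


# Last heartbeat at t=100, failure noticed by the top-level monitor at t=104.
left_at_detection = remaining_countdown(now=104.0, last_heartbeat_ts=100.0)

# If detection is very late, failover is due immediately.
left_when_late = remaining_countdown(now=110.0, last_heartbeat_ts=100.0)
```

In this example, failover would be due 1 second after detection (at t=105) instead of 5 seconds after detection (at t=109). Whether crediting the full elapsed time is safe depends on how much the heartbeat timestamps can be trusted, which the description does not address.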

            People

              ben.huddleston Ben Huddleston
              artem Artem Stemkovski
                Gerrit Reviews

                  There are 2 open Gerrit changes