Couchbase Server / MB-20581

Auto-failover based on failure to write to disk

Details

    Description

      A customer has requested that we consider including disk write issues in the auto-failover process. Possible triggers include a number of commit failures, a period of time without a successful commit, or a disk write queue (DWQ) that has been growing for too long.

          Activity

            Patrick Varley (pvarley) added a comment (edited):

            Poonam Dhavale, while doing the couchbase-cli changes for this feature, I noticed that failoverOnDataDiskIssues[enabled] and failoverOnDataDiskIssues[timePeriod] do not behave like the original auto-failover options. With the original options, when auto-failover is disabled, the timeout value that has been set is retained.

            With the new options, when auto-failover is disabled they reset to their default values. This will cause confusion in the field, as there are many times when a user or Support chooses to disable auto-failover. When they come to re-enable auto-failover, they are forced to remember or work out what the settings were.

            I have reopened this ticket to review this behaviour. Maybe it is a good idea to default failoverOnDataDiskIssues[enabled] to false, but I think it would be helpful to retain the failoverOnDataDiskIssues[timePeriod] value.
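For readers unfamiliar with the bracketed option names, here is a minimal sketch of how the two settings might be carried in a form-encoded body. The names failoverOnDataDiskIssues[enabled] and failoverOnDataDiskIssues[timePeriod] come from the comment above; the `enabled` and `timeout` field names, the function name, and the choice of form encoding are assumptions made for illustration, not a confirmed description of the server API.

```python
from urllib.parse import urlencode, parse_qs


def build_auto_failover_form(enabled, timeout_s, disk_enabled, disk_time_period_s):
    """Build a form-encoded body for the settings discussed in this ticket.

    Hypothetical helper: only the two failoverOnDataDiskIssues[...] keys are
    taken from the ticket; "enabled" and "timeout" are assumed names.
    """
    fields = {
        "enabled": str(enabled).lower(),            # assumed field name
        "timeout": timeout_s,                       # assumed field name
        "failoverOnDataDiskIssues[enabled]": str(disk_enabled).lower(),
        "failoverOnDataDiskIssues[timePeriod]": disk_time_period_s,
    }
    # urlencode percent-encodes the square brackets in the key names.
    return urlencode(fields)


body = build_auto_failover_form(True, 120, True, 10)
decoded = parse_qs(body)  # decodes the keys back to their bracketed form
```

Decoding the body again with `parse_qs` is only there to show that the bracketed keys survive a round trip through percent-encoding.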

            Patrick Varley (pvarley) added a comment:

            While writing the manual pages for couchbase-cli covering auto-failover on Data Service disk issues, I was wondering what the behaviour is when failoverOnDataDiskIssues[timePeriod] is set to 10 seconds and the timeout is set to 120 seconds. When does the failover happen, and how? For example, if the disk failures only happened for the first 60 seconds, would the node be failed over there and then, or only when the 120-second threshold is met?

            Poonam Dhavale (poonam) added a comment:

            Hi Patrick,

            • Regarding when and how the failover happens:
              • I posted links to the design/functional docs for all 3 auto-failover features in the CLI ticket MB-26839.
              • The design doc has the necessary details, but the quick answer to your question is: auto-failover will occur if the disk failure persists for (auto-failover timeout) + (auto-failover on disk issue time period) + a 2-3 second grace period. In your example, since the disk failure did not persist for the duration of the auto-failover timeout, there will be no auto-failover.
            • Regarding retaining the setting for "auto-failover on disk issue time period": Yes, we can make the change to retain it across the "auto-failover on disk issue" enable/disable.
              • To make sure we are on the same page: say auto-failover is enabled with its timeout set to 10s, and auto-failover on disk issue is enabled with its time period set to 30s.
                • User disables "auto-failover on disk issue" and then later re-enables it. The time period will default to 30s.
                • User disables "auto-failover" and then later re-enables it. "Auto-failover on disk issue" will remain disabled. If the user then enables "auto-failover on disk issue", the time period will default to 30s.
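The timing rule in the quick answer above can be written down as a small check. This is only a sketch of the stated rule, not the server's implementation: the function name is hypothetical, and the grace period is fixed at 3 seconds here as an assumption (the comment gives it as 2-3 seconds).

```python
def should_auto_failover(failure_duration_s, timeout_s, disk_time_period_s, grace_s=3):
    """Return True if a disk failure has persisted long enough to trigger failover.

    Encodes the rule from the comment: the failure must persist for
    (auto-failover timeout) + (disk issue time period) + grace period.
    grace_s=3 is an assumed value within the stated 2-3 second range.
    """
    required_s = timeout_s + disk_time_period_s + grace_s
    return failure_duration_s >= required_s


# Patrick's example: timePeriod 10s, timeout 120s, failures for the first 60s only.
patrick_case = should_auto_failover(60, timeout_s=120, disk_time_period_s=10)

# A failure that persists past 120 + 10 + 3 = 133 seconds.
sustained_case = should_auto_failover(140, timeout_s=120, disk_time_period_s=10)
```

Under these assumptions `patrick_case` is False, matching the "no auto-failover" answer, and `sustained_case` is True.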
            Patrick Varley (pvarley) added a comment:

            Quoting Poonam Dhavale:

            • Regarding retaining the setting for "auto-failover on disk issue time period": Yes, we can make the change to retain it across the "auto-failover on disk issue" enable/disable.
              • To make sure we are on the same page: say auto-failover is enabled with its timeout set to 10s, and auto-failover on disk issue is enabled with its time period set to 30s.
                • User disables "auto-failover on disk issue" and then later re-enables it. The time period will default to 30s.
                • User disables "auto-failover" and then later re-enables it. "Auto-failover on disk issue" will remain disabled. If the user then enables "auto-failover on disk issue", the time period will default to 30s.

            That behaviour sounds good to me!

            Thank you for all the links.
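The retention behaviour agreed in this exchange can be modelled as a small state object. This is a sketch of the described semantics only; the class, field, and method names are hypothetical and do not correspond to any actual server code.

```python
from dataclasses import dataclass


@dataclass
class AutoFailoverSettings:
    """Toy model of the settings discussed above (all names are hypothetical)."""
    enabled: bool
    timeout_s: int
    disk_issues_enabled: bool
    disk_issues_time_period_s: int

    def disable_disk_issues(self):
        # Only the flag changes; the time period is retained.
        self.disk_issues_enabled = False

    def enable_disk_issues(self, time_period_s=None):
        self.disk_issues_enabled = True
        if time_period_s is not None:
            self.disk_issues_time_period_s = time_period_s

    def disable_auto_failover(self):
        # Disabling auto-failover also disables the disk-issue option,
        # but both the timeout and the time period are retained.
        self.enabled = False
        self.disk_issues_enabled = False

    def enable_auto_failover(self, timeout_s=None):
        self.enabled = True
        if timeout_s is not None:
            self.timeout_s = timeout_s
        # disk_issues_enabled stays False until explicitly re-enabled.


# Scenario from the comment: timeout 10s, disk-issue time period 30s.
s = AutoFailoverSettings(True, 10, True, 30)
s.disable_disk_issues()
s.enable_disk_issues()
after_toggle = s.disk_issues_time_period_s

s.disable_auto_failover()
s.enable_auto_failover()
disk_flag_after_reenable = s.disk_issues_enabled

s.enable_disk_issues()
after_full_cycle = s.disk_issues_time_period_s
```

In this model the time period stays at 30s through both scenarios, and the disk-issue option remains off after auto-failover is re-enabled until the user turns it back on.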
            Poonam Dhavale (poonam) added a comment (edited):

            All the code for this feature has been merged. Created MB-27848 for the item mentioned by Patrick.

            People

              poonam Poonam Dhavale
              malarky Chris Malarky
              Votes: 0
              Watchers: 15
