  1. Couchbase Server
  2. MB-29894

Improve the idea of "failover safeness" to better support multi-node failover (and associated UI error messages)

Details

    • Improvement
    • Resolution: Unresolved
    • Critical
    • backlog
    • 5.5.0
    • ns_server
    • None

    Description

      Currently we have a somewhat miscellaneous collection of UI error messages that are served to users when they are performing a hard failover of a single node. E.g. if replications are relatively up to date, the message is:

      Warning: Failing over the node will remove it from the cluster and activate a replica.
      Operations currently in flight and not yet replicated, will be lost. Rebalancing will be
      required to add the node back into the cluster. Consider using "Remove" and
      rebalancing instead of Failover, to avoid any loss of data.

      If the replications are behind (or are missing because a node is down, probably the node to be failed over), we show the following message:

      Attention: A significant amount of data stored on this node
      does not yet have replica (backup) copies! Failing over the node now will
      irrecoverably lose that data when the incomplete replica is
      activated and this node is removed from the cluster. It is
      recommended to use "Remove" and rebalance to
      safely remove the node without any data loss.

      There is, in addition, a different set of warnings for nodes that don't include the data service, which is shown when such a node is selected to be failed over.

      None of these error messages is shown in the multi-node failover dialog. Given its prominent placement in the UI, and the fact that we want people to use it when failing over multiple nodes, it should have a better set of warning and error messages.

      Getting good error messages is complicated by the fact that the nodeStatuses REST API, which the UI uses to get the "failover safeness" information for each node, returns information that assumes just one node is failed over.
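
      To illustrate why per-node safeness does not compose, here is a minimal sketch. It assumes a simplified, hypothetical vBucket map (a list of replication chains, active copy first) and treats a failover as "safe" when every vBucket keeps at least one copy on a surviving node. The names vbucket_map and is_failover_safe are illustrative only and are not part of ns_server; replication lag is ignored.

```python
from itertools import combinations

# Hypothetical 3-node cluster with one replica per vBucket.
# Each inner list is a replication chain: [active, replica].
vbucket_map = [
    ["n1", "n2"],
    ["n2", "n3"],
    ["n3", "n1"],
]


def is_failover_safe(vbucket_map, failed_nodes):
    """Return True if every vBucket still has a copy on a surviving node.

    Simplification: a copy counts as usable whenever its node survives;
    real safeness would also depend on how far replication is behind.
    """
    failed = set(failed_nodes)
    return all(any(node not in failed for node in chain) for chain in vbucket_map)


nodes = ["n1", "n2", "n3"]

# Safeness evaluated one node at a time, as a single-node view would report it.
single = {n: is_failover_safe(vbucket_map, [n]) for n in nodes}

# Safeness evaluated for each pair of nodes failed over together.
pairs = {p: is_failover_safe(vbucket_map, p) for p in combinations(nodes, 2)}
```

      In this toy map every node is individually safe to fail over, yet every pair of nodes is unsafe, because some vBucket has both of its copies on that pair. A per-node answer cannot express this.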

      We need to redesign the protocol between the UI and the server. Perhaps we should add a checkSafety=true query parameter to the controller/failOver REST API and have it return safety information to the client.
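
      As a rough sketch of what the client side of that proposal could look like: the snippet below only builds the request and does not send it. The checkSafety=true parameter is the proposal from this ticket and does not exist today; passing one otpNode form field per node, the host and port, and the helper name build_safety_check_request are assumptions made for illustration. The shape of the response is deliberately left unspecified.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_safety_check_request(base_url, otp_nodes):
    """Build (but do not send) a POST to controller/failOver with checkSafety=true.

    Assumptions: checkSafety is the proposed, not yet existing, query
    parameter, and each node to fail over is passed as an otpNode form field.
    """
    query = urlencode({"checkSafety": "true"})
    body = urlencode([("otpNode", node) for node in otp_nodes]).encode("ascii")
    return Request(f"{base_url}/controller/failOver?{query}", data=body, method="POST")


# Hypothetical cluster address and node names.
req = build_safety_check_request(
    "http://127.0.0.1:8091",
    ["ns_1@10.0.0.1", "ns_1@10.0.0.2"],
)
```

      Because the full set of nodes is sent in one request, the server could evaluate safeness for the combination rather than for each node in isolation.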

          People

            dfinlay Dave Finlay
            dfinlay Dave Finlay