Couchbase Server / MB-45433

UI should allow failing over inactive nodes when allowUnsafe is true


Details

    • Bug
    • Status: Closed
    • Critical
    • Resolution: Fixed
    • Cheshire-Cat
    • 7.0.0
    • UI
    • None
    • Untriaged
    • 1
    • Unknown

    Description

      After the changes in MB-45110, we should allow selection of inactive (either inactiveAdded or inactiveFailed) nodes for multi-node failover during unsafe failover.

      Steps to Reproduce:
      1. Create a 5 node cluster: n0 n1 n2 n3 n4
      2. Stop-server on n4 and when the node becomes unresponsive, fail it over, but don't rebalance it out yet.

      3. Now stop the server on nodes n2 and n3.

      Now it appears that we can't get the unresponsive nodes from steps 2 and 3 out of the cluster.
      From the UI we can't quorum failover n2 and n3 as well as n4. 

      In this case the UI should allow failover of node n4 (inactive node) along with nodes n2 and n3 (active nodes) when we get the message "Cannot safely perform a failover at the moment".
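
The requested behavior can be sketched as a filter over the node list. This is only an illustration, not the actual UI code: the `ClusterNode` shape, the `clusterMembership` field name, and the `selectableForFailover` helper are assumptions made for the example. It shows inactive nodes becoming selectable only when unsafe failover is allowed.

```typescript
// Hypothetical node shape; the field names are assumptions for illustration.
type Membership = "active" | "inactiveAdded" | "inactiveFailed";

interface ClusterNode {
  hostname: string;
  clusterMembership: Membership;
}

// Regular failover offers only active nodes. With allowUnsafe set, inactive
// nodes (inactiveAdded or inactiveFailed) are offered as well.
function selectableForFailover(
  nodes: ClusterNode[],
  allowUnsafe: boolean
): ClusterNode[] {
  return nodes.filter((n) => allowUnsafe || n.clusterMembership === "active");
}

// State after step 3 of the reproduction: n4 was already failed over,
// n2 and n3 are down but still active members of the cluster.
const cluster: ClusterNode[] = [
  { hostname: "n0", clusterMembership: "active" },
  { hostname: "n1", clusterMembership: "active" },
  { hostname: "n2", clusterMembership: "active" },
  { hostname: "n3", clusterMembership: "active" },
  { hostname: "n4", clusterMembership: "inactiveFailed" },
];

const regularChoices = selectableForFailover(cluster, false).map((n) => n.hostname);
const unsafeChoices = selectableForFailover(cluster, true).map((n) => n.hostname);
```

With this sketch, n4 is missing from `regularChoices` but present in `unsafeChoices`, which is what lets it be selected together with n2 and n3.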


          Activity

            rob.ashcom Rob Ashcom added a comment -

            Dave Finlay Can you define this a little better?


            Abhijeeth.Nuthan Abhijeeth Nuthan added a comment -

            Rob Ashcom: Updated the description.

            dfinlay Dave Finlay added a comment -

            Thanks Abhi - was just about to get to this item on the todo list.

            With the fix for MB-45110 we now allow users to unsafely failover an already (regularly) failed over node. Previously we didn't allow this - if you tried to unsafe failover a node that was already failed over, ns_server would complain that the node was already failed over and the failover wouldn't get processed. (This is why already failed over nodes don't show in the multi-node failover dialog.)

            In one sense adjusting the UI to allow for this might be as simple as adding all nodes to the multi-node failover dialog.

            However, there's a complicating factor here: we only allow re-failing over a node when the second failover is unsafe. I also think it's important for us to distinguish between regular failover and unsafe failover to a greater degree than we do today. For instance, in the multi-node failover dialog perhaps we do something like the following:

            • When the user opens the dialog it looks as it does today and doesn't include failed over nodes
            • If the user clicks failover and the failover times out as there's a majority of nodes down, then instead of prompting the user with a "Confirm Failover" dialog we return the user to the multi-node failover dialog which now has a changed title of "Unsafe Failover" and now we include the previously failed over nodes.
            • We will also need to link to documentation on what unsafe failover means and what the user needs to do to help ensure that they are doing the right thing.
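
The dialog behavior proposed in the bullets above can be sketched as a small state transition. This is a rough illustration only: the `FailoverDialogState` shape, the `FailoverOutcome` values, and the regular-mode title are hypothetical placeholders, and only the "Unsafe Failover" title comes from the proposal.

```typescript
type DialogMode = "regular" | "unsafe";

// Hypothetical dialog state; names are assumptions for illustration.
interface FailoverDialogState {
  mode: DialogMode;
  title: string;
  includeFailedOverNodes: boolean;
}

// Hypothetical outcomes of a failover attempt as seen by the UI.
type FailoverOutcome = "ok" | "quorumTimeout" | "otherError";

// The dialog opens as it does today: failed over nodes are not listed.
// The regular-mode title is a placeholder, not the real UI string.
const initialDialog: FailoverDialogState = {
  mode: "regular",
  title: "Failover Nodes",
  includeFailedOverNodes: false,
};

// If a regular failover times out because a majority of nodes is down,
// return to the same dialog in unsafe mode instead of confirming.
function nextDialogState(
  state: FailoverDialogState,
  outcome: FailoverOutcome
): FailoverDialogState {
  if (state.mode === "regular" && outcome === "quorumTimeout") {
    return {
      mode: "unsafe",
      title: "Unsafe Failover",
      includeFailedOverNodes: true,
    };
  }
  return state;
}

const afterTimeout = nextDialogState(initialDialog, "quorumTimeout");
const afterSuccess = nextDialogState(initialDialog, "ok");
```

Keeping the mode in the dialog state makes the distinction between regular and unsafe failover explicit, and gives one place to attach the documentation link for unsafe mode.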

            It would probably be a good idea to discuss this and think about options.

            CC: Pavel Blagodov

            dfinlay Dave Finlay added a comment -

            Rob and I have discussed; he will do the next step of figuring out a good way to get the user through regular failover to unsafe failover including allowing the user to select already failed-over nodes.

            rob.ashcom Rob Ashcom added a comment -
            1. User tries regular hard failover
            2. Gets error back from ns_server
            3. Instead of confirmation dialog, go to --->  multi-node dialog again with ALL nodes present + warning text
            4. If user continues, show confirmation dialog
            rob.ashcom Rob Ashcom added a comment -

            We should probably alter the steps outlined above to just remove that final confirmation dialog. See the new screenshot for how the multi-node failover dialog changes in this case – there's a sufficient amount of warning going on there.
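
With the final confirmation removed, continuing from the unsafe dialog would lead straight to the failover request. The following is a minimal sketch of how such a request body might be assembled. It assumes a form-encoded body with one `otpNode` field per selected node plus an `allowUnsafe` flag; these field names, the `buildFailoverBody` helper, and the `ns_1@...` node names are assumptions for illustration and should be checked against the ns_server REST documentation.

```typescript
// Builds a form-encoded body for a multi-node failover request.
// Field names (otpNode, allowUnsafe) are assumptions for this sketch.
function buildFailoverBody(otpNodes: string[], allowUnsafe: boolean): string {
  const pairs = otpNodes.map((node) => "otpNode=" + encodeURIComponent(node));
  if (allowUnsafe) {
    // Only sent when the user continued from the unsafe failover dialog.
    pairs.push("allowUnsafe=true");
  }
  return pairs.join("&");
}

// Hypothetical node names for the scenario in the description:
// n2 and n3 are still active, n4 was already failed over.
const unsafeBody = buildFailoverBody(["ns_1@n2", "ns_1@n3", "ns_1@n4"], true);
const regularBody = buildFailoverBody(["ns_1@n2"], false);
```

The flag is left out entirely for a regular failover, so the unsafe path is only taken when the user has explicitly gone through the unsafe dialog.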


            build-team Couchbase Build Team added a comment -

            Build couchbase-server-7.0.0-4987 contains ns_server commit 8c17267 with commit message:
            MB-45433: allow failing over inactive nodes when allowUnsafe

            deepika.verma Deepika Verma (Inactive) added a comment -

            Verified on 4993.

            Repeated all of the steps above in a 4-node cluster and observed the error.

            Selected failover but no failover happened. Then selected failover from the top right corner.

            Observed the first failed-over node in the list.

            Selected Failover Node Unsafe Mode with all of the first 3 nodes.

            So the UI allowed failover of the inactive node along with the other 2 nodes.

            People

              pavel Pavel Blagodov
              Abhijeeth.Nuthan Abhijeeth Nuthan