Couchbase Server / MB-46131

Force multiple failover dialog when multiple nodes are unresponsive and user attempts to failover one of them


Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version: 7.0.0
    • Fix Version: Morpheus
    • Component: UI
    • Environment: CentOS 7 64-bit; CB EE 7.0.0-5085
    • 1

    Description

      Summary:
      When multiple nodes are unresponsive, the preferred way to fail over is to fail over all of them at once (multi-node failover). The ask here is to force the multi-node failover dialog in the UI when the user attempts to fail over just one of the unresponsive nodes (by clicking the failover option next to that server).
      (Note that this is not a quorum failover)

      Current behaviour, illustrated with an example:
      1. Create a 5-node cluster: .215, .217, .219, .237, .90
      2. Load travel-sample with 3 replicas.
      3. Stop the server on .217 and .219 to make these 2 nodes unresponsive.
      Here is what currently happens when the user attempts to fail the nodes over individually, one by one (instead of failing them over together):

      On the UI:
      The UI didn't return any response; it appeared to keep processing the failover indefinitely.

      REST API:
      The request fails with an "unexpected server error" response.
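      For context, a hard failover goes through `POST /controller/failOver` with `otpNode` form parameters (per the Couchbase REST API docs; the hosts below are the ones from the repro). A minimal sketch of the two request bodies involved, using the standard `URLSearchParams` serializer:

```typescript
// Failing over a single unresponsive node -- the kind of request that hit
// the "unexpected server error" in this scenario:
const single = new URLSearchParams([["otpNode", "ns_1@172.23.105.217"]]);

// Failing over both unresponsive nodes in one request -- the preferred path
// this ticket asks the UI to steer users toward; repeated otpNode
// parameters select multiple nodes at once:
const both = new URLSearchParams([
  ["otpNode", "ns_1@172.23.105.217"],
  ["otpNode", "ns_1@172.23.105.219"],
]);

// Either body would be POSTed to http://<cluster-node>:8091/controller/failOver
console.log(single.toString()); // otpNode=ns_1%40172.23.105.217
console.log(both.toString());
```

      This is only a sketch of the request shape, not the actual test driver used for the repro.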

      In ns_server_error.log:

      [ns_server:error,2021-05-05T00:26:40.393-07:00,ns_1@172.23.105.215:<0.16783.3>:ns_doctor:wait_statuses_loop:251]Couldn't get statuses for ['ns_1@172.23.105.219']
      [ns_server:error,2021-05-05T00:26:40.393-07:00,ns_1@172.23.105.215:<0.16155.3>:menelaus_util:reply_server_error:206]Server error during processing: ["web request failed",
                                       {path,"/pools/default"},
                                       {method,'POST'},
                                       {type,error},
                                       {what,
                                        {badmatch,
                                         {error,{timeout,['ns_1@172.23.105.219']}}}},
                                       {trace,
                                        [{menelaus_web_pools,
                                          do_validate_memory_quota,4,
                                          [{file,"src/menelaus_web_pools.erl"},
                                           {line,407}]},
                                         {lists,foldl,3,
                                          [{file,"lists.erl"},{line,1263}]},
                                         {validator,handle,4,
                                          [{file,"src/validator.erl"},{line,79}]},
                                         {menelaus_web_pools,
                                          do_handle_pool_settings_post_loop,2,
                                          [{file,"src/menelaus_web_pools.erl"},
                                           {line,451}]},
                                         {request_throttler,do_request,3,
                                          [{file,"src/request_throttler.erl"},
                                           {line,58}]},
                                         {menelaus_util,handle_request,2,
                                          [{file,"src/menelaus_util.erl"},
                                           {line,217}]},
                                         {mochiweb_http,headers,6,
                                          [{file,
                                            "/home/couchbase/jenkins/workspace/couchbase-server-unix/couchdb/src/mochiweb/mochiweb_http.erl"},
                                           {line,150}]},
                                         {proc_lib,init_p_do_apply,3,
                                          [{file,"proc_lib.erl"},{line,249}]}]}]
      [ns_server:error,2021-05-05T00:27:52.231-07:00,ns_1@172.23.105.215:<0.21777.3>:rebalance:progress:147]Couldn't reach ns_rebalance_observer
      [ns_server:error,2021-05-05T00:28:02.609-07:00,ns_1@172.23.105.215:<0.21641.3>:ns_rebalance_observer:generic_get_call:108]Unexpected exception {exit,
                               {noproc,
                                   {gen_server,call,
                                       [{via,leader_registry,ns_rebalance_observer},
                                        get_aggregated_progress,10000]}}}
      [ns_server:error,2021-05-05T00:28:02.609-07:00,ns_1@172.23.105.215:<0.21641.3>:rebalance:progress:147]Couldn't reach ns_rebalance_observer
      [ns_server:error,2021-05-05T00:28:13.282-07:00,ns_1@172.23.105.215:<0.28029.3>:ns_rebalance_observer:generic_get_call:108]Unexpected exception {exit,
                               {noproc,
                                   {gen_server,call,
                                       [{via,leader_registry,ns_rebalance_observer},
                                        get_aggregated_progress,10000]}}}

      (Note that failing them over one by one may still work if the bucket didn't have 3 replicas, I think.)


        Activity

          Dave Finlay (dfinlay) added a comment (edited):

          Thanks Sumedh Basarkod. This behavior:

          The UI didn't return any response; it appeared to keep processing the failover indefinitely.

          sounds like a bug that we should fix.

          I think the idea of popping the multi-node failover dialog when the failover button associated with a single node is clicked is a good one. If we were to do this, I'd pop the multi-node failover dialog always and immediately select the node that's identified to be failed over. The issue is that the multi-node failover dialog currently doesn't let users do a graceful failover - the only option you get is hard. So, it's a bit more work to unify the flow.
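          The decision Dave describes can be sketched as a small pure function: always open the multi-node dialog when the clicked node is one of several unresponsive nodes, with the clicked node preselected. All type and function names below are hypothetical, not the actual ns_server UI code:

```typescript
interface NodeStatus {
  hostname: string;
  healthy: boolean;
}

// Decide whether to force the multi-node failover dialog, and which node
// to preselect in it, when the user clicks failover on a single node.
function failoverDialogPlan(
  clicked: string,
  nodes: NodeStatus[],
): { multiDialog: boolean; preselected: string[] } {
  const down = nodes.filter((n) => !n.healthy).map((n) => n.hostname);
  // Force the multi-node dialog only when more than one node is down and
  // the clicked node is among them; otherwise the single-node flow applies.
  const multiDialog = down.length > 1 && down.includes(clicked);
  return { multiDialog, preselected: [clicked] };
}
```

          For example, with .217 and .219 both down, clicking failover on .217 would yield `multiDialog: true` with .217 preselected; with only one node down, the existing single-node flow (including graceful failover) would remain available.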

          I think we should:

          1. fix the bug where the UI hangs infinitely
          2. move this improvement to CC.Next to capture the idea of unifying the failover flows

          I'll move this ticket out. Sumedh Basarkod: would you mind filing the indefinite waiting bug against the UI?


          Sumedh Basarkod (sumedh.basarkod) added a comment:

          Sure, Dave. Opened https://issues.couchbase.com/browse/MB-46158 to track the UI issue.


          People

            Assignee: Dave Finlay (dfinlay)
            Reporter: Sumedh Basarkod (sumedh.basarkod)
            Votes: 0
            Watchers: 6


              Gerrit Reviews

                There are no open Gerrit changes
