Couchbase Server / MB-61881

Ticks for a subsequent node down do not seem to happen during an ongoing failover


Details

    Description

      Steps
      1. Create a 6-node cluster (a CLI sketch follows the node list):
      172.23.136.104 - data
      172.23.136.106 - data
      172.23.136.109 - data
      172.23.136.110 - query
      172.23.136.114 - index
      172.23.136.115 - data
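      For reference, one way to assemble such a cluster might be with couchbase-cli; a rough sketch (hostnames and credentials taken from the steps above, exact flags can vary by release):

       # run against an already-initialised first node; repeat server-add per node/service
       couchbase-cli server-add -c 172.23.136.110:8091 -u Administrator -p password \
           --server-add 172.23.136.104 --server-add-username Administrator \
           --server-add-password password --services data
       couchbase-cli rebalance -c 172.23.136.110:8091 -u Administrator -p password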

      2. Hit the API that delays autofailover by 1 minute:

       curl -k https://Administrator:password@localhost:18091/diag/eval -X POST -d 'testconditions:set(failover_start, {delay,60000 })'

      3. Set the auto-failover timeout to 60 seconds and the maximum count to 2 nodes (see the sketch below).
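      Assuming the standard auto-failover settings endpoint, this step might look like:

       curl -k -u Administrator:password -X POST https://localhost:18091/settings/autoFailover \
           -d 'enabled=true&timeout=60&maxCount=2'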

      4. Bring down .104; autofailover starts for .104 (one way to take the node down is sketched after the log excerpt):

      [user:info,2024-05-13T22:48:31.455-07:00,ns_1@172.23.136.110:<0.17382.6>:failover:orchestrate:172]Starting failing over ['ns_1@172.23.136.104']
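      One way to take a node down for this step (and for .109 in step 5) might be stopping the service on it, assuming a systemd-managed install; the nodedown_reason of shutdown seen later is consistent with a clean stop:

       ssh root@172.23.136.104 'systemctl stop couchbase-server'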
      

      5. While autofailover is delayed, bring down a second node, .109, roughly in the middle of the ongoing failover, after ~30 seconds have passed:

      [user:warn,2024-05-13T22:49:02.053-07:00,ns_1@172.23.136.106:ns_node_disco<0.7214.0>:ns_node_disco:handle_info:169]Node 'ns_1@172.23.136.106' saw that node 'ns_1@172.23.136.109' went down. Details: [{nodedown_reason, shutdown}]
      

      6. Autofailover for .104 fails, likely because it cannot activate replicas on .109 (now down):

      [user:error,2024-05-13T22:49:31.524-07:00,ns_1@172.23.136.110:<0.8883.0>:ns_orchestrator:log_rebalance_completion:1661]Failover exited with reason {failover_failed,"gamesim-sample",
                                      "Failed to get failover info for bucket \"gamesim-sample\": ['ns_1@172.23.136.109']"}.
      Rebalance Operation Id = 14aa0cd61ddc898532fcb445e44e14fc
      

      Now the next failover, of both .104 and .109, was expected within roughly another 30 seconds: the auto-failover timeout is 60 seconds, .109 had already been down for about 30 seconds when the first failover exited, and its down ticks should have kept accumulating while the AFO was delayed. Instead, the next failover took roughly 60 more seconds; the timestamp arithmetic is sketched after the log excerpt below.

      [user:info,2024-05-13T22:50:33.094-07:00,ns_1@172.23.136.110:<0.25124.6>:failover:orchestrate:184]Failed over ['ns_1@172.23.136.104','ns_1@172.23.136.109']: ok
      [ns_server:info,2024-05-13T22:50:33.095-07:00,ns_1@172.23.136.110:leader_quorum_nodes_manager<0.8852.0>:leader_quorum_nodes_manager:handle_set_quorum_nodes:121]Updating quorum nodes.
      Old quorum nodes: ['ns_1@172.23.136.110','ns_1@172.23.136.104',
                         'ns_1@172.23.136.114','ns_1@172.23.136.115',
                         'ns_1@172.23.136.106','ns_1@172.23.136.109']
      New quorum nodes: ['ns_1@172.23.136.110','ns_1@172.23.136.114',
                         'ns_1@172.23.136.115','ns_1@172.23.136.106']
      [ns_server:error,2024-05-13T22:50:33.105-07:00,ns_1@172.23.136.110:leader_quorum_nodes_manager<0.8852.0>:ns_config_rep:synchronize_remote:356]Failed to synchronize config to some nodes: 
      [{'ns_1@172.23.136.109',
           {exit,
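      The expected-versus-observed gap can be checked from the timestamps above; a quick sketch (GNU date assumed, sub-second parts dropped):

       down=$(date -d '2024-05-13T22:49:02' +%s)        # .109 seen down
       first_fail=$(date -d '2024-05-13T22:49:31' +%s)  # failover of .104 exits with failover_failed
       actual=$(date -d '2024-05-13T22:50:33' +%s)      # .104 and .109 actually failed over
       echo "expected wait after the failed failover: $(( down + 60 - first_fail )) s"  # ~31 s if ticks kept accumulating
       echo "observed wait after the failed failover: $(( actual - first_fail )) s"     # ~62 s, ticks appear to restart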

       
