Couchbase Server / MB-62236

Subsequent node fails over even if it comes back up before timeout


Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: 7.6.2
    • Component/s: ns_server
    • Environment: Enterprise Edition 7.6.2 build 3674

    Description

      Steps:

      1. Create a 6-node cluster:

      172.23.104.235, 172.23.104.241, 172.23.104.250, 172.23.136.103, 172.23.136.104, 172.23.96.197
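
      For reference, a minimal couchbase-cli sketch for building such a cluster (credentials, service list, and RAM quota are placeholders, not taken from this report):

      couchbase-cli cluster-init -c 172.23.104.235 --cluster-username Administrator \
        --cluster-password password --services data --cluster-ramsize 1024
      couchbase-cli server-add -c 172.23.104.235 -u Administrator -p password \
        --server-add 172.23.104.241 --server-add-username Administrator \
        --server-add-password password --services data   # repeat for the remaining nodes
      couchbase-cli rebalance -c 172.23.104.235 -u Administrator -p password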

      2. Enable auto-failover with the following settings:
      timeout: 90 seconds, max events: 5
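
      One way to set these values through the REST API (host and credentials are assumptions, mirroring the curl used below; not part of the original report):

      curl -u Administrator:password http://localhost:8091/settings/autoFailover \
        -d 'enabled=true' -d 'timeout=90' -d 'maxCount=5'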

      3. Enable a delay in the auto-failover path on .197 (the node orchestrating the failover, per the logs below):

      curl -k https://Administrator:password@localhost:18091/diag/eval -X POST -d 'testconditions:set(failover_end, {delay, 120000}).'
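
      (The testconditions hook above appears to hold the failover near its end for 120,000 ms, which keeps the first auto-failover in flight long enough for the next step to overlap with it.)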

      4. Bring down the data node 172.23.104.241 to trigger auto-failover.
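
      One way to take the node down, assuming a systemd-managed install (not part of the original report):

      ssh root@172.23.104.241 'systemctl stop couchbase-server'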

      ns_server:info,2024-06-09T22:39:53.095-07:00,ns_1@172.23.96.197:<0.1536.0>:ns_orchestrator:handle_event:670]Skipping janitor in state rebalancing
      [user:info,2024-06-09T22:39:56.544-07:00,ns_1@172.23.96.197:<0.1536.0>:ns_orchestrator:log_rebalance_completion:1661]Failover completed successfully.
      Rebalance Operation Id = 80a2b06c3374f8c4b7593f1629673562
      [ns_server:info,2024-06-09T22:39:56.602-07:00,ns_1@172.23.96.197:leader_registry<0.887.0>:leader_registry:handle_down:286]Process <0.17680.310> registered as 'ns_rebalance_observer' terminated.
      [user:info,2024-06-09T22:39:56.603-07:00,ns_1@172.23.96.197:<0.1538.0>:auto_failover:log_failover_success:662]Node ('ns_1@172.23.104.241') was automatically failed over. Reason: All monitors report node is unhealthy.
      [user:info,2024-06-09T22:39:56.969-07:00,ns_1@172.23.96.197:<0.25807.310>:failover:orchestrate:172]Starting failing over ['ns_1@172.23.136.103']
      

      5. Bring down another node, 172.23.136.103, while the first failover is still in progress, then quickly bring it back up, well within the 90-second timeout.
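
      One way to produce the short outage, again assuming a systemd-managed install (the ~15 s gap matches the down/up timestamps in the logs below):

      ssh root@172.23.136.103 'systemctl stop couchbase-server; sleep 15; systemctl start couchbase-server'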

      [chronicle:info,2024-06-09T22:38:01.488-07:00,ns_1@172.23.96.197:chronicle_proposer<0.10208.162>:chronicle_proposer:handle_down:1142]Observed agent {chronicle_agent,'ns_1@172.23.136.103'} on peer 'ns_1@172.23.136.103' go down with reason noconnection
      [user:warn,2024-06-09T22:38:01.488-07:00,ns_1@172.23.96.197:ns_node_disco<0.598.0>:ns_node_disco:handle_info:169]Node 'ns_1@172.23.96.197' saw that node 'ns_1@172.23.136.103' went down. Details: [{nodedown_reason,
                                                                                          connection_closed}]
      [chronicle:info,2024-06-09T22:38:01.489-07:00,ns_1@172.23.96.197:chronicle_proposer<0.10208.162>:chronicle_proposer:handle_nodedown:1135]Peer 'ns_1@172.23.136.103' went down: [{nodedown_reason,connection_closed}]
      [ns_server:info,2024-06-09T22:38:01.489-07:00,ns_1@172.23.96.197:ns_node_disco_events<0.596.0>:ns_node_disco_log:handle_event:40]ns_node_disco_log: nodes changed: ['ns_1@172.23.104.235',
                                         'ns_1@172.23.104.250',
                                         'ns_1@172.23.136.104','ns_1@172.23.96.197']
      [ns_server:info,2024-06-09T22:38:03.073-07:00,ns_1@172.23.96.197:<0.1536.0>:ns_orchestrator:handle_event:670]Skipping janitor in state rebalancing
      

      Logs showing the node coming back up:

      [user:info,2024-06-09T22:38:15.470-07:00,ns_1@172.23.96.197:ns_node_disco<0.598.0>:ns_node_disco:handle_info:163]Node 'ns_1@172.23.96.197' saw that node 'ns_1@172.23.136.103' came up. Tags: []
      [chronicle:info,2024-06-09T22:38:15.470-07:00,ns_1@172.23.96.197:chronicle_proposer<0.10208.162>:chronicle_proposer:handle_nodeup:1093]Peer 'ns_1@172.23.136.103' came up
      [ns_server:info,2024-06-09T22:38:15.471-07:00,ns_1@172.23.96.197:ns_node_disco_events<0.596.0>:ns_node_disco_log:handle_event:40]ns_node_disco_log: nodes changed: ['ns_1@172.23.104.235',
                                         'ns_1@172.23.104.250',
                                         'ns_1@172.23.136.103',
                                         'ns_1@172.23.136.104','ns_1@172.23.96.197']
      [ns_server:info,2024-06-09T22:38:15.471-07:00,ns_1@172.23.96.197:ns_config_rep<0.615.0>:ns_config_rep:handle_info:258]Replicating config to/from:
      ['ns_1@172.23.136.103']
      

      But eventually 172.23.136.103 is auto-failed over anyway, right after the delayed failover of .241 completes, even though it came back up well before the 90-second timeout:

      [user:info,2024-06-09T22:39:57.450-07:00,ns_1@172.23.96.197:<0.25807.310>:failover:orchestrate:184]Failed over ['ns_1@172.23.136.103']: ok
      [ns_server:info,2024-06-09T22:39:57.451-07:00,ns_1@172.23.96.197:leader_quorum_nodes_manager<0.1464.0>:leader_quorum_nodes_manager:handle_set_quorum_nodes:121]Updating quorum nodes.
      Old quorum nodes: ['ns_1@172.23.104.250','ns_1@172.23.136.103',
                         'ns_1@172.23.136.104','ns_1@172.23.104.235',
                         'ns_1@172.23.96.197']
      New quorum nodes: ['ns_1@172.23.104.250','ns_1@172.23.136.104',
                         'ns_1@172.23.104.235','ns_1@172.23.96.197']
      [user:info,2024-06-09T22:39:57.463-07:00,ns_1@172.23.96.197:<0.25807.310>:failover:deactivate_nodes:241]Deactivating failed over nodes ['ns_1@172.23.136.103']
      [user:info,2024-06-09T22:39:57.601-07:00,ns_1@172.23.96.197:<0.1536.0>:ns_orchestrator:log_rebalance_completion:1661]Failover completed successfully.
      Rebalance Operation Id = e4e1005a5888374306dd19fa147f8051
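
      To confirm the outcome, cluster membership can be checked over the REST API (a generic sketch, not taken from this report); after the second failover 172.23.136.103 shows clusterMembership "inactiveFailed":

      curl -s -u Administrator:password http://localhost:8091/pools/default | \
        jq '.nodes[] | {hostname, clusterMembership, status}'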
      

      The same behaviour is seen when the subsequent node runs a different service.

       

    People

      Assignee: Pulkit Matta
      Reporter: Pulkit Matta
