Couchbase Server / MB-49849

MultiNodeFailover: Node is failed over immediately after its service is restarted, once the failover timeout period has already elapsed, with the reason "failed to acquire lease"

Details

    Description

      Build: 7.1.0-1787

      Scenario:

      • 7 node cluster
      • Couchbase bucket with replicas=2
      • Set auto-failover with max_events=10 and timeout=10
      • Stop the Couchbase service on all index nodes (172.23.105.245 [index+query], 172.23.100.15 [index], 172.23.100.13 [index+backup])
      • Auto-failover was attempted but not performed, with the reason:

        Number of remaining nodes that are running index service is 0. You need at least 1 nodes.

      • Bring all 3 nodes back up by starting the Couchbase service
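The rejection in the scenario above can be illustrated with a simplified model of a per-service safety check. This is only a sketch and not ns_server's actual implementation: the function names, the `MIN_INDEX_NODES` constant, and the data layout are hypothetical. Only the node addresses, their services, and the "at least 1" threshold come from this report.

```python
# Simplified, hypothetical model of the service safety check that blocked
# the auto-failover. Not ns_server code; names are illustrative only.

# Index nodes and their services, as listed in the scenario.
INDEX_NODES = {
    "172.23.105.245": {"index", "query"},
    "172.23.100.15": {"index"},
    "172.23.100.13": {"index", "backup"},
}

# Taken from the message "You need at least 1 nodes."
MIN_INDEX_NODES = 1


def remaining_after_failover(services_by_node, candidates, service):
    """Count nodes that would still run `service` if `candidates` were failed over."""
    return sum(
        1
        for node, services in services_by_node.items()
        if service in services and node not in candidates
    )


def failover_allowed(services_by_node, candidates, service, minimum):
    """Return True if enough nodes running `service` would remain."""
    return remaining_after_failover(services_by_node, candidates, service) >= minimum


down = set(INDEX_NODES)  # all three index nodes were stopped
remaining = remaining_after_failover(INDEX_NODES, down, "index")
allowed = failover_allowed(INDEX_NODES, down, "index", MIN_INDEX_NODES)
print(remaining, allowed)  # prints: 0 False
```

Under this model, stopping every index node leaves zero nodes running the index service, which matches the count in the quoted reason.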

      Observation:

      The master node (.155) saw that node .245 had come up, but failed to acquire a lease from it, which triggered the failover procedure.

      ns_server.info.log of 172.23.105.155:

      [user:info,2021-11-30T23:18:35.675-08:00,ns_1@172.23.105.155:ns_node_disco<0.435.0>:ns_node_disco:handle_info:177]Node 'ns_1@172.23.105.155' saw that node 'ns_1@172.23.105.245' came up. Tags: []
      [ns_server:info,2021-11-30T23:18:35.677-08:00,ns_1@172.23.105.155:ns_node_disco_events<0.434.0>:ns_node_disco_log:handle_event:40]ns_node_disco_log: nodes changed: ['ns_1@172.23.100.13','ns_1@172.23.100.14',
                                         'ns_1@172.23.100.15','ns_1@172.23.105.155',
                                         'ns_1@172.23.105.211',
                                         'ns_1@172.23.105.212',
                                         'ns_1@172.23.105.213',
                                         'ns_1@172.23.105.244',
                                         'ns_1@172.23.105.245']
      [ns_server:warn,2021-11-30T23:18:35.678-08:00,ns_1@172.23.105.155:<0.14497.249>:leader_lease_acquire_worker:handle_exception:244]Failed to acquire lease from 'ns_1@172.23.105.245': {exit,
                                                           {noproc,
                                                            {gen_server,call,
                                                             [{leader_lease_agent,
                                                               'ns_1@172.23.105.245'},
                                                              {acquire_lease,
                                                               'ns_1@172.23.105.155',
                                                               <<"1486f6a8ae4cd211d1a68767e369fae7">>,
                                                               [{timeout,15000},
                                                                {period,15000}]},
                                                              infinity]}}}
       
      [user:info,2021-11-30T23:18:36.534-08:00,ns_1@172.23.105.155:<0.13944.249>:failover:orchestrate:150]Starting failing over ['ns_1@172.23.105.245']
      [user:info,2021-11-30T23:18:36.535-08:00,ns_1@172.23.105.155:<0.11668.0>:ns_orchestrator:handle_start_failover:1658]Starting failover of nodes ['ns_1@172.23.105.245']. Operation Id = e98a8193622d9ae1c8f83c644603b396
       

          People

            ashwin.govindarajulu Ashwin Govindarajulu