Couchbase Server / MB-49849

MultiNodeFailover: Failover triggered on the node immediately after restarting the service, after the actual failover timeout period, due to the reason "failed to acquire lease"


Details

    Description

      Build: 7.1.0-1787

      Scenario:

      • 7 node cluster
      • Couchbase bucket with replicas=2
      • Set auto-failover with max_events=10 and timeout=10 seconds (a sketch of the corresponding REST call follows this list)
      • Stop the Couchbase service on all index nodes (172.23.105.245 [index+query], 172.23.100.15 [index], 172.23.100.13 [index+backup])
      • Failover was attempted but not performed, with the reason:

        Number of remaining nodes that are running index service is 0. You need at least 1 nodes.

      • Bring all 3 nodes back by starting the couchbase-service
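
      For reference, the auto-failover settings above could be applied roughly like this. This is a minimal sketch using the cluster REST API; the master address and credentials are placeholders, and the /settings/autoFailover parameter names should be verified against the documentation for this build:

      import requests

      MASTER = "http://172.23.105.155:8091"   # placeholder master node address
      AUTH = ("Administrator", "password")    # placeholder admin credentials

      # Enable auto-failover with a 10 second timeout and up to 10 events,
      # mirroring the scenario in this ticket.
      requests.post(
          f"{MASTER}/settings/autoFailover",
          auth=AUTH,
          data={"enabled": "true", "timeout": 10, "maxCount": 10},
      ).raise_for_status()

      # Read the settings back to confirm what was applied.
      print(requests.get(f"{MASTER}/settings/autoFailover", auth=AUTH).json())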

      Observation:

      The master node (.155) saw that the node (.245) was up, but failed to acquire a lease from it, which resulted in the failover procedure.

      ns_server.info.log of 172.23.105.155:

      [user:info,2021-11-30T23:18:35.675-08:00,ns_1@172.23.105.155:ns_node_disco<0.435.0>:ns_node_disco:handle_info:177]Node 'ns_1@172.23.105.155' saw that node 'ns_1@172.23.105.245' came up. Tags: []
      [ns_server:info,2021-11-30T23:18:35.677-08:00,ns_1@172.23.105.155:ns_node_disco_events<0.434.0>:ns_node_disco_log:handle_event:40]ns_node_disco_log: nodes changed: ['ns_1@172.23.100.13','ns_1@172.23.100.14',
                                         'ns_1@172.23.100.15','ns_1@172.23.105.155',
                                         'ns_1@172.23.105.211',
                                         'ns_1@172.23.105.212',
                                         'ns_1@172.23.105.213',
                                         'ns_1@172.23.105.244',
                                         'ns_1@172.23.105.245']
      [ns_server:warn,2021-11-30T23:18:35.678-08:00,ns_1@172.23.105.155:<0.14497.249>:leader_lease_acquire_worker:handle_exception:244]Failed to acquire lease from 'ns_1@172.23.105.245': {exit,
                                                           {noproc,
                                                            {gen_server,call,
                                                             [{leader_lease_agent,
                                                               'ns_1@172.23.105.245'},
                                                              {acquire_lease,
                                                               'ns_1@172.23.105.155',
                                                               <<"1486f6a8ae4cd211d1a68767e369fae7">>,
                                                               [{timeout,15000},
                                                                {period,15000}]},
                                                              infinity]}}}
       
      [user:info,2021-11-30T23:18:36.534-08:00,ns_1@172.23.105.155:<0.13944.249>:failover:orchestrate:150]Starting failing over ['ns_1@172.23.105.245']
      [user:info,2021-11-30T23:18:36.535-08:00,ns_1@172.23.105.155:<0.11668.0>:ns_orchestrator:handle_start_failover:1658]Starting failover of nodes ['ns_1@172.23.105.245']. Operation Id = e98a8193622d9ae1c8f83c644603b396
       


        Activity

          ashwin.govindarajulu Ashwin Govindarajulu created issue -

          ashwin.govindarajulu Ashwin Govindarajulu added a comment -

          Note:

          Tried (once) the same scenario with the failover timeout set to 120 seconds.

          Could not reproduce the same issue.

          ashwin.govindarajulu Ashwin Govindarajulu made changes -
          Field: Link to Log File, atop/blg, CBCollectInfo, Core dump
          New value:
          https://cb-engineering.s3.amazonaws.com/node_failed_over_after_started/collectinfo-2021-12-01T072050-ns_1%40172.23.100.13.zip
          https://cb-engineering.s3.amazonaws.com/node_failed_over_after_started/collectinfo-2021-12-01T072050-ns_1%40172.23.100.14.zip
          https://cb-engineering.s3.amazonaws.com/node_failed_over_after_started/collectinfo-2021-12-01T072050-ns_1%40172.23.100.15.zip
          https://cb-engineering.s3.amazonaws.com/node_failed_over_after_started/collectinfo-2021-12-01T072050-ns_1%40172.23.105.155.zip
          https://cb-engineering.s3.amazonaws.com/node_failed_over_after_started/collectinfo-2021-12-01T072050-ns_1%40172.23.105.211.zip
          https://cb-engineering.s3.amazonaws.com/node_failed_over_after_started/collectinfo-2021-12-01T072050-ns_1%40172.23.105.212.zip
          https://cb-engineering.s3.amazonaws.com/node_failed_over_after_started/collectinfo-2021-12-01T072050-ns_1%40172.23.105.213.zip
          https://cb-engineering.s3.amazonaws.com/node_failed_over_after_started/collectinfo-2021-12-01T072050-ns_1%40172.23.105.244.zip
          https://cb-engineering.s3.amazonaws.com/node_failed_over_after_started/collectinfo-2021-12-01T072050-ns_1%40172.23.105.245.zip
          http://supportal.couchbase.com/snapshot/6863003a4db70cc2d65373ed61c5417e::1


          artem Artem Stemkovski added a comment -

          First of all, the "Failed to acquire lease" message is benign. It happens due to a race between the time when the node becomes visible over the network and the time when it starts its leader_lease_agent, so other nodes might not be able to acquire the lease on the first attempt.
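
          To make the race concrete, here is a minimal illustrative sketch (hypothetical, not the actual ns_server / leader_lease_acquire_worker code) of a caller that tolerates the remote agent not being registered yet:

          import time

          class AgentNotRunning(Exception):
              """Stand-in for the {noproc, ...} exit seen in the log: the remote
              leader_lease_agent process is not registered yet."""

          def acquire_lease_with_retry(acquire_once, attempts=5, delay=1.0):
              # acquire_once is a hypothetical callable that performs a single
              # acquisition attempt against the remote agent.
              for _ in range(attempts):
                  try:
                      return acquire_once()
                  except AgentNotRunning:
                      # Benign race: the node is already visible on the network
                      # but its lease agent has not started yet; retry shortly.
                      time.sleep(delay)
              raise TimeoutError("lease agent did not come up in time")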

          The overall behavior of autofailover here is correct.

          After the 3 nodes have been down for the auto-failover timeout, all 3 of them reach the "failover" state:

          [ns_server:debug,2021-11-30T22:33:24.798-08:00,ns_1@172.23.105.155:<0.11670.0>:auto_failover_logic:log_master_activity:145]Incremented down state:
          {node_state,{'ns_1@172.23.100.13',<<"af23abc3889b98648fc9bd7947ea0c40">>},
                      1,nearly_down,false}
          ->{node_state,{'ns_1@172.23.100.13',<<"af23abc3889b98648fc9bd7947ea0c40">>},
                        1,failover,false}
          [ns_server:debug,2021-11-30T22:33:24.799-08:00,ns_1@172.23.105.155:<0.11670.0>:auto_failover_logic:log_master_activity:145]Incremented down state:
          {node_state,{'ns_1@172.23.100.15',<<"39d292df78ebf2e24c1f926b4cd0a277">>},
                      1,nearly_down,true}
          ->{node_state,{'ns_1@172.23.100.15',<<"39d292df78ebf2e24c1f926b4cd0a277">>},
                        1,failover,true}
          [ns_server:debug,2021-11-30T22:33:24.799-08:00,ns_1@172.23.105.155:<0.11670.0>:auto_failover_logic:log_master_activity:145]Incremented down state:
          {node_state,{'ns_1@172.23.105.245',<<"33bb8709d8c28575d3d1842639621669">>},
                      1,nearly_down,true}
          ->{node_state,{'ns_1@172.23.105.245',<<"33bb8709d8c28575d3d1842639621669">>},
                        1,failover,true}
          

          At this point the auto-failover logic determines that those nodes cannot be automatically failed over, since no index nodes would be left in the cluster as a result:

          [ns_server:debug,2021-11-30T22:33:24.800-08:00,ns_1@172.23.105.155:<0.11670.0>:auto_failover_logic:process_frame:324]Decided on following actions: [{mail_too_small,index,[],
                                             {'ns_1@172.23.100.13',
                                                 <<"af23abc3889b98648fc9bd7947ea0c40">>}},
                                         {mail_too_small,index,[],
                                             {'ns_1@172.23.100.15',
                                                 <<"39d292df78ebf2e24c1f926b4cd0a277">>}},
                                         {mail_too_small,index,[],
                                             {'ns_1@172.23.105.245',
                                                 <<"33bb8709d8c28575d3d1842639621669">>}}]
          [user:info,2021-11-30T22:33:24.801-08:00,ns_1@172.23.105.155:<0.11670.0>:auto_failover:process_action:405]Could not auto-failover node ('ns_1@172.23.100.13'). Number of remaining nodes that are running index service is 0. You need at least 1 nodes.
          [user:info,2021-11-30T22:33:24.801-08:00,ns_1@172.23.105.155:<0.11670.0>:auto_failover:process_action:405]Could not auto-failover node ('ns_1@172.23.100.15'). Number of remaining nodes that are running index service is 0. You need at least 1 nodes.
          [user:info,2021-11-30T22:33:24.801-08:00,ns_1@172.23.105.155:<0.11670.0>:auto_failover:process_action:405]Could not auto-failover node ('ns_1@172.23.105.245'). Number of remaining nodes that are running index service is 0. You need at least 1 nodes.
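
          The check behind those mail_too_small decisions can be sketched roughly as follows (a simplified, hypothetical illustration, not the actual auto-failover code; names are invented):

          # Minimum number of nodes that must keep running each service after a
          # failover; index requires at least 1, matching the message above.
          MIN_REMAINING = {"index": 1}

          def can_auto_failover(down_nodes, nodes_by_service):
              """nodes_by_service maps a service name to the set of nodes running it."""
              down = set(down_nodes)
              for service, nodes in nodes_by_service.items():
                  if nodes & down and len(nodes - down) < MIN_REMAINING.get(service, 1):
                      # Failing over these nodes would leave too few nodes for
                      # the service, so auto-failover is refused, as in the log.
                      return False, service
              return True, None

          # With all three index nodes down, zero index nodes would remain:
          # can_auto_failover({".13", ".15", ".245"}, {"index": {".13", ".15", ".245"}})
          # -> (False, "index")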
          

          Then 2 of the nodes come up almost instantaneously and the third one a little later. The auto-failover logic changes the list of candidates from 3 nodes to 1 and rolls back the counter for the remaining down node by 2 seconds.

          [ns_server:debug,2021-11-30T23:18:34.517-08:00,ns_1@172.23.105.155:<0.11670.0>:auto_failover_logic:process_frame:285]List of candidates changed from [{'ns_1@172.23.100.13',
                                               <<"af23abc3889b98648fc9bd7947ea0c40">>},
                                           {'ns_1@172.23.100.15',
                                               <<"39d292df78ebf2e24c1f926b4cd0a277">>},
                                           {'ns_1@172.23.105.245',
                                               <<"33bb8709d8c28575d3d1842639621669">>}] to [{'ns_1@172.23.105.245',
                                                                                             <<"33bb8709d8c28575d3d1842639621669">>}]. Resetting counter
          [ns_server:debug,2021-11-30T23:18:34.518-08:00,ns_1@172.23.105.155:<0.11670.0>:auto_failover_logic:log_master_activity:143]Transitioned node {'ns_1@172.23.100.13',
                                <<"af23abc3889b98648fc9bd7947ea0c40">>} state failover -> up
          [ns_server:debug,2021-11-30T23:18:34.518-08:00,ns_1@172.23.105.155:<0.11670.0>:auto_failover_logic:log_master_activity:143]Transitioned node {'ns_1@172.23.100.15',
                                <<"39d292df78ebf2e24c1f926b4cd0a277">>} state failover -> up
          [ns_server:debug,2021-11-30T23:18:34.518-08:00,ns_1@172.23.105.155:<0.11670.0>:auto_failover_logic:log_master_activity:145]Incremented down state:
          {node_state,{'ns_1@172.23.105.245',<<"33bb8709d8c28575d3d1842639621669">>},
                      1,failover,true}
          ->{node_state,{'ns_1@172.23.105.245',<<"33bb8709d8c28575d3d1842639621669">>},
                        0,nearly_down,true}
          

          So now all it takes for the new group of down nodes ['ns_1@172.23.105.245'] to be failed over is for this group to survive intact (still down, with no new down nodes) for 2 seconds, which is the grace period.

          That's exactly what happens:

          [ns_server:debug,2021-11-30T23:18:35.518-08:00,ns_1@172.23.105.155:<0.11670.0>:auto_failover_logic:log_master_activity:145]Incremented down state:
          {node_state,{'ns_1@172.23.105.245',<<"33bb8709d8c28575d3d1842639621669">>},
                      0,nearly_down,true}
          ->{node_state,{'ns_1@172.23.105.245',<<"33bb8709d8c28575d3d1842639621669">>},
                        1,nearly_down,true}
           
          [ns_server:debug,2021-11-30T23:18:36.517-08:00,ns_1@172.23.105.155:<0.11670.0>:auto_failover_logic:log_master_activity:145]Incremented down state:
          {node_state,{'ns_1@172.23.105.245',<<"33bb8709d8c28575d3d1842639621669">>},
                      1,nearly_down,true}
          ->{node_state,{'ns_1@172.23.105.245',<<"33bb8709d8c28575d3d1842639621669">>},
                        1,failover,true}
          [ns_server:debug,2021-11-30T23:18:36.518-08:00,ns_1@172.23.105.155:<0.11670.0>:auto_failover_logic:process_frame:324]Decided on following actions: [{failover,
                                          [{'ns_1@172.23.105.245',
                                            <<"33bb8709d8c28575d3d1842639621669">>}]}]
          

          artem Artem Stemkovski made changes -
          Resolution: Not a Bug [ 10200 ]
          Status: Open [ 1 ] -> Resolved [ 5 ]
          artem Artem Stemkovski made changes -
          Assignee: Artem Stemkovski [ artem ] -> Ashwin Govindarajulu [ ashwin.govindarajulu ]

          ritam.sharma Ritam Sharma added a comment - Closing all Duplicates, Not a Bug, Incomplete, Duplicate
          ritam.sharma Ritam Sharma made changes -
          Status: Resolved [ 5 ] -> Closed [ 6 ]

          People

            Assignee: Ashwin Govindarajulu
            Reporter: Ashwin Govindarajulu