Details
Type: Bug
Resolution: Fixed
Priority: Critical
Affects Version: 6.5.0
Triage: Untriaged
Is this a Regression?: Unknown
Description
In MB-37795, heavy I/O caused ns_server to pause for multiple seconds when a page fault was raised. After the pause, a backlog of ticks had accumulated in the auto_failover process's mailbox. Because of the pause, the orchestrator node's state in ns_server_monitor was unhealthy, i.e. the orchestrator hadn't sent a heartbeat to itself in more than two seconds. Each queued tick was then processed in quick succession, and the orchestrator went from up to failed over in a handful of milliseconds:
[ns_server:debug,2020-02-06T12:39:00.468-08:00,ns_1@172.23.97.25:<0.1011.0>:auto_failover_logic:log_master_activity:179]Incremented down state:
{node_state,{'ns_1@172.23.97.25',<<"00123e9cd533f0e5259b44b74ecbdbf8">>},0,up,false}
->{node_state,{'ns_1@172.23.97.25',<<"00123e9cd533f0e5259b44b74ecbdbf8">>},0,half_down,false}
[ns_server:debug,2020-02-06T12:39:00.469-08:00,ns_1@172.23.97.25:<0.1011.0>:auto_failover_logic:log_master_activity:179]Incremented down state:
{node_state,{'ns_1@172.23.97.25',<<"00123e9cd533f0e5259b44b74ecbdbf8">>},0,half_down,false}
->{node_state,{'ns_1@172.23.97.25',<<"00123e9cd533f0e5259b44b74ecbdbf8">>},1,half_down,false}
[ns_server:debug,2020-02-06T12:39:00.471-08:00,ns_1@172.23.97.25:<0.1011.0>:auto_failover_logic:log_master_activity:179]Incremented down state:
{node_state,{'ns_1@172.23.97.25',<<"00123e9cd533f0e5259b44b74ecbdbf8">>},1,half_down,false}
->{node_state,{'ns_1@172.23.97.25',<<"00123e9cd533f0e5259b44b74ecbdbf8">>},2,half_down,false}
[ns_server:debug,2020-02-06T12:39:00.472-08:00,ns_1@172.23.97.25:<0.1011.0>:auto_failover_logic:log_master_activity:179]Incremented down state:
{node_state,{'ns_1@172.23.97.25',<<"00123e9cd533f0e5259b44b74ecbdbf8">>},2,half_down,false}
->{node_state,{'ns_1@172.23.97.25',<<"00123e9cd533f0e5259b44b74ecbdbf8">>},0,nearly_down,false}
[ns_server:debug,2020-02-06T12:39:00.473-08:00,ns_1@172.23.97.25:<0.1011.0>:auto_failover_logic:log_master_activity:179]Incremented down state:
{node_state,{'ns_1@172.23.97.25',<<"00123e9cd533f0e5259b44b74ecbdbf8">>},0,nearly_down,false}
->{node_state,{'ns_1@172.23.97.25',<<"00123e9cd533f0e5259b44b74ecbdbf8">>},1,nearly_down,false}
[stats:warn,2020-02-06T12:39:00.474-08:00,ns_1@172.23.97.25:<0.595.0>:base_stats_collector:latest_tick:64](Collector: stats_collector) Dropped 7 ticks
[ns_server:debug,2020-02-06T12:39:00.475-08:00,ns_1@172.23.97.25:<0.1011.0>:auto_failover_logic:log_master_activity:179]Incremented down state:
{node_state,{'ns_1@172.23.97.25',<<"00123e9cd533f0e5259b44b74ecbdbf8">>},1,nearly_down,false}
->{node_state,{'ns_1@172.23.97.25',<<"00123e9cd533f0e5259b44b74ecbdbf8">>},1,failover,false}
[ns_server:debug,2020-02-06T12:39:00.475-08:00,ns_1@172.23.97.25:<0.1011.0>:auto_failover_logic:process_frame:300]Decided on following actions: [{failover,{'ns_1@172.23.97.25',<<"00123e9cd533f0e5259b44b74ecbdbf8">>}}]
|
It's not clear to me what the right thing to do here is; however, it doesn't seem correct to fail a node over based on a sequence of observations of essentially the same state taken within a few milliseconds. For instance, perhaps failover should only happen after a sequence of observations of an unhealthy state where consecutive observations are at least one second apart.
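
One way to express that idea is a wall-clock guard on the down-state counter. A minimal sketch, assuming a hypothetical helper and a state shaped as {LastObservationMs, DownCounter} (the real node_state record differs):

-module(failover_tick_guard).
-export([maybe_increment/2]).

%% Minimum wall-clock gap between observations that are allowed to
%% advance the down-state counter (the 1 s suggested above).
-define(MIN_OBSERVATION_GAP_MS, 1000).

%% State is {LastObservationMs, DownCounter}; NowMs should come from
%% erlang:monotonic_time(millisecond) at the call site.
maybe_increment({LastMs, _Counter} = State, NowMs)
  when NowMs - LastMs < ?MIN_OBSERVATION_GAP_MS ->
    State;                        %% stale tick from a drained backlog: ignore
maybe_increment({_LastMs, Counter}, NowMs) ->
    {NowMs, Counter + 1}.         %% genuinely new observation: count it

Under a guard like this, the six "Incremented down state" transitions in the log above would collapse into one, since they all fall inside the same one-second window.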