Details
Type: Bug
Resolution: Fixed
Priority: Critical
Affects Version: 6.5.0
Triage: Untriaged
Is this a Regression?: Unknown
Description
In MB-37795, heavy I/O caused ns_server to pause for multiple seconds when a page fault was raised. After the pause, a backlog of ticks had accumulated in the auto_failover process's mailbox. Because of the pause, the orchestrator node's state in ns_server_monitor was unhealthy, i.e. the orchestrator hadn't sent a heartbeat to itself in more than two seconds. Each queued tick was then processed in quick succession, and the orchestrator went from up to failed over in a handful of milliseconds:
[ns_server:debug,2020-02-06T12:39:00.468-08:00,ns_1@172.23.97.25:<0.1011.0>:auto_failover_logic:log_master_activity:179]Incremented down state:
{node_state,{'ns_1@172.23.97.25',<<"00123e9cd533f0e5259b44b74ecbdbf8">>},0,up,false}
->{node_state,{'ns_1@172.23.97.25',<<"00123e9cd533f0e5259b44b74ecbdbf8">>},0,half_down,false}
[ns_server:debug,2020-02-06T12:39:00.469-08:00,ns_1@172.23.97.25:<0.1011.0>:auto_failover_logic:log_master_activity:179]Incremented down state:
{node_state,{'ns_1@172.23.97.25',<<"00123e9cd533f0e5259b44b74ecbdbf8">>},0,half_down,false}
->{node_state,{'ns_1@172.23.97.25',<<"00123e9cd533f0e5259b44b74ecbdbf8">>},1,half_down,false}
[ns_server:debug,2020-02-06T12:39:00.471-08:00,ns_1@172.23.97.25:<0.1011.0>:auto_failover_logic:log_master_activity:179]Incremented down state:
{node_state,{'ns_1@172.23.97.25',<<"00123e9cd533f0e5259b44b74ecbdbf8">>},1,half_down,false}
->{node_state,{'ns_1@172.23.97.25',<<"00123e9cd533f0e5259b44b74ecbdbf8">>},2,half_down,false}
[ns_server:debug,2020-02-06T12:39:00.472-08:00,ns_1@172.23.97.25:<0.1011.0>:auto_failover_logic:log_master_activity:179]Incremented down state:
{node_state,{'ns_1@172.23.97.25',<<"00123e9cd533f0e5259b44b74ecbdbf8">>},2,half_down,false}
->{node_state,{'ns_1@172.23.97.25',<<"00123e9cd533f0e5259b44b74ecbdbf8">>},0,nearly_down,false}
[ns_server:debug,2020-02-06T12:39:00.473-08:00,ns_1@172.23.97.25:<0.1011.0>:auto_failover_logic:log_master_activity:179]Incremented down state:
{node_state,{'ns_1@172.23.97.25',<<"00123e9cd533f0e5259b44b74ecbdbf8">>},0,nearly_down,false}
->{node_state,{'ns_1@172.23.97.25',<<"00123e9cd533f0e5259b44b74ecbdbf8">>},1,nearly_down,false}
[stats:warn,2020-02-06T12:39:00.474-08:00,ns_1@172.23.97.25:<0.595.0>:base_stats_collector:latest_tick:64](Collector: stats_collector) Dropped 7 ticks
[ns_server:debug,2020-02-06T12:39:00.475-08:00,ns_1@172.23.97.25:<0.1011.0>:auto_failover_logic:log_master_activity:179]Incremented down state:
{node_state,{'ns_1@172.23.97.25',<<"00123e9cd533f0e5259b44b74ecbdbf8">>},1,nearly_down,false}
->{node_state,{'ns_1@172.23.97.25',<<"00123e9cd533f0e5259b44b74ecbdbf8">>},1,failover,false}
[ns_server:debug,2020-02-06T12:39:00.475-08:00,ns_1@172.23.97.25:<0.1011.0>:auto_failover_logic:process_frame:300]Decided on following actions: [{failover,{'ns_1@172.23.97.25',<<"00123e9cd533f0e5259b44b74ecbdbf8">>}}]
|
It's not clear to me what the right thing to do here is; however, it doesn't seem correct to fail a node over based on a sequence of observations of essentially the same state taken within a few milliseconds. For instance, perhaps failover should only happen after a sequence of observations of an unhealthy state where consecutive observations are at least one second apart.
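
One way to express that idea is a wall-clock guard on the down-state counter. A minimal sketch, assuming a hypothetical helper and a state shaped as {LastObservationMs, DownCounter} (the real node_state record differs):

-module(failover_tick_guard).
-export([maybe_increment/2]).

%% Minimum wall-clock gap between observations that are allowed to
%% advance the down-state counter (the 1 s suggested above).
-define(MIN_OBSERVATION_GAP_MS, 1000).

%% State is {LastObservationMs, DownCounter}; NowMs should come from
%% erlang:monotonic_time(millisecond) at the call site.
maybe_increment({LastMs, _Counter} = State, NowMs)
  when NowMs - LastMs < ?MIN_OBSERVATION_GAP_MS ->
    State;                        %% stale tick from a drained backlog: ignore
maybe_increment({_LastMs, Counter}, NowMs) ->
    {NowMs, Counter + 1}.         %% genuinely new observation: count it

Under a guard like this, the six "Incremented down state" transitions in the log above would collapse into one, since they all fall inside the same one-second window.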