Couchbase Server / MB-37871

Auto-failover should not process accumulated tick messages and immediately auto-failover a node



    Description

      In MB-37795, heavy I/O caused ns_server to pause for multiple seconds when a page fault was raised. After the pause, a collection of ticks had accumulated in the auto_failover process's mailbox. Because of the pause, the state of the orchestrator node in ns_server_monitor was unhealthy - i.e. the orchestrator hadn't sent a heartbeat to itself in more than 2 seconds. Each accumulated tick was then processed quickly in turn, and the orchestrator went from up to failover in a handful of milliseconds:

      [ns_server:debug,2020-02-06T12:39:00.468-08:00,ns_1@172.23.97.25:<0.1011.0>:auto_failover_logic:log_master_activity:179]Incremented down state:
      {node_state,{'ns_1@172.23.97.25',<<"00123e9cd533f0e5259b44b74ecbdbf8">>},
                  0,up,false}
      ->{node_state,{'ns_1@172.23.97.25',<<"00123e9cd533f0e5259b44b74ecbdbf8">>},
                    0,half_down,false}
      [ns_server:debug,2020-02-06T12:39:00.469-08:00,ns_1@172.23.97.25:<0.1011.0>:auto_failover_logic:log_master_activity:179]Incremented down state:
      {node_state,{'ns_1@172.23.97.25',<<"00123e9cd533f0e5259b44b74ecbdbf8">>},
                  0,half_down,false}
      ->{node_state,{'ns_1@172.23.97.25',<<"00123e9cd533f0e5259b44b74ecbdbf8">>},
                    1,half_down,false}
      [ns_server:debug,2020-02-06T12:39:00.471-08:00,ns_1@172.23.97.25:<0.1011.0>:auto_failover_logic:log_master_activity:179]Incremented down state:
      {node_state,{'ns_1@172.23.97.25',<<"00123e9cd533f0e5259b44b74ecbdbf8">>},
                  1,half_down,false}
      ->{node_state,{'ns_1@172.23.97.25',<<"00123e9cd533f0e5259b44b74ecbdbf8">>},
                    2,half_down,false}
      [ns_server:debug,2020-02-06T12:39:00.472-08:00,ns_1@172.23.97.25:<0.1011.0>:auto_failover_logic:log_master_activity:179]Incremented down state:
      {node_state,{'ns_1@172.23.97.25',<<"00123e9cd533f0e5259b44b74ecbdbf8">>},
                  2,half_down,false}
      ->{node_state,{'ns_1@172.23.97.25',<<"00123e9cd533f0e5259b44b74ecbdbf8">>},
                    0,nearly_down,false}
      [ns_server:debug,2020-02-06T12:39:00.473-08:00,ns_1@172.23.97.25:<0.1011.0>:auto_failover_logic:log_master_activity:179]Incremented down state:
      {node_state,{'ns_1@172.23.97.25',<<"00123e9cd533f0e5259b44b74ecbdbf8">>},
                  0,nearly_down,false}
      ->{node_state,{'ns_1@172.23.97.25',<<"00123e9cd533f0e5259b44b74ecbdbf8">>},
                    1,nearly_down,false}
      [stats:warn,2020-02-06T12:39:00.474-08:00,ns_1@172.23.97.25:<0.595.0>:base_stats_collector:latest_tick:64](Collector: stats_collector) Dropped 7 ticks
      [ns_server:debug,2020-02-06T12:39:00.475-08:00,ns_1@172.23.97.25:<0.1011.0>:auto_failover_logic:log_master_activity:179]Incremented down state:
      {node_state,{'ns_1@172.23.97.25',<<"00123e9cd533f0e5259b44b74ecbdbf8">>},
                  1,nearly_down,false}
      ->{node_state,{'ns_1@172.23.97.25',<<"00123e9cd533f0e5259b44b74ecbdbf8">>},
                    1,failover,false}
      [ns_server:debug,2020-02-06T12:39:00.475-08:00,ns_1@172.23.97.25:<0.1011.0>:auto_failover_logic:process_frame:300]Decided on following actions: [{failover,
                                         {'ns_1@172.23.97.25',
                                             <<"00123e9cd533f0e5259b44b74ecbdbf8">>}}]
      

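      The mechanism is straightforward to reproduce in isolation. The toy module below (an illustrative sketch, not ns_server code; the module name tick_burst and the 1-second/5-second intervals are assumptions) sends itself a tick once a second via a timer, then blocks for several seconds. The ticks that arrive while it is blocked queue up in its mailbox and are then handled back-to-back, with essentially no wall-clock time between them.

      %% tick_burst.erl - toy demonstration of tick accumulation; not ns_server code.
      -module(tick_burst).
      -export([start/0]).

      start() ->
          Pid = spawn(fun() -> loop(erlang:monotonic_time(millisecond)) end),
          %% Deliver a tick once a second, as a periodic timer would.
          {ok, _TRef} = timer:send_interval(1000, Pid, tick),
          %% Simulate a multi-second stall of the receiving process; ticks keep
          %% arriving in its mailbox while it is blocked.
          Pid ! {pause, 5000},
          ok.

      loop(LastHandled) ->
          receive
              {pause, Ms} ->
                  timer:sleep(Ms),        %% stand-in for the page-fault pause
                  loop(LastHandled);
              tick ->
                  Now = erlang:monotonic_time(millisecond),
                  io:format("tick handled ~p ms after the previous one~n",
                            [Now - LastHandled]),
                  loop(Now)
          end.

      Calling tick_burst:start() prints a gap of roughly 5000 ms for the first tick handled after the stall and then gaps of close to 0 ms for the ticks that had queued up - the same pattern as the log above, where six down-state increments land within roughly 7 ms.
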
      It's not clear to me what the right thing to do here is; however, it doesn't seem correct to fail a node over based on a sequence of observations of essentially the same state taken just a few milliseconds apart. For instance, perhaps failover should only happen based on a sequence of observations of an unhealthy state in which consecutive observations are at least 1 second apart.
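
      One possible shape for such a rule is sketched below. This is a hypothetical helper, not the existing auto_failover or auto_failover_logic API; the module name tick_guard, the maybe_count/1 function and the 1000 ms constant are all assumptions. The idea is to record the monotonic time of the last tick that was actually counted and to skip any tick that arrives sooner than the minimum gap, so a burst of ticks drained from the mailbox after a pause collapses into a single observation.

      %% tick_guard.erl - hypothetical guard; not the actual auto_failover code.
      -module(tick_guard).
      -export([new/0, maybe_count/1]).

      %% Assumed minimum wall-clock spacing between observations that are
      %% allowed to advance the down-state counter.
      -define(MIN_OBSERVATION_GAP_MS, 1000).

      new() ->
          #{last_counted => undefined}.

      %% Returns {count, NewState} if the tick should advance the down-state
      %% counter, or {skip, State} if it arrived too soon after the previously
      %% counted tick (e.g. because it had been queued up during a pause).
      maybe_count(#{last_counted := Last} = State) ->
          Now = erlang:monotonic_time(millisecond),
          if
              Last =:= undefined; Now - Last >= ?MIN_OBSERVATION_GAP_MS ->
                  {count, State#{last_counted := Now}};
              true ->
                  {skip, State}
          end.

      With a guard like this, the six back-to-back increments in the log above would count as a single observation, and a node would still need several genuinely spaced-out unhealthy observations before being failed over. An alternative with a similar effect would be to timestamp each tick when it is sent and discard ticks that are already stale when they are received.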


          People

            steve.watanabe Steve Watanabe
            dfinlay Dave Finlay
