Couchbase Server / MB-7168

[Doc'd 2.2.0] failover of node that's completely down is still not quick (was: Rebalance exited with reason {not_all_nodes_are_ready_yet after failover node)

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Duplicate
    • Affects Version/s: 2.0, 2.0.1, 2.1.0
    • Fix Version/s: 3.0
    • Component/s: ns_server
    • Security Level: Public
    • Labels:
      None
    • Flagged:
      Release Note
    • Sprint:
      12/Aug - 30/Aug

      Description

      version=2.0.0-1947-rel
      http://qa.hq.northscale.net/job/centos-64-2.0-failover-tests/448/consoleFull
      /testrunner -i resources/jenkins/centos-64-5node-failover.ini get-logs=True,GROUP=BAT -t failovertests.FailoverTests.test_failover_firewall,replica=1,keys_count=100000,dgm_run=True,GROUP=BAT

      steps:
      1. 3 nodes in cluster: 10.1.3.114, 10.1.3.118, 10.1.3.116
      2. Enabled firewall on 10.1.3.118: /sbin/iptables -A INPUT -p tcp -i eth0 --dport 1000:60000 -j REJECT
      3. Failed over the unreachable node:
      [2012-11-12 11:28:37,102] - [failovertests:260] INFO - node 10.1.3.118:8091 is 'unhealthy' as expected
      [2012-11-12 11:29:07,544] - [rest_client:849] INFO - fail_over successful
      [2012-11-12 11:29:07,545] - [failovertests:278] INFO - failed over node : ns_1@10.1.3.118
      [2012-11-12 11:29:07,545] - [failovertests:292] INFO - 10 seconds sleep after failover before invoking rebalance...
      4. Started rebalance:
      [2012-11-12 11:29:17,545] - [rest_client:883] INFO - rebalance params : password=password&ejectedNodes=ns_1%4010.1.3.118&user=Administrator&knownNodes=ns_1%4010.1.3.114%2Cns_1%4010.1.3.118%2Cns_1%4010.1.3.116
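The form-encoded rebalance params in step 4 can be reconstructed in a few lines of Python. This is only a sketch of how such a body is built; the node names and credentials are taken from the log above, and the target endpoint (conventionally POST /controller/rebalance in the Couchbase REST API) is not shown in the log, so treat it as an assumption:

```python
from urllib.parse import urlencode

# Node names and credentials as they appear in the log above.
known_nodes = ["ns_1@10.1.3.114", "ns_1@10.1.3.118", "ns_1@10.1.3.116"]
ejected_nodes = ["ns_1@10.1.3.118"]  # the failed-over node to eject

# urlencode turns '@' into %40 and ',' into %2C, which matches the
# "rebalance params" line logged in step 4.
body = urlencode({
    "password": "password",
    "ejectedNodes": ",".join(ejected_nodes),
    "user": "Administrator",
    "knownNodes": ",".join(known_nodes),
})
print(body)
```

Note that knownNodes still includes the failed-over node; ejecting it is exactly what the rebalance is asked to do.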

      result:
      [2012-11-12 11:29:17,562] - [rest_client:890] INFO - rebalance operation started
      [2012-11-12 11:29:17,569] - [rest_client:986] INFO - rebalance percentage : 0 %
      [2012-11-12 11:29:19,573] - [rest_client:986] INFO - rebalance percentage : 0.0 %
      [... identical "rebalance percentage : 0.0 %" entries logged every ~2 seconds, 11:29:21 through 11:30:13, omitted ...]
      [2012-11-12 11:30:15,716] - [rest_client:986] INFO - rebalance percentage : 0.0 %
      [2012-11-12 11:30:17,721] - [rest_client:971] ERROR -

      {u'status': u'none', u'errorMessage': u'Rebalance failed. See logs for detailed reason. You can try rebalance again.'}

      - rebalance failed

      [rebalance:info,2012-11-12T11:49:03.581,ns_1@10.1.3.114:<0.20240.1>:ns_rebalancer:rebalance:258]Waiting for bucket "default" to be ready on ['ns_1@10.1.3.114',
      'ns_1@10.1.3.116']
      [ns_server:info,2012-11-12T11:49:09.386,ns_1@10.1.3.114:<0.769.0>:ns_orchestrator:handle_info:282]Skipping janitor in state rebalancing: {rebalancing_state,<0.20215.1>,
      {dict,3,16,16,8,80,48,
      {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},
      {{[],[],[],[],
      [['ns_1@10.1.3.114'|0.0]],
      [],
      [['ns_1@10.1.3.116'|0.0]],
      [],
      [['ns_1@10.1.3.118'|0.0]],
      [],[],[],[],[],[],[]}}},
      ['ns_1@10.1.3.114','ns_1@10.1.3.116'],
      [],
      ['ns_1@10.1.3.118']}
      [ns_server:info,2012-11-12T11:49:19.386,ns_1@10.1.3.114:<0.769.0>:ns_orchestrator:handle_info:282]Skipping janitor in state rebalancing: {rebalancing_state,<0.20215.1>, ...} (same rebalancing_state as above)
      [ns_server:info,2012-11-12T11:49:23.120,ns_1@10.1.3.114:<0.20319.1>:compaction_daemon:try_to_cleanup_indexes:439]Cleaning up indexes for bucket `default`
      [ns_server:info,2012-11-12T11:49:23.121,ns_1@10.1.3.114:<0.20319.1>:compaction_daemon:spawn_bucket_compactor:404]Compacting bucket default with config:
      [{database_fragmentation_threshold,{30,undefined}},
      {view_fragmentation_threshold,{30,undefined}}]
      [ns_server:info,2012-11-12T11:49:24.348,ns_1@10.1.3.114:ns_config_rep<0.358.0>:ns_config_rep:do_pull:341]Pulling config from: 'ns_1@10.1.3.116'

      [ns_server:info,2012-11-12T11:49:29.386,ns_1@10.1.3.114:<0.769.0>:ns_orchestrator:handle_info:282]Skipping janitor in state rebalancing: {rebalancing_state,<0.20215.1>, ...} (same rebalancing_state as above)
      [ns_server:warn,2012-11-12T11:49:29.546,ns_1@10.1.3.114:capi_set_view_manager-default<0.15508.1>:capi_set_view_manager:handle_info:345]Remote server node {'capi_ddoc_replication_srv-default','ns_1@10.1.3.118'} process down: noconnection
      [ns_server:warn,2012-11-12T11:49:29.546,ns_1@10.1.3.114:xdc_rdoc_replication_srv<0.470.0>:xdc_rdoc_replication_srv:handle_info:128]Remote server node {xdc_rdoc_replication_srv,'ns_1@10.1.3.118'} process down: noconnection
      [user:warn,2012-11-12T11:49:29.546,ns_1@10.1.3.114:ns_node_disco<0.351.0>:ns_node_disco:handle_info:168]Node 'ns_1@10.1.3.114' saw that node 'ns_1@10.1.3.118' went down.
      [error_logger:error,2012-11-12T11:49:29.546,ns_1@10.1.3.114:error_logger<0.5.0>:ale_error_logger_handler:log_msg:76]** Node 'ns_1@10.1.3.118' not responding **
      ** Removing (timedout) connection **

      [ns_server:info,2012-11-12T11:49:30.536,ns_1@10.1.3.114:ns_config_rep<0.358.0>:ns_config_rep:do_pull:341]Pulling config from: 'ns_1@10.1.3.116'

      [ns_server:info,2012-11-12T11:49:39.386,ns_1@10.1.3.114:<0.769.0>:ns_orchestrator:handle_info:282]Skipping janitor in state rebalancing: {rebalancing_state,<0.20215.1>, ...} (same rebalancing_state as above)
      [ns_server:info,2012-11-12T11:49:49.386,ns_1@10.1.3.114:<0.769.0>:ns_orchestrator:handle_info:282]Skipping janitor in state rebalancing: {rebalancing_state,<0.20215.1>, ...} (same rebalancing_state as above)
      [ns_server:info,2012-11-12T11:49:53.135,ns_1@10.1.3.114:<0.20445.1>:compaction_daemon:try_to_cleanup_indexes:439]Cleaning up indexes for bucket `default`
      [ns_server:info,2012-11-12T11:49:53.136,ns_1@10.1.3.114:<0.20445.1>:compaction_daemon:spawn_bucket_compactor:404]Compacting bucket default with config:
      [{database_fragmentation_threshold,{30,undefined}},
      {view_fragmentation_threshold,{30,undefined}}]
      [ns_server:info,2012-11-12T11:49:59.386,ns_1@10.1.3.114:<0.769.0>:ns_orchestrator:handle_info:282]Skipping janitor in state rebalancing: {rebalancing_state,<0.20215.1>, ...} (same rebalancing_state as above)
      [user:info,2012-11-12T11:50:03.582,ns_1@10.1.3.114:<0.769.0>:ns_orchestrator:handle_info:319]Rebalance exited with reason
      {not_all_nodes_are_ready_yet, ['ns_1@10.1.3.114','ns_1@10.1.3.116']}

        Activity

        alkondratenko Aleksey Kondratenko (Inactive) added a comment -

        Removing this from our backlog. Given that this was moved out of 2.2.0 and some UPR work is planned for 3.0 (which will change a lot in this area), there's nothing we can do at this time; we'll wait until the UPR work gets clearer.

        kzeller kzeller added a comment -

        Added as Known issue for 2.2 in RN.

        • A cluster rebalance may exit with the error {not_all_nodes_are_ready_yet} if you perform the rebalance immediately after failing over a node in the cluster. You may need to wait 60 seconds after the node failover before you attempt the cluster rebalance.

        This is because the failover REST API is a synchronous operation with a timeout. If the failover does not complete before the timeout, the operation internally switches into an asynchronous one: it returns immediately and re-attempts the failover in the background, which causes the subsequent rebalance to fail because the failover operation is still running.
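Rather than a fixed 60-second sleep, a client can poll until the failover has actually completed before starting the rebalance. Below is a minimal sketch of such a polling loop; the status check is left as an abstract callable (a real implementation would presumably GET /pools/default and inspect the failed-over node's clusterMembership), so the names here are illustrative, not part of any Couchbase SDK:

```python
import time

def wait_for_failover(node_is_failed_over, timeout=60.0, interval=2.0):
    """Poll until the node reports as failed over, or give up.

    node_is_failed_over: zero-argument callable (illustrative; e.g. one
    that GETs /pools/default and checks the node's clusterMembership).
    Returns True once the failover completed, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if node_is_failed_over():
            return True
        time.sleep(interval)
    return False

# Stubbed example: the "cluster" reports completion on the third poll.
polls = iter([False, False, True])
print(wait_for_failover(lambda: next(polls), timeout=5.0, interval=0.01))
```

Only after this returns True would the client invoke the rebalance REST call; on False it should report a failover timeout instead of attempting the rebalance.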

        Issues : MB-7168(http://www.couchbase.com/issues/browse/MB-7168)

        alkondratenko Aleksey Kondratenko (Inactive) added a comment -

        MB-8039 fixed the problem for all nodes except the master node.

        Fixing it for the master node is possible work for 3.0, but it will take at least a week.

        alkondratenko Aleksey Kondratenko (Inactive) added a comment -

        MB-9321
        maria Maria McDuff (Inactive) added a comment -

        Closing as a duplicate.


          People

          • Assignee:
            alkondratenko Aleksey Kondratenko (Inactive)
            Reporter:
            andreibaranouski Andrei Baranouski
          • Votes: 0
          • Watchers: 9
