  Couchbase Server / MB-7168

[Doc'd 2.2.0] failover of node that's completely down is still not quick (was: Rebalance exited with reason {not_all_nodes_are_ready_yet after failover node)

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Duplicate
    • Affects Version/s: 2.0, 2.0.1, 2.1.0
    • Fix Version/s: 3.0
    • Component/s: ns_server
    • Security Level: Public
    • Labels:
      None
    • Flagged:
      Release Note
    • Sprint:
      12/Aug - 30/Aug

      Description

      version=2.0.0-1947-rel
      http://qa.hq.northscale.net/job/centos-64-2.0-failover-tests/448/consoleFull
      /testrunner -i resources/jenkins/centos-64-5node-failover.ini get-logs=True,GROUP=BAT -t failovertests.FailoverTests.test_failover_firewall,replica=1,keys_count=100000,dgm_run=True,GROUP=BAT

      steps:
      1. 3 nodes in cluster: 10.1.3.114, 10.1.3.118, 10.1.3.116
      2. Enabled firewall on 10.1.3.118: /sbin/iptables -A INPUT -p tcp -i eth0 --dport 1000:60000 -j REJECT
      3.
      [2012-11-12 11:28:37,102] - [failovertests:260] INFO - node 10.1.3.118:8091 is 'unhealthy' as expected
      [2012-11-12 11:29:07,544] - [rest_client:849] INFO - fail_over successful
      [2012-11-12 11:29:07,545] - [failovertests:278] INFO - failed over node : ns_1@10.1.3.118
      [2012-11-12 11:29:07,545] - [failovertests:292] INFO - 10 seconds sleep after failover before invoking rebalance...
      4. [2012-11-12 11:29:17,545] - [rest_client:883] INFO - rebalance params : password=password&ejectedNodes=ns_1%4010.1.3.118&user=Administrator&knownNodes=ns_1%4010.1.3.114%2Cns_1%4010.1.3.118%2Cns_1%4010.1.3.116
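
      For reference, the rebalance call in step 4 is a plain REST request. A minimal sketch of what the test driver does (Python with the requests library; the endpoint and params come straight from the logged rebalance params above, the helper name is illustrative):

          import requests

          BASE = "http://10.1.3.114:8091"
          AUTH = ("Administrator", "password")

          def start_rebalance(known_nodes, ejected_nodes):
              # POST /controller/rebalance with comma-separated otpNode names,
              # mirroring the rebalance params logged above.
              resp = requests.post(
                  BASE + "/controller/rebalance",
                  auth=AUTH,
                  data={
                      "knownNodes": ",".join(known_nodes),
                      "ejectedNodes": ",".join(ejected_nodes),
                  },
              )
              resp.raise_for_status()

          start_rebalance(
              ["ns_1@10.1.3.114", "ns_1@10.1.3.118", "ns_1@10.1.3.116"],
              ["ns_1@10.1.3.118"],
          )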

      result:
      [2012-11-12 11:29:17,562] - [rest_client:890] INFO - rebalance operation started
      [2012-11-12 11:29:17,569] - [rest_client:986] INFO - rebalance percentage : 0 %
      [2012-11-12 11:29:19,573] - [rest_client:986] INFO - rebalance percentage : 0.0 %
      [2012-11-12 11:29:21,577] - [rest_client:986] INFO - rebalance percentage : 0.0 %
      [2012-11-12 11:29:23,584] - [rest_client:986] INFO - rebalance percentage : 0.0 %
      [2012-11-12 11:29:25,589] - [rest_client:986] INFO - rebalance percentage : 0.0 %
      [2012-11-12 11:29:27,593] - [rest_client:986] INFO - rebalance percentage : 0.0 %
      [2012-11-12 11:29:29,599] - [rest_client:986] INFO - rebalance percentage : 0.0 %
      [2012-11-12 11:29:31,603] - [rest_client:986] INFO - rebalance percentage : 0.0 %
      [2012-11-12 11:29:33,607] - [rest_client:986] INFO - rebalance percentage : 0.0 %
      [2012-11-12 11:29:35,612] - [rest_client:986] INFO - rebalance percentage : 0.0 %
      [2012-11-12 11:29:37,616] - [rest_client:986] INFO - rebalance percentage : 0.0 %
      [2012-11-12 11:29:39,621] - [rest_client:986] INFO - rebalance percentage : 0.0 %
      [2012-11-12 11:29:41,626] - [rest_client:986] INFO - rebalance percentage : 0.0 %
      [2012-11-12 11:29:43,630] - [rest_client:986] INFO - rebalance percentage : 0.0 %
      [2012-11-12 11:29:45,635] - [rest_client:986] INFO - rebalance percentage : 0.0 %
      [2012-11-12 11:29:47,640] - [rest_client:986] INFO - rebalance percentage : 0.0 %
      [2012-11-12 11:29:49,644] - [rest_client:986] INFO - rebalance percentage : 0.0 %
      [2012-11-12 11:29:51,648] - [rest_client:986] INFO - rebalance percentage : 0.0 %
      [2012-11-12 11:29:53,652] - [rest_client:986] INFO - rebalance percentage : 0.0 %
      [2012-11-12 11:29:55,657] - [rest_client:986] INFO - rebalance percentage : 0.0 %
      [2012-11-12 11:29:57,662] - [rest_client:986] INFO - rebalance percentage : 0.0 %
      [2012-11-12 11:29:59,667] - [rest_client:986] INFO - rebalance percentage : 0.0 %
      [2012-11-12 11:30:01,671] - [rest_client:986] INFO - rebalance percentage : 0.0 %
      [2012-11-12 11:30:03,676] - [rest_client:986] INFO - rebalance percentage : 0.0 %
      [2012-11-12 11:30:05,682] - [rest_client:986] INFO - rebalance percentage : 0.0 %
      [2012-11-12 11:30:07,698] - [rest_client:986] INFO - rebalance percentage : 0.0 %
      [2012-11-12 11:30:09,703] - [rest_client:986] INFO - rebalance percentage : 0.0 %
      [2012-11-12 11:30:11,708] - [rest_client:986] INFO - rebalance percentage : 0.0 %
      [2012-11-12 11:30:13,712] - [rest_client:986] INFO - rebalance percentage : 0.0 %
      [2012-11-12 11:30:15,716] - [rest_client:986] INFO - rebalance percentage : 0.0 %
      [2012-11-12 11:30:17,721] - [rest_client:971] ERROR -

      {u'status': u'none', u'errorMessage': u'Rebalance failed. See logs for detailed reason. You can try rebalance again.'}

      - rebalance failed
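
      The percentage lines above come from polling the rebalance status endpoint. A minimal sketch of that loop (assuming the standard /pools/default/rebalanceProgress endpoint; names are illustrative):

          import time
          import requests

          BASE = "http://10.1.3.114:8091"
          AUTH = ("Administrator", "password")

          def wait_for_rebalance(poll_interval=2.0):
              # Poll until the status leaves "running". A failed rebalance comes
              # back as status "none" with an errorMessage, as in the log above.
              while True:
                  status = requests.get(
                      BASE + "/pools/default/rebalanceProgress", auth=AUTH
                  ).json()
                  if status.get("status") != "running":
                      return status
                  time.sleep(poll_interval)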

      [rebalance:info,2012-11-12T11:49:03.581,ns_1@10.1.3.114:<0.20240.1>:ns_rebalancer:rebalance:258]Waiting for bucket "default" to be ready on ['ns_1@10.1.3.114',
      'ns_1@10.1.3.116']
      [ns_server:info,2012-11-12T11:49:09.386,ns_1@10.1.3.114:<0.769.0>:ns_orchestrator:handle_info:282]Skipping janitor in state rebalancing: {rebalancing_state,<0.20215.1>,
      {dict,3,16,16,8,80,48,
      {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},
      {{[],[],[],[],
      [['ns_1@10.1.3.114'|0.0]],
      [],
      [['ns_1@10.1.3.116'|0.0]],
      [],
      [['ns_1@10.1.3.118'|0.0]],
      [],[],[],[],[],[],[]}}},
      ['ns_1@10.1.3.114','ns_1@10.1.3.116'],
      [],
      ['ns_1@10.1.3.118']}
      [ns_server:info,2012-11-12T11:49:19.386,ns_1@10.1.3.114:<0.769.0>:ns_orchestrator:handle_info:282]Skipping janitor in state rebalancing: {rebalancing_state,<0.20215.1>,
      {dict,3,16,16,8,80,48,
      {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},
      {{[],[],[],[],
      [['ns_1@10.1.3.114'|0.0]],
      [],
      [['ns_1@10.1.3.116'|0.0]],
      [],
      [['ns_1@10.1.3.118'|0.0]],
      [],[],[],[],[],[],[]}}},
      ['ns_1@10.1.3.114','ns_1@10.1.3.116'],
      [],
      ['ns_1@10.1.3.118']}
      [ns_server:info,2012-11-12T11:49:23.120,ns_1@10.1.3.114:<0.20319.1>:compaction_daemon:try_to_cleanup_indexes:439]Cleaning up indexes for bucket `default`
      [ns_server:info,2012-11-12T11:49:23.121,ns_1@10.1.3.114:<0.20319.1>:compaction_daemon:spawn_bucket_compactor:404]Compacting bucket default with config:
      [{database_fragmentation_threshold,{30,undefined}},
      {view_fragmentation_threshold,{30,undefined}}]
      [ns_server:info,2012-11-12T11:49:24.348,ns_1@10.1.3.114:ns_config_rep<0.358.0>:ns_config_rep:do_pull:341]Pulling config from: 'ns_1@10.1.3.116'

      [ns_server:info,2012-11-12T11:49:29.386,ns_1@10.1.3.114:<0.769.0>:ns_orchestrator:handle_info:282]Skipping janitor in state rebalancing: {rebalancing_state,<0.20215.1>,
      {dict,3,16,16,8,80,48,
      {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},
      {{[],[],[],[],
      [['ns_1@10.1.3.114'|0.0]],
      [],
      [['ns_1@10.1.3.116'|0.0]],
      [],
      [['ns_1@10.1.3.118'|0.0]],
      [],[],[],[],[],[],[]}}},
      ['ns_1@10.1.3.114','ns_1@10.1.3.116'],
      [],
      ['ns_1@10.1.3.118']}
      [ns_server:warn,2012-11-12T11:49:29.546,ns_1@10.1.3.114:capi_set_view_manager-default<0.15508.1>:capi_set_view_manager:handle_info:345]Remote server node {'capi_ddoc_replication_srv-default','ns_1@10.1.3.118'} process down: noconnection
      [ns_server:warn,2012-11-12T11:49:29.546,ns_1@10.1.3.114:xdc_rdoc_replication_srv<0.470.0>:xdc_rdoc_replication_srv:handle_info:128]Remote server node {xdc_rdoc_replication_srv,'ns_1@10.1.3.118'} process down: noconnection
      [user:warn,2012-11-12T11:49:29.546,ns_1@10.1.3.114:ns_node_disco<0.351.0>:ns_node_disco:handle_info:168]Node 'ns_1@10.1.3.114' saw that node 'ns_1@10.1.3.118' went down.
      [error_logger:error,2012-11-12T11:49:29.546,ns_1@10.1.3.114:error_logger<0.5.0>:ale_error_logger_handler:log_msg:76]** Node 'ns_1@10.1.3.118' not responding **
      ** Removing (timedout) connection **

      [ns_server:info,2012-11-12T11:49:30.536,ns_1@10.1.3.114:ns_config_rep<0.358.0>:ns_config_rep:do_pull:341]Pulling config from: 'ns_1@10.1.3.116'

      [ns_server:info,2012-11-12T11:49:39.386,ns_1@10.1.3.114:<0.769.0>:ns_orchestrator:handle_info:282]Skipping janitor in state rebalancing: {rebalancing_state,<0.20215.1>,
      {dict,3,16,16,8,80,48,
      {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},
      {{[],[],[],[],
      [['ns_1@10.1.3.114'|0.0]],
      [],
      [['ns_1@10.1.3.116'|0.0]],
      [],
      [['ns_1@10.1.3.118'|0.0]],
      [],[],[],[],[],[],[]}}},
      ['ns_1@10.1.3.114','ns_1@10.1.3.116'],
      [],
      ['ns_1@10.1.3.118']}
      [ns_server:info,2012-11-12T11:49:49.386,ns_1@10.1.3.114:<0.769.0>:ns_orchestrator:handle_info:282]Skipping janitor in state rebalancing: {rebalancing_state,<0.20215.1>,
      {dict,3,16,16,8,80,48,
      {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},
      {{[],[],[],[],
      [['ns_1@10.1.3.114'|0.0]],
      [],
      [['ns_1@10.1.3.116'|0.0]],
      [],
      [['ns_1@10.1.3.118'|0.0]],
      [],[],[],[],[],[],[]}}},
      ['ns_1@10.1.3.114','ns_1@10.1.3.116'],
      [],
      ['ns_1@10.1.3.118']}
      [ns_server:info,2012-11-12T11:49:53.135,ns_1@10.1.3.114:<0.20445.1>:compaction_daemon:try_to_cleanup_indexes:439]Cleaning up indexes for bucket `default`
      [ns_server:info,2012-11-12T11:49:53.136,ns_1@10.1.3.114:<0.20445.1>:compaction_daemon:spawn_bucket_compactor:404]Compacting bucket default with config:
      [{database_fragmentation_threshold,{30,undefined}},
      {view_fragmentation_threshold,{30,undefined}}]
      [ns_server:info,2012-11-12T11:49:59.386,ns_1@10.1.3.114:<0.769.0>:ns_orchestrator:handle_info:282]Skipping janitor in state rebalancing: {rebalancing_state,<0.20215.1>,
      {dict,3,16,16,8,80,48,
      {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},
      {{[],[],[],[],
      [['ns_1@10.1.3.114'|0.0]],
      [],
      [['ns_1@10.1.3.116'|0.0]],
      [],
      [['ns_1@10.1.3.118'|0.0]],
      [],[],[],[],[],[],[]}}},
      ['ns_1@10.1.3.114','ns_1@10.1.3.116'],
      [],
      ['ns_1@10.1.3.118']}
      [user:info,2012-11-12T11:50:03.582,ns_1@10.1.3.114:<0.769.0>:ns_orchestrator:handle_info:319]Rebalance exited with reason
      {not_all_nodes_are_ready_yet, ['ns_1@10.1.3.114','ns_1@10.1.3.116']}

        Activity

        andreibaranouski Andrei Baranouski created issue -
        andreibaranouski Andrei Baranouski added a comment -

        almost all test fail due to rebalance failure after failover nodes on version=2.0.0-1949-rel
        http://qa.hq.northscale.net/job/centos-64-2.0-failover-tests/449/consoleFull

        Show
        andreibaranouski Andrei Baranouski added a comment - almost all test fail due to rebalance failure after failover nodes on version=2.0.0-1949-rel http://qa.hq.northscale.net/job/centos-64-2.0-failover-tests/449/consoleFull
        steve Steve Yen made changes -
        Field Original Value New Value
        Fix Version/s 2.0 [ 10114 ]
        Hide
        farshid Farshid Ghods (Inactive) added a comment -

        if this is an automated test please rerun and confirm the behavior twice

        steve Steve Yen added a comment -

        per bug-scrub2

        steve Steve Yen made changes -
        Assignee Aleksey Kondratenko [ alkondratenko ] Andrei Baranouski [ andreibaranouski ]
        andreibaranouski Andrei Baranouski added a comment -

        3 tests are constantly failing with the same reason http://qa.hq.northscale.net/job/centos-64-2.0-failover-tests/457/consoleFull (1954 build)

        andreibaranouski Andrei Baranouski made changes -
        Assignee Andrei Baranouski [ andreibaranouski ] Aleksey Kondratenko [ alkondratenko ]
        iryna iryna added a comment - - edited

        can be easily reproduced on Windows; more info is in the duplicate bug MB-7205 (reproduced manually)

        farshid Farshid Ghods (Inactive) added a comment -

        so this failure occurs when the other node is unavailable?

        Iryna, when you reproduce this on Windows manually, does the rebalance succeed the second time?
        do you see any data loss? does rebalance fail immediately?

        iryna iryna added a comment -

        it does not succeed when retrying within some time window; after about 15 minutes it succeeds
        there is no data loss
        rebalance fails in the first 1-5 mins of rebalancing

        alkondratenko Aleksey Kondratenko (Inactive) added a comment -

        I cannot see any clear evidence of what caused this. But it looks bad enough. The 60-second timeout to query vbucket states on both nodes was hit here.

        I also see that the janitor pass immediately preceding rebalance also failed, but in a different phase: waiting for vbucket change requests to complete also failed on both nodes. And that timeout is 30 seconds.

        I'd like a diag to be grabbed immediately after rebalance fails so that I can see the state of janitor_agent and ns_memcached on each node. I.e. without doing any cleanup, please. May I have that?

        alkondratenko Aleksey Kondratenko (Inactive) added a comment -

        See above

        alkondratenko Aleksey Kondratenko (Inactive) made changes -
        Assignee Aleksey Kondratenko [ alkondratenko ] Farshid Ghods [ farshid ]
        farshid Farshid Ghods (Inactive) made changes -
        Assignee Farshid Ghods [ farshid ] Andrei Baranouski [ andreibaranouski ]
        andreibaranouski Andrei Baranouski added a comment -

        logs immediately after rebalance failure

        andreibaranouski Andrei Baranouski added a comment -

        rebalance is successful the second time

        andreibaranouski Andrei Baranouski made changes -
        Assignee Andrei Baranouski [ andreibaranouski ] Aleksey Kondratenko [ alkondratenko ]
        alkondratenko Aleksey Kondratenko (Inactive) added a comment -

        It's not quite immediately, sadly; a few seconds after the problem. And the backtraces on .118 (the node where we failed waiting) don't show anything bad, sadly.

        farshid Farshid Ghods (Inactive) added a comment -

        given that rebalance succeeded the second time, moving this to 2.0.1

        farshid Farshid Ghods (Inactive) made changes -
        Fix Version/s 2.0.1 [ 10399 ]
        Fix Version/s 2.0 [ 10114 ]
        dipti Dipti Borkar made changes -
        Priority Critical [ 2 ] Blocker [ 1 ]
        farshid Farshid Ghods (Inactive) added a comment -

        Andrei,

        is this test failing with the latest 2.0.1 build?

        farshid Farshid Ghods (Inactive) made changes -
        Assignee Aleksey Kondratenko [ alkondratenko ] Andrei Baranouski [ andreibaranouski ]
        andreibaranouski Andrei Baranouski added a comment -

        yes, 3 tests constantly fail in the failover job: http://qa.hq.northscale.net/view/2.0.1/job/centos-64-2.0-failover-tests/501/consoleFull
        andreibaranouski Andrei Baranouski added a comment -

        2.0.1-120-rel

        andreibaranouski Andrei Baranouski made changes -
        Assignee Andrei Baranouski [ andreibaranouski ] Aleksey Kondratenko [ alkondratenko ]
        Aliaksey Artamonau Aliaksey Artamonau made changes -
        Assignee Aleksey Kondratenko [ alkondratenko ] Aliaksey Artamonau [ aliaksey artamonau ]
        Aliaksey Artamonau Aliaksey Artamonau made changes -
        Assignee Aliaksey Artamonau [ aliaksey artamonau ] Aleksey Kondratenko [ alkondratenko ]
        farshid Farshid Ghods (Inactive) added a comment -

        per bug scrub

        retrying rebalance after a few minutes should work.

        farshid Farshid Ghods (Inactive) made changes -
        Assignee Aleksey Kondratenko [ alkondratenko ] Farshid Ghods [ farshid ]
        farshid Farshid Ghods (Inactive) added a comment -

        Aliaksey,

        I know this was discussed before, but we want to confirm what the test does against your conclusion:

        we put the node behind the firewall, then wait until the node is marked as unhealthy by ns_server
        then fail over this node and click rebalance to eject the node (ejectedNodes = 10.1.3.118)

        we want to know why, when the node was already failed over from the cluster, we wait for it to be ready as part of rebalance (or did the timeouts happen on the existing nodes?)

        farshid Farshid Ghods (Inactive) made changes -
        Assignee Farshid Ghods [ farshid ] Aleksey Kondratenko [ alkondratenko ]
        alkondratenko Aleksey Kondratenko (Inactive) added a comment -

        This is quite subtle to explain fully, but the main problem is that we have to wait until failover actually completes internally, and that is subject to timeouts in certain areas.

        So if you keep trying for 2-3 minutes it should eventually work.
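
        That retry guidance translates into a loop along these lines (a sketch only; start_rebalance and wait_for_rebalance are the hypothetical helpers from the sketches in the description):

            import time

            def rebalance_with_retry(known, ejected, attempts=6, delay=30):
                # Keep retrying for a few minutes: early attempts can fail with
                # not_all_nodes_are_ready_yet while failover completes internally.
                for _ in range(attempts):
                    start_rebalance(known, ejected)
                    result = wait_for_rebalance()
                    if "errorMessage" not in result:
                        return result
                    time.sleep(delay)
                raise RuntimeError("rebalance did not succeed after retries")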

        alkondratenko Aleksey Kondratenko (Inactive) made changes -
        Assignee Aleksey Kondratenko [ alkondratenko ] Farshid Ghods [ farshid ]
        farshid Farshid Ghods (Inactive) added a comment -

        >> but main problem is we have to wait until failover actually completes internally and that is subject to timeouts in certain areas.
        so when the failover REST API returns, it does not mean that the failover process has completed internally.
        is failover a synchronous call or asynchronous? in case it's asynchronous, is there a way for a test to check before initiating a rebalance, so that we have a test that is more deterministic?

        farshid Farshid Ghods (Inactive) made changes -
        Assignee Farshid Ghods [ farshid ] Aleksey Kondratenko [ alkondratenko ]
        alkondratenko Aleksey Kondratenko (Inactive) added a comment -

        It is best-effort sync. If it fails to be sync within a reasonably short timeout, it'll silently become async.

        There's no way to detect that right now.

        alkondratenko Aleksey Kondratenko (Inactive) made changes -
        Assignee Aleksey Kondratenko [ alkondratenko ] Farshid Ghods [ farshid ]
        farshid Farshid Ghods (Inactive) made changes -
        Assignee Farshid Ghods [ farshid ] Jin Lim [ jin ]
        farshid Farshid Ghods (Inactive) added a comment -

        per bug scrub - assigning to Jin

        jin Jin Lim added a comment -

        A few things came out of the engineering talk regarding this issue:
        1) This is a good catch in that it confirms we need a better way of handling the API, since it cannot 100% guarantee the completion of failover.
        2) However, it is not a critical or blocker issue, since the symptom is more obvious (highly probable) while running an automated test case.
        3) The only feasible approach (per the ns_server team) for now is to wait and retry.

        Based on this, and the fact that the fix would require changes across components (ep-engine, etc.), we may want to consider making this a future enhancement.
        Assigning this to Yaseen for his input here. Please assign it back to Jin or Dipti afterwards. Thanks.

        jin Jin Lim added a comment -

        Please see the above comment.

        jin Jin Lim made changes -
        Assignee Jin Lim [ jin ] Rahim Yaseen [ yaseen ]
        farshid Farshid Ghods (Inactive) added a comment -

        Thanks Jin for confirming this. I think this expected behavior can be included in the release notes as well, so that users and the support team are aware of the issue and the suggested workaround.

        Andrei,
        can you then modify the test accordingly?

        kzeller kzeller made changes -
        Labels 2.0.1-release-notes
        Flagged [Release Note]
        jin Jin Lim made changes -
        Assignee Rahim Yaseen [ yaseen ] Jin Lim [ jin ]
        farshid Farshid Ghods (Inactive) added a comment -

        per bug scrub :

        please revise the test accordingly, and after running the test a few times can you propose how long a customer should wait before kicking off the rebalance again

        farshid Farshid Ghods (Inactive) made changes -
        Assignee Jin Lim [ jin ] Andrei Baranouski [ andreibaranouski ]
        jin Jin Lim added a comment - - edited

        Please assign it back to Jin after having enough information for recommendation. Will follow up with the doc team.

        farshid Farshid Ghods (Inactive) made changes -
        Assignee Andrei Baranouski [ andreibaranouski ] Deepkaran Salooja [ deepkaran.salooja ]
        andreibaranouski Andrei Baranouski added a comment -

        with a 90 sec timeout after failover before rebalance, tests passed
        will launch with 60 sec

        farshid Farshid Ghods (Inactive) added a comment -

        can you try with a lower timeout (10 seconds, for instance)?

        andreibaranouski Andrei Baranouski added a comment -

        10 secs was the initial state for this ticket.

        [2012-11-12 11:29:07,545] - [failovertests:292] INFO - 10 seconds sleep after failover before invoking rebalance...

        60 secs tests passed

        will try with 30 sec

        jin Jin Lim added a comment -

        any update on the test w/ 30 sec?

        andreibaranouski Andrei Baranouski added a comment -

        30 sec, build 151, tests passed http://qa.hq.northscale.net/view/2.0.1/job/centos-64-2.0-failover-tests/536/
        farshid Farshid Ghods (Inactive) added a comment -

        Karen,

        could you please add to the release notes that the user needs to wait 30 seconds before attempting a rebalance operation after failing over a node.
        comments on this bug should explain under what conditions this 30-second delay is needed

        farshid Farshid Ghods (Inactive) made changes -
        Assignee Deepkaran Salooja [ deepkaran.salooja ] Karen Zeller [ kzeller ]
        farshid Farshid Ghods (Inactive) made changes -
        Component/s documentation [ 10012 ]
        Component/s ns_server [ 10019 ]
        kzeller kzeller made changes -
        Summary Rebalance exited with reason {not_all_nodes_are_ready_yet after failover node [RN 2.0.1]]Rebalance exited with reason {not_all_nodes_are_ready_yet after failover node
        kzeller kzeller added a comment -

        Jin,

        Can you summarize the situation when someone should wait 30 seconds to reattempt rebalance?

        Thanks!

        kzeller kzeller made changes -
        Assignee Karen Zeller [ kzeller ] Jin Lim [ jin ]
        kzeller kzeller made changes -
        Planned End (re-schedule end date based on new assignee)
        jin Jin Lim added a comment - - edited

        The failover REST API is a sync operation with a timeout. When it fails to complete the failover process within the timeout period, it internally switches to an async operation (continues the failover to completion) and immediately returns. A subsequent rebalance in this case would fail because the failover process is still running. The user can wait between 30 seconds and a minute and reattempt the rebalance.
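
        In code terms, the workaround this thread converges on looks like the following sketch (assuming the standard /controller/failOver endpoint; the settle period and helper name are ours):

            import time
            import requests

            BASE = "http://10.1.3.114:8091"
            AUTH = ("Administrator", "password")

            def failover_and_settle(otp_node, settle_seconds=60):
                # The failover call is best-effort synchronous: past its internal
                # timeout it keeps working in the background and returns anyway,
                # so we pad with a settle period before starting the rebalance.
                requests.post(
                    BASE + "/controller/failOver",
                    auth=AUTH,
                    data={"otpNode": otp_node},
                ).raise_for_status()
                time.sleep(settle_seconds)

            failover_and_settle("ns_1@10.1.3.118")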

        jin Jin Lim made changes -
        Assignee Jin Lim [ jin ] Karen Zeller [ kzeller ]
        jin Jin Lim made changes -
        Planned End (re-schedule end date based on new assignee)
        jin Jin Lim added a comment -

        The failover REST API timeout is 30 seconds.

        kzeller kzeller added a comment -

        <para>
        A cluster rebalance may exit and produce the error

        {not_all_nodes_are_ready_yet}

        if you perform the rebalance right
        after failing over a node in the cluster. You may need to
        wait 30 seconds after the node failover before you
        attempt the cluster rebalance.
        </para>
        <para>This is because the failover REST API is a synchronous operation with a timeout. If it fails to
        complete the failover process by the timeout, the operation internally switches into an
        asynchronous operation. It will immediately return and re-attempt failover in the background, which will cause
        rebalance to fail since the failover operation is still running.</para>

        kzeller kzeller made changes -
        Status Open [ 1 ] Resolved [ 5 ]
        Resolution Fixed [ 1 ]
        kzeller kzeller added a comment -

        Added to RN as :

        <para>
        A cluster rebalance may exit and produce the error

        {not_all_nodes_are_ready_yet}

        if you perform the rebalance right
        after failing over a node in the cluster. You may need to
        wait 30 seconds after the node failover before you
        attempt the cluster rebalance.
        </para>
        <para>This is because the failover REST API is a synchronous operation with a timeout. If it fails to
        complete the failover process by the timeout, the operation internally switches into an
        asynchronous operation. It will immediately return and re-attempt failover in the background, which will cause
        rebalance to fail since the failover operation is still running.</para>

        kzeller kzeller made changes -
        Status Resolved [ 5 ] Closed [ 6 ]
        andreibaranouski Andrei Baranouski added a comment -

        Jin, I got the same error with 30 seconds waiting after failover on build 2.0.2-761-rel
        Should we update RN for 2.0.1/next 2.0.2 release?
        http://qa.hq.northscale.net/view/2.0.1/job/centos-64-2.0-failover-tests/583/consoleFull

        [2013-04-08 19:41:13,981] - [failovertests:148] INFO - failed over node : ns_1@10.1.3.115
        [2013-04-08 19:41:13,981] - [failovertests:163] INFO - 30 seconds sleep after failover before invoking rebalance...
        [2013-04-08 19:41:43,981] - [rest_client:834] INFO - rebalance params : password=password&ejectedNodes=ns_1%4010.1.3.115&user=Administrator&knownNodes=ns_1%4010.1.3.114%2Cns_1%4010.1.3.115%2Cns_1%4010.1.3.118%2Cns_1%4010.1.3.116
        [2013-04-08 19:41:43,991] - [rest_client:838] INFO - rebalance operation started
        [2013-04-08 19:41:44,004] - [rest_client:940] INFO - rebalance percentage : 0 %
        [2013-04-08 19:41:46,009] - [rest_client:940] INFO - rebalance percentage : 0.0 %
        [2013-04-08 19:41:48,014] - [rest_client:940] INFO - rebalance percentage : 0.0 %
        [2013-04-08 19:41:50,018] - [rest_client:940] INFO - rebalance percentage : 0.0 %
        [2013-04-08 19:41:52,023] - [rest_client:940] INFO - rebalance percentage : 0.0 %
        [2013-04-08 19:41:54,029] - [rest_client:940] INFO - rebalance percentage : 0.0 %
        [2013-04-08 19:41:56,033] - [rest_client:940] INFO - rebalance percentage : 0.0 %
        [2013-04-08 19:41:58,039] - [rest_client:940] INFO - rebalance percentage : 0.0 %
        [2013-04-08 19:42:00,043] - [rest_client:940] INFO - rebalance percentage : 0.0 %
        [2013-04-08 19:42:02,048] - [rest_client:940] INFO - rebalance percentage : 0.0 %
        [2013-04-08 19:42:04,056] - [rest_client:940] INFO - rebalance percentage : 0.0 %
        [2013-04-08 19:42:06,061] - [rest_client:940] INFO - rebalance percentage : 0.0 %
        [2013-04-08 19:42:08,066] - [rest_client:940] INFO - rebalance percentage : 0.0 %
        [2013-04-08 19:42:10,071] - [rest_client:940] INFO - rebalance percentage : 0.0 %
        [2013-04-08 19:42:12,077] - [rest_client:940] INFO - rebalance percentage : 0.0 %
        [2013-04-08 19:42:14,084] - [rest_client:940] INFO - rebalance percentage : 0.0 %
        [2013-04-08 19:42:16,089] - [rest_client:940] INFO - rebalance percentage : 0.0 %
        [2013-04-08 19:42:18,096] - [rest_client:940] INFO - rebalance percentage : 0.0 %
        [2013-04-08 19:42:20,102] - [rest_client:940] INFO - rebalance percentage : 0.0 %
        [2013-04-08 19:42:22,108] - [rest_client:940] INFO - rebalance percentage : 0.0 %
        [2013-04-08 19:42:24,113] - [rest_client:940] INFO - rebalance percentage : 0.0 %
        [2013-04-08 19:42:26,119] - [rest_client:940] INFO - rebalance percentage : 0.0 %
        [2013-04-08 19:42:28,125] - [rest_client:940] INFO - rebalance percentage : 0.0 %
        [2013-04-08 19:42:30,130] - [rest_client:940] INFO - rebalance percentage : 0.0 %
        [2013-04-08 19:42:32,137] - [rest_client:940] INFO - rebalance percentage : 0.0 %
        [2013-04-08 19:42:34,143] - [rest_client:940] INFO - rebalance percentage : 0.0 %
        [2013-04-08 19:42:36,152] - [rest_client:940] INFO - rebalance percentage : 0.0 %
        [2013-04-08 19:42:38,157] - [rest_client:940] INFO - rebalance percentage : 0.0 %
        [2013-04-08 19:42:40,163] - [rest_client:940] INFO - rebalance percentage : 0.0 %
        [2013-04-08 19:42:42,168] - [rest_client:940] INFO - rebalance percentage : 0.0 %
        [2013-04-08 19:42:44,174] - [rest_client:923] ERROR -

        {u'status': u'none', u'errorMessage': u'Rebalance failed. See logs for detailed reason. You can try rebalance again.'}

        - rebalance failed
        [2013-04-08 19:42:44,175] - [rest_client:924] INFO - Latest logs from UI:
        [2013-04-08 19:42:44,273] - [rest_client:925] ERROR - {u'node': u'ns_1@10.1.3.114', u'code': 2, u'text': u"Rebalance exited with reason {not_all_nodes_are_ready_yet,\n

        andreibaranouski Andrei Baranouski made changes -
        Resolution Fixed [ 1 ]
        Status Closed [ 6 ] Reopened [ 4 ]
        Assignee Karen Zeller [ kzeller ] Jin Lim [ jin ]
        andreibaranouski Andrei Baranouski added a comment -

        it's happened twice in this run

        jin Jin Lim added a comment -

        Thanks for the update, Andrei. Before we update the RN, let's first figure out how long a user should wait before retrying the rebalance. Can we increase the wait period to 1 minute and see if that is long enough? Thanks.

        jin Jin Lim made changes -
        Assignee Jin Lim [ jin ] Andrei Baranouski [ andreibaranouski ]
        andreibaranouski Andrei Baranouski added a comment -

        set the timeout to 1 min http://review.couchbase.org/#/c/25587/ ; before that we had a 60 sec timeout and the tests did not fail http://www.couchbase.com/issues/browse/MB-7168?focusedCommentId=49128&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-49128
        thuan Thuan Nguyen added a comment -

        Integrated in ui-testing #42 (See http://qa.hq.northscale.net/job/ui-testing/42/)

        Result = SUCCESS

        kzeller kzeller made changes -
        Summary [RN 2.0.1]]Rebalance exited with reason {not_all_nodes_are_ready_yet after failover node [RN 2.0.2?]]Rebalance exited with reason {not_all_nodes_are_ready_yet after failover node
        Labels 2.0.1-release-notes 2.0.1-release-notes 2.0.2-release-notes
        andreibaranouski Andrei Baranouski added a comment - - edited

        tested with a 60 sec timeout, all tests passed

        andreibaranouski Andrei Baranouski made changes -
        Status Reopened [ 4 ] Resolved [ 5 ]
        Resolution Fixed [ 1 ]
        andreibaranouski Andrei Baranouski added a comment -

        Jin, I think recommending in the RN a 60-second wait after failover should be okay

        andreibaranouski Andrei Baranouski made changes -
        Resolution Fixed [ 1 ]
        Status Resolved [ 5 ] Reopened [ 4 ]
        andreibaranouski Andrei Baranouski made changes -
        Assignee Andrei Baranouski [ andreibaranouski ] Jin Lim [ jin ]
        dipti Dipti Borkar made changes -
        Rank Ranked higher
        wayne Wayne Siu made changes -
        Summary [RN 2.0.2?]]Rebalance exited with reason {not_all_nodes_are_ready_yet after failover node [RN 2.0.2]Rebalance exited with reason {not_all_nodes_are_ready_yet after failover node
        wayne Wayne Siu made changes -
        Fix Version/s 2.0.2 [ 10418 ]
        Fix Version/s 2.0.1 [ 10399 ]
        wayne Wayne Siu made changes -
        Affects Version/s 2.0.1 [ 10399 ]
        wayne Wayne Siu made changes -
        Priority Blocker [ 1 ] Critical [ 2 ]
        wayne Wayne Siu added a comment -

        Karen, please update the release notes to suggest 60 sec (not 30). Also lowering the priority from Blocker to Critical as it's not a 2.0.2 release blocking issue.

        wayne Wayne Siu made changes -
        Assignee Jin Lim [ jin ] Karen Zeller [ kzeller ]
        kzeller kzeller added a comment -

        Ok, changed to: A cluster rebalance may exit and produce the error

        {not_all_nodes_are_ready_yet}

        if you perform the rebalance right
        after failing over a node in the cluster. You may need to
        wait 60 seconds after the node failover before you
        attempt the cluster rebalance.

        in RN 2.0.2

        kzeller kzeller added a comment -

        added to RN 2.0.2

        kzeller kzeller made changes -
        Assignee Karen Zeller [ kzeller ] Wayne Siu [ wayne ]
        kzeller kzeller made changes -
        Component/s documentation [ 10012 ]
        kzeller kzeller made changes -
        Summary [RN 2.0.2]Rebalance exited with reason {not_all_nodes_are_ready_yet after failover node [Doc'd 2.0.2] Rebalance exited with reason {not_all_nodes_are_ready_yet after failover node
        thuan Thuan Nguyen added a comment -

        Integrated in win-ui-testing-P0 #56 (See http://qa.hq.northscale.net/job/win-ui-testing-P0/56/)
        MB-7168: sleep 30 sec after reb falied when killed memcached (Revision 44f755b962b7987d5b972caf9a283baa95edaed1)

        Result = SUCCESS
        andrei :
        Files :

        • pytests/swaprebalance.py
        wayne Wayne Siu made changes -
        Summary [Doc'd 2.0.2] Rebalance exited with reason {not_all_nodes_are_ready_yet after failover node [RN 2.0.2] Rebalance exited with reason {not_all_nodes_are_ready_yet after failover node
        wayne Wayne Siu added a comment - - edited

        Verified that the doc change has also gone to 2.0.1 RN.

        http://www.couchbase.com/docs/couchbase-manual-2.0/couchbase-server-rn_2-0-0l.html

        wayne Wayne Siu made changes -
        Assignee Wayne Siu [ wayne ] Anil Kumar [ anil ]
        wayne Wayne Siu made changes -
        Affects Version/s 2.0.2 [ 10418 ]
        wayne Wayne Siu made changes -
        Fix Version/s 2.1 [ 10414 ]
        Fix Version/s 2.0.2 [ 10418 ]
        wayne Wayne Siu added a comment -

        Assigning to PM for the next step.

        anil Anil Kumar added a comment -

        discussed with ALK; this bug will be fixed in 2.1. we talked about having a UI alert, but since this happens only when a node dies completely, for now a Release Note as a Known Issue should be fine. thanks

        anil Anil Kumar made changes -
        Assignee Anil Kumar [ anil ] Aleksey Kondratenko [ alkondratenko ]
        alkondratenko Aleksey Kondratenko (Inactive) added a comment -

        Updated ticket to reflect badness of this and must-have-ness for 2.1

        alkondratenko Aleksey Kondratenko (Inactive) made changes -
        Summary [RN 2.0.2] Rebalance exited with reason {not_all_nodes_are_ready_yet after failover node [RN 2.0.2] failover of node that's completely down is still not quick (was: Rebalance exited with reason {not_all_nodes_are_ready_yet after failover node)
        Priority Critical [ 2 ] Blocker [ 1 ]
        kzeller kzeller made changes -
        Summary [RN 2.0.2] failover of node that's completely down is still not quick (was: Rebalance exited with reason {not_all_nodes_are_ready_yet after failover node) [Done- RN 2.0.2] failover of node that's completely down is still not quick (was: Rebalance exited with reason {not_all_nodes_are_ready_yet after failover node)
        kzeller kzeller added a comment -

        added flag to include the 2.0.1 release note in 2.0.2.

        thuan Thuan Nguyen added a comment -

        Integrated in windows32_sanity_P0 #29 (See http://qa.hq.northscale.net/job/windows32_sanity_P0/29/)

        Result = UNSTABLE

        thuan Thuan Nguyen added a comment -

        Integrated in windows32_view_P0 #6 (See http://qa.hq.northscale.net/job/windows32_view_P0/6/)

        Result = UNSTABLE

        kzeller kzeller made changes -
        Summary [Done- RN 2.0.2] failover of node that's completely down is still not quick (was: Rebalance exited with reason {not_all_nodes_are_ready_yet after failover node) [Doc'd] failover of node that's completely down is still not quick (was: Rebalance exited with reason {not_all_nodes_are_ready_yet after failover node)
        maria Maria McDuff (Inactive) made changes -
        Fix Version/s 2.2.0 [ 10620 ]
        Fix Version/s .major-release [ 10414 ]
        kzeller kzeller made changes -
        Labels 2.0.1-release-notes 2.0.2-release-notes
        Hide
        kzeller kzeller added a comment -

        Removing RN flag until ID'd by QA/Eng for 2.2.0

        Show
        kzeller kzeller added a comment - Removing RN flag until ID'd by QA/Eng for 2.2.0
        kzeller kzeller made changes -
        Summary [Doc'd] failover of node that's completely down is still not quick (was: Rebalance exited with reason {not_all_nodes_are_ready_yet after failover node) [Doc 2.2.0] failover of node that's completely down is still not quick (was: Rebalance exited with reason {not_all_nodes_are_ready_yet after failover node)
        alkondratenko Aleksey Kondratenko (Inactive) made changes -
        Labels ns_server-story
        anil Anil Kumar made changes -
        Rank Ranked higher
        anil Anil Kumar made changes -
        Rank Ranked higher
        anil Anil Kumar made changes -
        Rank Ranked lower
        anil Anil Kumar made changes -
        Rank Ranked higher
        anil Anil Kumar made changes -
        Rank Ranked higher
        anil Anil Kumar made changes -
        Rank Ranked higher
        anil Anil Kumar made changes -
        Sprint Sprint 1 [ 37 ]
        mikew Mike Wiederhold made changes -
        Component/s ns_server [ 10019 ]
        alkondratenko Aleksey Kondratenko (Inactive) added a comment -

        We've discussed this as part of scrum planning.

        The thinking is that the upr changes have the best chance of addressing this. Otherwise it's hard, and too late for 2.2.0.

        alkondratenko Aleksey Kondratenko (Inactive) made changes -
        Fix Version/s 3.0 [ 10414 ]
        Fix Version/s 2.2.0 [ 10620 ]
        alkondratenko Aleksey Kondratenko (Inactive) added a comment -

        Removing this from our backlog. Given this was moved out of 2.2.0 and some upr work is planned for 3.0 (which will change a lot in this area), there's nothing we can do at this time, and we'll wait until the upr work gets clearer.

        alkondratenko Aleksey Kondratenko (Inactive) made changes -
        Labels ns_server-story
        kzeller kzeller made changes -
        Summary [Doc 2.2.0] failover of node that's completely down is still not quick (was: Rebalance exited with reason {not_all_nodes_are_ready_yet after failover node) [RN 2.2.0] failover of node that's completely down is still not quick (was: Rebalance exited with reason {not_all_nodes_are_ready_yet after failover node)
        kzeller kzeller made changes -
        Summary [RN 2.2.0] failover of node that's completely down is still not quick (was: Rebalance exited with reason {not_all_nodes_are_ready_yet after failover node) [Doc'd 2.2.0] failover of node that's completely down is still not quick (was: Rebalance exited with reason {not_all_nodes_are_ready_yet after failover node)
        kzeller kzeller added a comment -

        Added as Known issue for 2.2 in RN.

        • A cluster rebalance may exit and produce the error {not_all_nodes_are_ready_yet}

          if you perform the rebalance right after failing over a node in the cluster. You may need to wait 60 seconds after the node failover before you attempt the cluster rebalance.

        This is because the failover REST API is a synchronous operation with a timeout. If it fails to complete the failover process by the timeout, the operation internally switches into an asynchronous operation. It will immediately return and re-attempt failover in the background, which will cause rebalance to fail since the failover operation is still running.

        Issues: MB-7168 (http://www.couchbase.com/issues/browse/MB-7168)

        alkondratenko Aleksey Kondratenko (Inactive) added a comment -

        MB-8039 fixed the problem for all but the master node.

        Fixing it for the master node is possible work for 3.0 but will take at least a week.

        alkondratenko Aleksey Kondratenko (Inactive) added a comment -

        MB-9321
        alkondratenko Aleksey Kondratenko (Inactive) made changes -
        Status Reopened [ 4 ] Resolved [ 5 ]
        Resolution Duplicate [ 3 ]
        maria Maria McDuff (Inactive) added a comment -

        closing as dupes.

        maria Maria McDuff (Inactive) made changes -
        Status Resolved [ 5 ] Closed [ 6 ]

          People

          • Assignee:
            alkondratenko Aleksey Kondratenko (Inactive)
          • Reporter:
            andreibaranouski Andrei Baranouski
          • Votes: 0
          • Watchers: 9
