Description
http://qa.hq.northscale.net/job/centos-64-2.0-upgrade/89/consoleFull
./testrunner -i /tmp/upgrade.ini get-logs=False,upgrade_version=2.0.1-169-rel,initial_vbuckets=64 -t newupgradetests.MultiNodesUpgradeTests.online_upgrade_rebalance_in_out,initial_version=2.0.0-1976-rel,items=50000,expire_time=10000,wait_expire=true,GROUP=2_0;ONLINE
Steps:
1. Create a cluster of two 2.0.0-1976 nodes (10.3.3.11, 10.3.3.14).
2. Load 50000 items with expiration=10000.
3. Install 2.0.1-169 on 10.3.3.13 and 10.3.3.16.
4. After installation the test slept for 10000 seconds (~2.8 hours) and then tried to add the new nodes to the cluster:
[2013-02-28 10:49:37,885] - [basetestcase:147] INFO - sleep for 10 secs. Installation of new version is done. Wait for rebalance ...
[2013-02-28 10:49:47,896] - [basetestcase:147] INFO - sleep for 10000 secs. ...
[2013-02-28 13:36:28,396] - [task:242] INFO - adding node 10.3.3.13:8091 to cluster
[2013-02-28 13:36:28,401] - [rest_client:721] INFO - adding remote node @10.3.3.13:8091 to this cluster @10.3.3.11:8091
[2013-02-28 13:36:31,520] - [task:242] INFO - adding node 10.3.3.16:8091 to cluster
[2013-02-28 13:36:31,521] - [rest_client:721] INFO - adding remote node @10.3.3.16:8091 to this cluster @10.3.3.11:8091
[2013-02-28 13:36:32,538] - [rest_client:578] ERROR - http://10.3.3.11:8091/controller/addNode error 400 reason: unknown ["Prepare join failed. Could not connect to \"10.3.3.16\" on port 8091. This could be due to an incorrect host/port combination or a firewall in place between the servers."]
[2013-02-28 13:36:32,538] - [rest_client:741] ERROR - add_node error : ["Prepare join failed. Could not connect to \"10.3.3.16\" on port 8091. This could be due to an incorrect host/port combination or a firewall in place between the servers."]
ERROR
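As a quick sanity check on step 4, the gap between the "sleep for 10000 secs" line and the first "adding node" line can be computed from the testrunner timestamps. This is a minimal sketch that only uses the two timestamps quoted above; the variable names are illustrative and not part of testrunner.

```python
from datetime import datetime

# Timestamp format used by the testrunner log lines quoted above
# (comma before the milliseconds).
TESTRUNNER_FMT = "%Y-%m-%d %H:%M:%S,%f"

# Both values are copied verbatim from the log excerpt.
sleep_started = datetime.strptime("2013-02-28 10:49:47,896", TESTRUNNER_FMT)
add_node_started = datetime.strptime("2013-02-28 13:36:28,396", TESTRUNNER_FMT)

# Elapsed wall-clock time between the two log lines.
elapsed = (add_node_started - sleep_started).total_seconds()
print(f"{elapsed:.1f} s = {elapsed / 3600:.2f} h")
```

The elapsed time matches the configured expire_time of 10000 seconds, so the add-node attempt started right after the sleep ended.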
10.3.3.16 was not added. Its logs from around that time contain:
[ns_server:error,2013-02-28T13:36:13.437,ns_1@127.0.0.1:ns_heart<0.26082.0>:ns_heart:grab_samples_loading_tasks:328]Failed to grab samples loader tasks: {exit,
{noproc,
{gen_server,call,
[samples_loader_tasks,get_tasks,
2000]}},
[
{ns_heart,grab_samples_loading_tasks,0},
{ns_heart,current_status,0},
{ns_heart,handle_info,2},
{gen_server,handle_msg,5},
{proc_lib,init_p_do_apply,3}]}
[ns_server:warn,2013-02-28T13:36:19.112,ns_1@127.0.0.1:mb_master<0.26095.0>:mb_master:handle_info:232]Skipped 1 heartbeats
[ns_server:warn,2013-02-28T13:36:26.792,ns_1@127.0.0.1:mb_master<0.26095.0>:mb_master:handle_info:232]Skipped 3 heartbeats
[ns_server:warn,2013-02-28T13:36:35.218,ns_1@127.0.0.1:mb_master<0.26095.0>:mb_master:handle_info:232]Skipped 2 heartbeats
[error_logger:info,2013-02-28T13:36:35.295,ns_1@127.0.0.1:error_logger<0.6.0>:ale_error_logger_handler:log_report:72]
=========================PROGRESS REPORT=========================
supervisor: {local,ns_server_sup}
started: [{pid,<0.26105.0>},
{name,master_activity_events_keeper},
{mfargs,{master_activity_events_keeper,start_link,[]}},
{restart_type,permanent},
{shutdown,brutal_kill},
{child_type,worker}]
[ns_server:error,2013-02-28T13:36:35.359,ns_1@127.0.0.1:ns_heart<0.26082.0>:ns_heart:grab_samples_loading_tasks:328]Failed to grab samples loader tasks: {exit,
{noproc,
{gen_server,call,
[samples_loader_tasks,get_tasks,
2000]}},
[{gen_server,call,3}
,
,
,
,
,
]}
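To see where the failed addNode call falls relative to the entries in the 10.3.3.16 log, the timestamps from both excerpts can be lined up. This is a rough sketch that assumes the clocks on the test machine and on 10.3.3.16 are synchronized, which the report does not confirm; the helper names are hypothetical.

```python
from datetime import datetime

def parse_testrunner(ts):
    # testrunner format, e.g. "2013-02-28 13:36:32,538"
    return datetime.strptime(ts, "%Y-%m-%d %H:%M:%S,%f")

def parse_ns_server(ts):
    # ns_server format, e.g. "2013-02-28T13:36:26.792"
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S.%f")

# addNode error reported by rest_client (testrunner log).
add_node_failed = parse_testrunner("2013-02-28 13:36:32,538")

# Entries from the 10.3.3.16 log excerpt above.
node_events = [
    parse_ns_server(ts)
    for ts in (
        "2013-02-28T13:36:13.437",  # ns_heart: failed to grab samples loader tasks
        "2013-02-28T13:36:19.112",  # mb_master: skipped 1 heartbeats
        "2013-02-28T13:36:26.792",  # mb_master: skipped 3 heartbeats
        "2013-02-28T13:36:35.218",  # mb_master: skipped 2 heartbeats
        "2013-02-28T13:36:35.295",  # progress report: master_activity_events_keeper started
        "2013-02-28T13:36:35.359",  # ns_heart: failed to grab samples loader tasks
    )
]

# Closest node log entries before and after the failed request.
previous_event = max(t for t in node_events if t <= add_node_failed)
next_event = min(t for t in node_events if t > add_node_failed)

gap_before = (add_node_failed - previous_event).total_seconds()
gap_after = (next_event - add_node_failed).total_seconds()
print(f"previous node log entry {gap_before:.3f} s earlier, next {gap_after:.3f} s later")
```

Under that clock assumption, the failure falls between two "Skipped heartbeats" warnings on 10.3.3.16, shortly before the supervisor reports a child start.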
I also see several crashes on the nodes, even though they are clean. (Should we fix that? Is it a separate bug?)