Description
In our cleanup method we now wait until the ejected nodes are fully cleaned, for example:
    for removed in [node for node in nodes if node.id != master_id]:
        removed.rest_password = servers[0].rest_password
        removed.rest_username = servers[0].rest_username
        rest = RestConnection(removed)
        start = time.time()
        while time.time() - start < 30:
            if len(rest.get_pools_info()["pools"]) == 0:
                success_cleaned.append(removed)
                break
            else:
                time.sleep(0.1)
        if time.time() - start > 10:
            log.error("'pools' on node {0}:{1} - {2}".format(
                      removed.ip, removed.port, rest.get_pools_info()["pools"]))
But 30 seconds is not enough; for instance:
http://qa.hq.northscale.net/job/ubuntu-32-2.0-swaprebalance-tests-P0/305/consoleFull
[2012-10-25 07:01:05,403] - [cluster_helper:199] INFO - rebalancing all nodes in order to remove nodes
[2012-10-25 07:01:05,408] - [rest_client:881] INFO - rebalance params : password=password&ejectedNodes=ns_1%4010.3.2.153%2Cns_1%4010.3.2.158%2Cns_1%4010.3.2.155&user=Administrator&knownNodes=ns_1%4010.3.2.157%2Cns_1%4010.3.2.153%2Cns_1%4010.3.2.158%2Cns_1%4010.3.2.155
[2012-10-25 07:01:05,419] - [rest_client:888] INFO - rebalance operation started
[2012-10-25 07:01:05,430] - [rest_client:984] INFO - rebalance percentage : 0 %
[2012-10-25 07:01:09,441] - [rest_client:941] INFO - rebalance progress took 4.02120304108 seconds
[2012-10-25 07:01:09,441] - [rest_client:942] INFO - sleep for 4.02120304108 seconds after rebalance...
[2012-10-25 07:01:43,577] - [cluster_helper:219] ERROR - 'pools' on node 10.3.2.153:8091 - [
]
[2012-10-25 07:01:43,609] - [cluster_helper:222] ERROR - node 10.3.2.155:8091 was not cleaned after removing from cluster
ERROR
======================================================================
ERROR: test_failover_swap_rebalance (swaprebalance.SwapRebalanceFailedTests)
----------------------------------------------------------------------
Traceback (most recent call last):
File "pytests/swaprebalance.py", line 585, in tearDown
SwapRebalanceBase.common_tearDown(self)
File "pytests/swaprebalance.py", line 65, in common_tearDown
SwapRebalanceBase.reset(self)
File "pytests/swaprebalance.py", line 76, in reset
ClusterOperationHelper.cleanup_cluster([server])
File "lib/membase/helper/cluster_helper.py", line 225, in cleanup_cluster
raise Exception("not all ejected nodes were cleaned successfully")
Exception: not all ejected nodes were cleaned successfully
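One way to make this tearDown wait more tolerant of slow ejection is to raise the limit and factor the polling into a helper with a configurable timeout. A minimal sketch (the helper name, the `get_pools` callable, and the default timeout are assumptions for illustration, not the current testrunner API):

```python
import time

def wait_for_pools_cleanup(get_pools, timeout=60, poll_interval=0.1):
    """Poll get_pools() until the node reports an empty "pools" list
    or `timeout` seconds elapse.

    get_pools: callable returning the node's current "pools" list
               (e.g. lambda: rest.get_pools_info()["pools"]).
    Returns True if the node was cleaned within the timeout, else False.
    """
    start = time.time()
    while time.time() - start < timeout:
        if len(get_pools()) == 0:
            return True
        time.sleep(poll_interval)
    return False
```

The caller can then log the node and the leftover pools only once, after the helper returns False, instead of hard-coding the 30-second limit in the loop.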
We remove the buckets, then rebalance out all nodes, and then find that one node still has a pool.
There are a lot of failures with the same error on cluster runs (buildbot).
Alk, if we rejoin an ejected node to the cluster immediately, our tests apparently fail with errors such as: ["Prepare join failed. Timeout connecting...
Logs from 10.3.2.153 after we ejected it from the cluster:
[user:info,2012-10-25T7:03:48.085,ns_1@10.3.2.153:<0.20935.2>:ns_orchestrator:idle:492]Starting rebalance, KeepNodes = ['ns_1@10.3.2.157'], EjectNodes = ['ns_1@10.3.2.153',
'ns_1@10.3.2.158',
'ns_1@10.3.2.155']
[ns_server:debug,2012-10-25T7:03:48.085,ns_1@10.3.2.153:ns_config_rep<0.27824.1>:ns_config_rep:do_push_keys:317]Replicating some config keys ([{node,'ns_1@10.3.2.157',membership}]..)
[ns_server:debug,2012-10-25T7:03:48.090,ns_1@10.3.2.153:ns_config_rep<0.27824.1>:ns_config_rep:do_push_keys:317]Replicating some config keys ([counters,rebalance_status,rebalancer_pid]..)
[ns_server:debug,2012-10-25T7:03:48.093,ns_1@10.3.2.153:ns_config_log<0.344.0>:ns_config_log:log_common:111]config change:
counters ->
[..., {rebalance_success,13}, {failover_node,10}]
[ns_server:debug,2012-10-25T7:03:48.098,ns_1@10.3.2.153:ns_config_log<0.344.0>:ns_config_log:log_common:111]config change:
rebalancer_pid ->
<0.21667.2>
[ns_server:debug,2012-10-25T7:03:48.100,ns_1@10.3.2.153:ns_config_log<0.344.0>:ns_config_log:log_common:111]config change:
rebalance_status ->
running
[rebalance:debug,2012-10-25T7:03:48.104,ns_1@10.3.2.153:<0.21667.2>:ns_rebalancer:rebalance:231]BucketConfigs = []
[cluster:debug,2012-10-25T7:03:48.115,ns_1@10.3.2.153:<0.21667.2>:ns_cluster:leave:216]Asking node 'ns_1@10.3.2.158' to leave the cluster
[ns_server:debug,2012-10-25T7:03:48.115,ns_1@10.3.2.153:ns_config_log<0.344.0>:ns_config_log:log_common:111]config change:
inactiveFailed
[cluster:debug,2012-10-25T7:03:48.116,ns_1@10.3.2.153:<0.21667.2>:ns_cluster:shun:235]Shunning 'ns_1@10.3.2.158'
[ns_server:debug,2012-10-25T7:03:48.116,ns_1@10.3.2.153:ns_config_rep<0.27824.1>:ns_config_rep:do_push_keys:317]Replicating some config keys ([{node,'ns_1@10.3.2.158',membership}]..)
[cluster:debug,2012-10-25T7:03:48.117,ns_1@10.3.2.153:<0.21667.2>:ns_cluster:leave:216]Asking node 'ns_1@10.3.2.155' to leave the cluster
[cluster:debug,2012-10-25T7:03:48.118,ns_1@10.3.2.153:<0.21667.2>:ns_cluster:shun:235]Shunning 'ns_1@10.3.2.155'
[user:info,2012-10-25T7:03:48.119,ns_1@10.3.2.153:<0.20935.2>:ns_orchestrator:handle_info:309]Rebalance completed successfully.
For almost a minute, 10.3.2.153 keeps trying to get various information from other nodes:
[ns_server:info,2012-10-25T7:04:35.938,ns_1@10.3.2.153:ns_config_rep<0.27824.1>:ns_config_rep:do_pull:341]Pulling config from: 'ns_1@10.3.2.157'
[ns_server:debug,2012-10-25T7:04:36.270,ns_1@10.3.2.153:ns_heart<0.27845.1>:ns_heart:current_status:139]Ignoring failure to grab system stats:
{'EXIT',{noproc,{gen_server,call,
                 [{'stats_reader-@system','ns_1@10.3.2.153'},
                  {latest,"minute"}]}}}
[ns_server:debug,2012-10-25T7:04:36.270,ns_1@10.3.2.153:ns_heart<0.27845.1>:ns_heart:grab_local_xdcr_replications:282]Ignoring exception getting xdcr replication infos
{exit,{noproc,{gen_server,call,[xdc_replication_sup,which_children,infinity]}},
[{gen_server,call,3},
{xdc_replication_sup,all_local_replication_infos,0},
{ns_heart,grab_local_xdcr_replications,0},
{ns_heart,current_status,0},
{ns_heart,handle_info,2},
{gen_server,handle_msg,5},
{proc_lib,init_p_do_apply,3}]}
[ns_server:error,2012-10-25T7:04:36.271,ns_1@10.3.2.153:ns_heart<0.27845.1>:ns_heart:grab_samples_loading_tasks:294]Failed to grab samples loader tasks: {exit,
{noproc,
{gen_server,call,
[samples_loader_tasks,get_tasks,
2000]}},
[{gen_server,call,3},
{ns_heart,grab_samples_loading_tasks,0},
{ns_heart,current_status,0},
{ns_heart,handle_info,2},
{gen_server,handle_msg,5},
{proc_lib,init_p_do_apply,3}]}
[ns_server:debug,2012-10-25T7:04:41.269,ns_1@10.3.2.153:ns_heart<0.27845.1>:ns_heart:current_status:139]Ignoring failure to grab system stats:
{'EXIT',{noproc,{gen_server,call,
                 [{'stats_reader-@system','ns_1@10.3.2.153'},
                  {latest,"minute"}]}}}
[ns_server:debug,2012-10-25T7:04:41.271,ns_1@10.3.2.153:ns_heart<0.27845.1>:ns_heart:grab_local_xdcr_replications:282]Ignoring exception getting xdcr replication infos
{exit,{noproc,{gen_server,call,[xdc_replication_sup,which_children,infinity]}},
[{gen_server,call,3},
{xdc_replication_sup,all_local_replication_infos,0},
{ns_heart,grab_local_xdcr_replications,0},
{ns_heart,current_status,0},
{ns_heart,handle_info,2},
{gen_server,handle_msg,5},
{proc_lib,init_p_do_apply,3}]}
[ns_server:error,2012-10-25T7:04:41.271,ns_1@10.3.2.153:ns_heart<0.27845.1>:ns_heart:grab_samples_loading_tasks:294]Failed to grab samples loader tasks: {exit,
{noproc,
{gen_server,call,
[samples_loader_tasks,get_tasks,
2000]}},
[{gen_server,call,3},
{ns_heart,grab_samples_loading_tasks,0},
{ns_heart,current_status,0},
{ns_heart,handle_info,2},
{gen_server,handle_msg,5},
{proc_lib,init_p_do_apply,3}]}
[ns_server:debug,2012-10-25T7:04:46.269,ns_1@10.3.2.153:ns_heart<0.27845.1>:ns_heart:current_status:139]Ignoring failure to grab system stats:
{'EXIT',{noproc,{gen_server,call,
[{'stats_reader-@system','ns_1@10.3.2.153'},
{latest,"minute"}
]}}}
[ns_server:debug,2012-10-25T7:04:46.270,ns_1@10.3.2.153:ns_heart<0.27845.1>:ns_heart:grab_local_xdcr_replications:282]Ignoring exception getting xdcr replication infos
{exit,{noproc,{gen_server,call,[xdc_replication_sup,which_children,infinity]}},
[{gen_server,call,3},
{xdc_replication_sup,all_local_replication_infos,0},
{ns_heart,grab_local_xdcr_replications,0},
{ns_heart,current_status,0},
{ns_heart,handle_info,2},
{gen_server,handle_msg,5},
{proc_lib,init_p_do_apply,3}]}
[ns_server:error,2012-10-25T7:04:46.270,ns_1@10.3.2.153:ns_heart<0.27845.1>:ns_heart:grab_samples_loading_tasks:294]Failed to grab samples loader tasks: {exit,
{noproc,
{gen_server,call,
[samples_loader_tasks,get_tasks,
2000]}},
[{gen_server,call,3},
 {ns_heart,grab_samples_loading_tasks,0},
 {ns_heart,current_status,0},
 {ns_heart,handle_info,2},
 {gen_server,handle_msg,5},
 {proc_lib,init_p_do_apply,3}]}
[error_logger:error,2012-10-25T7:04:46.296,ns_1@10.3.2.153:error_logger<0.5.0>:ale_error_logger_handler:log_msg:76]** Generic server remote_clusters_info terminating
** Last message in was gc
** When Server state == {state,"/opt/couchbase/var/lib/couchbase/remote_clusters_cache",
                         {set,0,16,16,8,80,48,
                          {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},
                          {{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]}}}}
** Reason for termination ==
** {noproc,
    {gen_server,call,
     [xdc_rdoc_replication_srv,
      {foreach_doc,#Fun<xdc_rdoc_replication_srv.2.51124823>},
      infinity]}}
[error_logger:error,2012-10-25T7:04:46.314,ns_1@10.3.2.153:error_logger<0.5.0>:ale_error_logger_handler:log_report:72]
=========================CRASH REPORT=========================
crasher:
initial call: remote_clusters_info:init/1
pid: <0.27851.1>
registered_name: remote_clusters_info
exception exit: {noproc,
{gen_server,call,
[xdc_rdoc_replication_srv,
{foreach_doc,#Fun<xdc_rdoc_replication_srv.2.51124823>},
infinity]}}
in function gen_server:terminate/6
ancestors: [ns_server_sup,ns_server_cluster_sup,<0.59.0>]
messages: []
links: [<0.27811.1>]
dictionary: []
trap_exit: false
status: running
heap_size: 121393
stack_size: 24
reductions: 13648
neighbours:
[ns_server:warn,2012-10-25T7:04:48.177,ns_1@10.3.2.153:<0.27904.1>:supervisor_cushion:terminate:97]Cushioned process <0.27906.1> failed to terminate within 60000ms. Killing it brutally.
[ns_server:debug,2012-10-25T7:04:48.178,ns_1@10.3.2.153:<0.21657.2>:ns_port_server:terminate:141]Sending 'shutdown' to port
[ns_server:info,2012-10-25T7:04:48.179,ns_1@10.3.2.153:<0.21657.2>:ns_port_server:handle_info:104]Port server moxi exited with status 0
[ns_server:info,2012-10-25T7:04:48.179,ns_1@10.3.2.153:<0.21657.2>:ns_port_server:log:169]moxi<0.21657.2>: EOL on stdin. Exiting