Couchbase Server / MB-7014

nodes are not cleaned within a minute after removing from cluster


Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: 2.0
    • Affects Version/s: 2.0-beta-2
    • Component/s: ns_server
    • Security Level: Public
    • Labels: None

    Description

      Right now our cleanup method waits until the ejected nodes are fully cleaned, for example:

      for removed in [node for node in nodes if (node.id != master_id)]:
          removed.rest_password = servers[0].rest_password
          removed.rest_username = servers[0].rest_username
          rest = RestConnection(removed)
          start = time.time()
          # poll the ejected node until it reports no pools, i.e. it has reset itself
          while time.time() - start < 30:
              if len(rest.get_pools_info()["pools"]) == 0:
                  success_cleaned.append(removed)
                  break
              else:
                  time.sleep(0.1)
          if time.time() - start > 10:
              log.error("'pools' on node {0}:{1} - {2}".format(
                        removed.ip, removed.port, rest.get_pools_info()["pools"]))
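
      For comparison, here is a minimal self-contained sketch of the same wait with the timeout made configurable, assuming get_pools_info() above is effectively a GET on the node's /pools endpoint; the helper name, the requests package and the 120-second default are only illustrative, not testrunner code:

      # Illustrative sketch only: poll GET /pools on an ejected node until its
      # "pools" list is empty, with a configurable (longer) timeout.
      import time

      import requests


      def wait_until_node_cleaned(ip, port=8091, timeout=120, poll_interval=0.5):
          """Return True once GET http://<ip>:<port>/pools reports an empty pool list."""
          url = "http://{0}:{1}/pools".format(ip, port)
          start = time.time()
          while time.time() - start < timeout:
              try:
                  pools = requests.get(url, timeout=5).json().get("pools", [])
                  if len(pools) == 0:
                      return True      # the node has dropped its pool, i.e. it is cleaned
              except (requests.RequestException, ValueError):
                  pass                 # ns_server may be restarting on the node; just retry
              time.sleep(poll_interval)
          return False

      With a timeout above a minute this would also cover the delay shown in the logs below.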

      but 30 seconds is not enough, for instance:
      http://qa.hq.northscale.net/job/ubuntu-32-2.0-swaprebalance-tests-P0/305/consoleFull

      [2012-10-25 07:01:05,403] - [cluster_helper:199] INFO - rebalancing all nodes in order to remove nodes
      [2012-10-25 07:01:05,408] - [rest_client:881] INFO - rebalance params : password=password&ejectedNodes=ns_1%4010.3.2.153%2Cns_1%4010.3.2.158%2Cns_1%4010.3.2.155&user=Administrator&knownNodes=ns_1%4010.3.2.157%2Cns_1%4010.3.2.153%2Cns_1%4010.3.2.158%2Cns_1%4010.3.2.155
      [2012-10-25 07:01:05,419] - [rest_client:888] INFO - rebalance operation started
      [2012-10-25 07:01:05,430] - [rest_client:984] INFO - rebalance percentage : 0 %
      [2012-10-25 07:01:09,441] - [rest_client:941] INFO - rebalance progress took 4.02120304108 seconds
      [2012-10-25 07:01:09,441] - [rest_client:942] INFO - sleep for 4.02120304108 seconds after rebalance...
      [2012-10-25 07:01:43,577] - [cluster_helper:219] ERROR - 'pools' on node 10.3.2.153:8091 - [{u'streamingUri': u'/poolsStreaming/default?uuid=e9bfd0f8cffab71ff7cfc58b239d4b67', u'name': u'default', u'uri': u'/pools/default?uuid=e9bfd0f8cffab71ff7cfc58b239d4b67'}]
      [2012-10-25 07:01:43,609] - [cluster_helper:222] ERROR - node 10.3.2.155:8091 was not cleaned after removing from cluster
      ERROR

      ======================================================================
      ERROR: test_failover_swap_rebalance (swaprebalance.SwapRebalanceFailedTests)
      ----------------------------------------------------------------------
      Traceback (most recent call last):
      File "pytests/swaprebalance.py", line 585, in tearDown
      SwapRebalanceBase.common_tearDown(self)
      File "pytests/swaprebalance.py", line 65, in common_tearDown
      SwapRebalanceBase.reset(self)
      File "pytests/swaprebalance.py", line 76, in reset
      ClusterOperationHelper.cleanup_cluster([server])
      File "lib/membase/helper/cluster_helper.py", line 225, in cleanup_cluster
      raise Exception("not all ejected nodes were cleaned successfully")
      Exception: not all ejected nodes were cleaned successfully

      We remove the buckets, then rebalance out all the nodes, and then find that one node still has a pool.

      There are a lot of failures with the same error on cluster runs (buildbot).

      Alk, if we join an ejected node to the cluster immediately, then apparently our tests will fail with errors such as ["Prepare join failed. Timeout connecting...
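
      For context, the "rebalance out" step itself is a single REST call; the rough sketch below mirrors the knownNodes/ejectedNodes parameters in the rebalance-params log line above (the helper name, the basic-auth usage and the requests package are illustrative assumptions, not the testrunner code):

      # Illustrative sketch: start a rebalance that ejects the given nodes,
      # mirroring the knownNodes/ejectedNodes form parameters logged above.
      import requests


      def rebalance_out(master_ip, known_nodes, ejected_nodes,
                        user="Administrator", password="password", port=8091):
          """POST /controller/rebalance with comma-separated otpNode names."""
          resp = requests.post(
              "http://{0}:{1}/controller/rebalance".format(master_ip, port),
              auth=(user, password),
              data={"knownNodes": ",".join(known_nodes),
                    "ejectedNodes": ",".join(ejected_nodes)})
          resp.raise_for_status()


      # e.g., with the otpNode names from the rebalance-params line above:
      # rebalance_out("10.3.2.153",
      #               known_nodes=["ns_1@10.3.2.157", "ns_1@10.3.2.153",
      #                            "ns_1@10.3.2.158", "ns_1@10.3.2.155"],
      #               ejected_nodes=["ns_1@10.3.2.153", "ns_1@10.3.2.158",
      #                              "ns_1@10.3.2.155"])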

      Logs from 10.3.2.153 after we ejected it from the cluster:

      [user:info,2012-10-25T7:03:48.085,ns_1@10.3.2.153:<0.20935.2>:ns_orchestrator:idle:492]Starting rebalance, KeepNodes = ['ns_1@10.3.2.157'], EjectNodes = ['ns_1@10.3.2.153',
      'ns_1@10.3.2.158',
      'ns_1@10.3.2.155']

      [ns_server:debug,2012-10-25T7:03:48.085,ns_1@10.3.2.153:ns_config_rep<0.27824.1>:ns_config_rep:do_push_keys:317]Replicating some config keys ([{node,'ns_1@10.3.2.157',membership}]..)
      [ns_server:debug,2012-10-25T7:03:48.090,ns_1@10.3.2.153:ns_config_rep<0.27824.1>:ns_config_rep:do_push_keys:317]Replicating some config keys ([counters,rebalance_status,rebalancer_pid]..)
      [ns_server:debug,2012-10-25T7:03:48.093,ns_1@10.3.2.153:ns_config_log<0.344.0>:ns_config_log:log_common:111]config change:
      counters ->
      [{rebalance_start,14},{rebalance_success,13},{failover_node,10}]
      [ns_server:debug,2012-10-25T7:03:48.098,ns_1@10.3.2.153:ns_config_log<0.344.0>:ns_config_log:log_common:111]config change:
      rebalancer_pid ->
      <0.21667.2>
      [ns_server:debug,2012-10-25T7:03:48.100,ns_1@10.3.2.153:ns_config_log<0.344.0>:ns_config_log:log_common:111]config change:
      rebalance_status ->
      running
      [rebalance:debug,2012-10-25T7:03:48.104,ns_1@10.3.2.153:<0.21667.2>:ns_rebalancer:rebalance:231]BucketConfigs = []
      [cluster:debug,2012-10-25T7:03:48.115,ns_1@10.3.2.153:<0.21667.2>:ns_cluster:leave:216]Asking node 'ns_1@10.3.2.158' to leave the cluster
      [ns_server:debug,2012-10-25T7:03:48.115,ns_1@10.3.2.153:ns_config_log<0.344.0>:ns_config_log:log_common:111]config change:
      {node,'ns_1@10.3.2.158',membership} ->
      inactiveFailed
      [cluster:debug,2012-10-25T7:03:48.116,ns_1@10.3.2.153:<0.21667.2>:ns_cluster:shun:235]Shunning 'ns_1@10.3.2.158'
      [ns_server:debug,2012-10-25T7:03:48.116,ns_1@10.3.2.153:ns_config_rep<0.27824.1>:ns_config_rep:do_push_keys:317]Replicating some config keys ([{node,'ns_1@10.3.2.158',membership}]..)
      [cluster:debug,2012-10-25T7:03:48.117,ns_1@10.3.2.153:<0.21667.2>:ns_cluster:leave:216]Asking node 'ns_1@10.3.2.155' to leave the cluster
      [cluster:debug,2012-10-25T7:03:48.118,ns_1@10.3.2.153:<0.21667.2>:ns_cluster:shun:235]Shunning 'ns_1@10.3.2.155'
      [user:info,2012-10-25T7:03:48.119,ns_1@10.3.2.153:<0.20935.2>:ns_orchestrator:handle_info:309]Rebalance completed successfully.

      For almost a minute, 10.3.2.153 keeps trying to get various pieces of information from the other nodes:

      [ns_server:info,2012-10-25T7:04:35.938,ns_1@10.3.2.153:ns_config_rep<0.27824.1>:ns_config_rep:do_pull:341]Pulling config from: 'ns_1@10.3.2.157'

      [ns_server:debug,2012-10-25T7:04:36.270,ns_1@10.3.2.153:ns_heart<0.27845.1>:ns_heart:current_status:139]Ignoring failure to grab system stats:
      {'EXIT',{noproc,{gen_server,call,[{'stats_reader-@system','ns_1@10.3.2.153'},{latest,"minute"}]}}}

      [ns_server:debug,2012-10-25T7:04:36.270,ns_1@10.3.2.153:ns_heart<0.27845.1>:ns_heart:grab_local_xdcr_replications:282]Ignoring exception getting xdcr replication infos
      {exit,{noproc,{gen_server,call,[xdc_replication_sup,which_children,infinity]}},
      [{gen_server,call,3},
      {xdc_replication_sup,all_local_replication_infos,0},
      {ns_heart,grab_local_xdcr_replications,0},
      {ns_heart,current_status,0},
      {ns_heart,handle_info,2},
      {gen_server,handle_msg,5},
      {proc_lib,init_p_do_apply,3}]}
      [ns_server:error,2012-10-25T7:04:36.271,ns_1@10.3.2.153:ns_heart<0.27845.1>:ns_heart:grab_samples_loading_tasks:294]Failed to grab samples loader tasks: {exit,
      {noproc,
      {gen_server,call,
      [samples_loader_tasks,get_tasks,
      2000]}},
      [{gen_server,call,3},
      {ns_heart,grab_samples_loading_tasks,0},
      {ns_heart,current_status,0},
      {ns_heart,handle_info,2},
      {gen_server,handle_msg,5},
      {proc_lib,init_p_do_apply,3}]}
      [ns_server:debug,2012-10-25T7:04:41.269,ns_1@10.3.2.153:ns_heart<0.27845.1>:ns_heart:current_status:139]Ignoring failure to grab system stats:
      {'EXIT',{noproc,{gen_server,call,[{'stats_reader-@system','ns_1@10.3.2.153'},{latest,"minute"}]}}}

      [ns_server:debug,2012-10-25T7:04:41.271,ns_1@10.3.2.153:ns_heart<0.27845.1>:ns_heart:grab_local_xdcr_replications:282]Ignoring exception getting xdcr replication infos
      {exit,{noproc,{gen_server,call,[xdc_replication_sup,which_children,infinity]}},
      [{gen_server,call,3},
      {xdc_replication_sup,all_local_replication_infos,0},
      {ns_heart,grab_local_xdcr_replications,0},
      {ns_heart,current_status,0},
      {ns_heart,handle_info,2},
      {gen_server,handle_msg,5},
      {proc_lib,init_p_do_apply,3}]}
      [ns_server:error,2012-10-25T7:04:41.271,ns_1@10.3.2.153:ns_heart<0.27845.1>:ns_heart:grab_samples_loading_tasks:294]Failed to grab samples loader tasks: {exit,
      {noproc,
      {gen_server,call,
      [samples_loader_tasks,get_tasks,
      2000]}},
      [{gen_server,call,3},
      {ns_heart,grab_samples_loading_tasks,0},
      {ns_heart,current_status,0},
      {ns_heart,handle_info,2},
      {gen_server,handle_msg,5},
      {proc_lib,init_p_do_apply,3}]}
      [ns_server:debug,2012-10-25T7:04:46.269,ns_1@10.3.2.153:ns_heart<0.27845.1>:ns_heart:current_status:139]Ignoring failure to grab system stats:
      {'EXIT',{noproc,{gen_server,call,[{'stats_reader-@system','ns_1@10.3.2.153'},{latest,"minute"}]}}}

      [ns_server:debug,2012-10-25T7:04:46.270,ns_1@10.3.2.153:ns_heart<0.27845.1>:ns_heart:grab_local_xdcr_replications:282]Ignoring exception getting xdcr replication infos
      {exit,{noproc,{gen_server,call,[xdc_replication_sup,which_children,infinity]}},
      [{gen_server,call,3},
      {xdc_replication_sup,all_local_replication_infos,0},
      {ns_heart,grab_local_xdcr_replications,0},
      {ns_heart,current_status,0},
      {ns_heart,handle_info,2},
      {gen_server,handle_msg,5},
      {proc_lib,init_p_do_apply,3}]}
      [ns_server:error,2012-10-25T7:04:46.270,ns_1@10.3.2.153:ns_heart<0.27845.1>:ns_heart:grab_samples_loading_tasks:294]Failed to grab samples loader tasks: {exit,
      {noproc,
      {gen_server,call,
      [samples_loader_tasks,get_tasks,
      2000]}},
      [{gen_server,call,3},
      {ns_heart,grab_samples_loading_tasks,0},
      {ns_heart,current_status,0},
      {ns_heart,handle_info,2},
      {gen_server,handle_msg,5},
      {proc_lib,init_p_do_apply,3}]}
      [error_logger:error,2012-10-25T7:04:46.296,ns_1@10.3.2.153:error_logger<0.5.0>:ale_error_logger_handler:log_msg:76]** Generic server remote_clusters_info terminating
      ** Last message in was gc
      ** When Server state == {state,"/opt/couchbase/var/lib/couchbase/remote_clusters_cache",
                               {set,0,16,16,8,80,48,
                                {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},
                                {{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]}}}}
      ** Reason for termination ==
      ** {noproc,{gen_server,call,
                  [xdc_rdoc_replication_srv,
                   {foreach_doc,#Fun<xdc_rdoc_replication_srv.2.51124823>},
                   infinity]}}

      [error_logger:error,2012-10-25T7:04:46.314,ns_1@10.3.2.153:error_logger<0.5.0>:ale_error_logger_handler:log_report:72]
      =========================CRASH REPORT=========================
      crasher:
      initial call: remote_clusters_info:init/1
      pid: <0.27851.1>
      registered_name: remote_clusters_info
      exception exit: {noproc,
      {gen_server,call,
      [xdc_rdoc_replication_srv,
      {foreach_doc,#Fun<xdc_rdoc_replication_srv.2.51124823>},
      infinity]}}
      in function gen_server:terminate/6
      ancestors: [ns_server_sup,ns_server_cluster_sup,<0.59.0>]
      messages: []
      links: [<0.27811.1>]
      dictionary: []
      trap_exit: false
      status: running
      heap_size: 121393
      stack_size: 24
      reductions: 13648
      neighbours:

      [ns_server:warn,2012-10-25T7:04:48.177,ns_1@10.3.2.153:<0.27904.1>:supervisor_cushion:terminate:97]Cushioned process <0.27906.1> failed to terminate within 60000ms. Killing it brutally.
      [ns_server:debug,2012-10-25T7:04:48.178,ns_1@10.3.2.153:<0.21657.2>:ns_port_server:terminate:141]Sending 'shutdown' to port
      [ns_server:info,2012-10-25T7:04:48.179,ns_1@10.3.2.153:<0.21657.2>:ns_port_server:handle_info:104]Port server moxi exited with status 0
      [ns_server:info,2012-10-25T7:04:48.179,ns_1@10.3.2.153:<0.21657.2>:ns_port_server:log:169]moxi<0.21657.2>: EOL on stdin. Exiting

    People

      Assignee: Aleksey Kondratenko (alkondratenko)
      Reporter: Andrei Baranouski (andreibaranouski)
