
MB-7590: [system test] rebalance failed due to buckets shutting down on the orchestrator node

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 2.0.1
    • Fix Version/s: 2.0.1
    • Component/s: couchbase-bucket
    • Security Level: Public
    • Labels:
    • Environment:
Windows 2008 R2 64-bit

      Description

      Environment:

• 9 Windows 2008 R2 64-bit servers.
• Each server has 4 CPUs, 8 GB RAM, and an SSD disk.
• The cluster has 2 buckets, default and sasl, with consistent views enabled.
• Loaded 26 million items into the default bucket and 16 million items into the sasl bucket; each key is 128 to 512 bytes.
• Each bucket has one design doc with 2 views.
• Rebalanced out 2 nodes, 10.3.121.173 and 10.3.121.243 (see the REST sketch after the log excerpt below).

      Starting rebalance, KeepNodes = ['ns_1@10.3.3.181','ns_1@10.3.121.47',
      'ns_1@10.3.3.214','ns_1@10.3.3.182',
      'ns_1@10.3.3.180','ns_1@10.3.121.171',
      'ns_1@10.3.121.169'], EjectNodes = ['ns_1@10.3.121.173',
      'ns_1@10.3.121.243'] ns_orchestrator004 ns_1@10.3.121.169 23:26:03 - Tue Jan 22, 2013
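
For reference, a minimal sketch of driving the same rebalance through the cluster REST API (the node address and credentials are assumptions; the otpNode names come from the log entry above):

import requests  # assumes the 'requests' package is installed

ADMIN = ("Administrator", "password")   # assumed credentials
BASE = "http://10.3.121.169:8091"       # orchestrator node at the time

keep_nodes = [
    "ns_1@10.3.3.181", "ns_1@10.3.121.47", "ns_1@10.3.3.214",
    "ns_1@10.3.3.182", "ns_1@10.3.3.180", "ns_1@10.3.121.171",
    "ns_1@10.3.121.169",
]
eject_nodes = ["ns_1@10.3.121.173", "ns_1@10.3.121.243"]

# /controller/rebalance expects comma-separated otpNode names;
# knownNodes lists every node, ejectedNodes the ones to remove.
resp = requests.post(
    BASE + "/controller/rebalance",
    auth=ADMIN,
    data={
        "knownNodes": ",".join(keep_nodes + eject_nodes),
        "ejectedNodes": ",".join(eject_nodes),
    },
)
resp.raise_for_status()

# Poll overall progress; status reads "none" once the rebalance ends or fails.
print(requests.get(BASE + "/pools/default/rebalanceProgress", auth=ADMIN).json())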

• Rebalance failed because the buckets were shutting down on the orchestrator node.

[ns_server:debug,2013-01-23T8:29:27.672,ns_1@10.3.121.169:ns_config_log<0.803.0>:ns_config_log:log_common:111]config change:
rebalance_status ->
{none,<<"Rebalance stopped by janitor.">>}

      [user:info,2013-01-23T8:29:26.219,ns_1@10.3.121.169:ns_memcached-default<0.968.1>:ns_memcached:terminate:661]Shutting down bucket "default" on 'ns_1@10.3.121.169' for server shutdown
      [ns_server:error,2013-01-23T8:29:26.219,ns_1@10.3.121.169:timeout_diag_logger<0.699.0>:timeout_diag_logger:handle_call:104]
{<0.12009.70>,
 [{registered_name,[]},
  {status,waiting},
  {initial_call,{proc_lib,init_p,5}},
  {backtrace,[<<"Program counter: 0x04e7e1c8 (couch_file:reader_loop/3 + 116)">>,
              <<"CP: 0x00000000 (invalid)">>,<<"arity = 0">>,<<>>,
              <<"0x126e4ce4 Return addr 0x017a2da8 (proc_lib:init_p_do_apply/3 + 28)">>,
              <<"y(0) 10">>,<<"y(1) \"c:/data/sasl/109.couch.14\"">>,
              <<"y(2) []">>,<<>>,
              <<"0x126e4cf4 Return addr 0x00b409b4 (<terminate process normally>)">>,
              <<"y(0) Catch 0x017a2db8 (proc_lib:init_p_do_apply/3 + 44)">>,<<>>]},
  {error_handler,error_handler},
  {garbage_collection,[{min_bin_vheap_size,46368},
                       {min_heap_size,233},
                       {fullsweep_after,512},
                       {minor_gcs,403}]},
  {heap_size,377},
  {total_heap_size,754},
  {links,[<0.12008.70>]},
  {memory,3496},
  {message_queue_len,0},
  {reductions,216588},
  {trap_exit,true}]}

[ns_server:debug,2013-01-23T8:29:27.313,ns_1@10.3.121.169:<0.835.0>:ns_pubsub:do_subscribe_link:132]Parent process of subscription
{buckets_events,<0.833.0>}
exited with reason {shutdown,
                    {gen_server,call,
                     ['ns_vbm_new_sup-sasl',which_children,infinity]}}
      [ns_server:debug,2013-01-23T8:29:27.313,ns_1@10.3.121.169:ns_config_log<0.803.0>:ns_config_log:log_common:111]config change:
      rebalancer_pid ->
      undefined
      [ns_server:debug,2013-01-23T8:29:27.329,ns_1@10.3.121.169:capi_set_view_manager-sasl<0.8923.0>:capi_set_view_manager:handle_info:349]doing replicate_newnodes_docs
      [user:info,2013-01-23T8:29:27.329,ns_1@10.3.121.169:ns_memcached-sasl<0.8955.0>:ns_memcached:terminate:661]Shutting down bucket "sasl" on 'ns_1@10.3.121.169' for server shutdown
[ns_server:debug,2013-01-23T8:29:27.344,ns_1@10.3.121.169:ns_config_log<0.803.0>:ns_config_log:log_common:111]config change:
auto_failover_cfg ->
[{enabled,false},
 {timeout,30},
 {max_nodes,1},
 {count,0}]
      [ns_server:debug,2013-01-23T8:29:27.360,ns_1@10.3.121.169:ns_config_rep<0.31635.76>:ns_config_rep:do_push_keys:317]Replicating some config keys ([auto_failover_cfg,autocompaction,buckets,
      cluster_compat_version,counters,
      dynamic_config_version]..)
      [ns_server:debug,2013-01-23T8:29:27.360,ns_1@10.3.121.169:capi_set_view_manager-sasl<0.8923.0>:capi_set_view_manager:handle_info:349]doing replicate_newnodes_docs
      [ns_server:error,2013-01-23T8:29:27.360,ns_1@10.3.121.169:timeout_diag_logger<0.699.0>:timeout_diag_logger:handle_call:104]
      {<0.10831.67>,
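
The auto_failover_cfg entry above maps to the /settings/autoFailover REST endpoint; a minimal sketch of reading it back (host and credentials are assumptions):

import requests

ADMIN = ("Administrator", "password")   # assumed credentials
BASE = "http://10.3.121.169:8091"

# Expected to mirror the config change in the log:
# {'enabled': False, 'timeout': 30, 'count': 0}
print(requests.get(BASE + "/settings/autoFailover", auth=ADMIN).json())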

• Memcached logs from around the time the rebalance failed:

      Wed Jan 23 08:29:27.208484 Pacific Standard Time 3: TAP (Consumer) eq_tapq:anon_18 - disconnected
      Wed Jan 23 08:29:27.286609 Pacific Standard Time 3: TAP (Consumer) eq_tapq:anon_20 - disconnected
      Wed Jan 23 08:29:28.145984 Pacific Standard Time 3: Schedule cleanup of "eq_tapq:anon_17"
      Wed Jan 23 08:29:28.161609 Pacific Standard Time 3: Schedule cleanup of "eq_tapq:anon_18"
      Wed Jan 23 08:29:28.161609 Pacific Standard Time 3: Schedule cleanup of "eq_tapq:anon_19"
      Wed Jan 23 08:29:28.161609 Pacific Standard Time 3: Schedule cleanup of "eq_tapq:anon_20"
      Wed Jan 23 08:29:28.177234 Pacific Standard Time 3: Schedule cleanup of "eq_tapq:anon_21"
      Wed Jan 23 08:29:28.177234 Pacific Standard Time 3: Schedule cleanup of "eq_tapq:anon_22"
      Wed Jan 23 08:29:28.192859 Pacific Standard Time 3: Schedule cleanup of "eq_tapq:anon_23"
      Wed Jan 23 08:29:28.208484 Pacific Standard Time 3: Schedule cleanup of "eq_tapq:anon_24"
      Wed Jan 23 08:29:29.005359 Pacific Standard Time 3: Shutting down tap connections!
      Wed Jan 23 08:29:29.005359 Pacific Standard Time 3: Schedule cleanup of "eq_tapq:replication_ns_1@10.3.121.171"
      Wed Jan 23 08:29:29.083484 Pacific Standard Time 3: Schedule cleanup of "eq_tapq:replication_ns_1@10.3.3.182"
      Wed Jan 23 08:29:29.083484 Pacific Standard Time 3: Failed to notify thread: Unknown error
      Wed Jan 23 08:29:29.114734 Pacific Standard Time 3: Schedule cleanup of "eq_tapq:replication_ns_1@10.3.121.47"
      Wed Jan 23 08:29:29.114734 Pacific Standard Time 3: TAP (Producer) eq_tapq:replication_ns_1@10.3.121.171 - Clear the tap queues by force
      Wed Jan 23 08:29:29.114734 Pacific Standard Time 3: Schedule cleanup of "eq_tapq:replication_ns_1@10.3.3.214"
      Wed Jan 23 08:29:29.114734 Pacific Standard Time 3: Failed to notify thread: Unknown error
      Wed Jan 23 08:29:29.114734 Pacific Standard Time 3: Schedule cleanup of "eq_tapq:replication_ns_1@10.3.3.180"
      Wed Jan 23 08:29:29.114734 Pacific Standard Time 3: TAP (Producer) eq_tapq:replication_ns_1@10.3.3.182 - Clear the tap queues by force
      Wed Jan 23 08:29:29.114734 Pacific Standard Time 3: Schedule cleanup of "eq_tapq:replication_ns_1@10.3.3.181"
      Wed Jan 23 08:29:29.130359 Pacific Standard Time 3: Failed to notify thread: Unknown error
      Wed Jan 23 08:29:29.130359 Pacific Standard Time 3: TAP (Producer) eq_tapq:replication_ns_1@10.3.121.47 - Clear the tap queues by force
      Wed Jan 23 08:29:29.130359 Pacific Standard Time 3: TAP (Producer) eq_tapq:replication_ns_1@10.3.3.214 - Clear the tap queues by force
      Wed Jan 23 08:29:29.130359 Pacific Standard Time 3: Failed to notify thread: Unknown error
      Wed Jan 23 08:29:29.130359 Pacific Standard Time 3: TAP (Producer) eq_tapq:replication_ns_1@10.3.3.180 - Clear the tap queues by force
      Wed Jan 23 08:29:29.130359 Pacific Standard Time 3: TAP (Producer) eq_tapq:replication_ns_1@10.3.3.181 - Clear the tap queues by force
      Wed Jan 23 08:29:42.130359 Pacific Standard Time 3: Had to wait 12 s for shutdown
      Wed Jan 23 08:30:01.442859 Pacific Standard Time 3: Shutting down tap connections!
      Wed Jan 23 08:30:01.442859 Pacific Standard Time 3: Schedule cleanup of "eq_tapq:replication_ns_1@10.3.121.47"
      Wed Jan 23 08:30:01.505359 Pacific Standard Time 3: Schedule cleanup of "eq_tapq:replication_ns_1@10.3.121.171"
      Wed Jan 23 08:30:01.505359 Pacific Standard Time 3: TAP (Producer) eq_tapq:replication_ns_1@10.3.121.47 - Clear the tap queues by force
      Wed Jan 23 08:30:01.505359 Pacific Standard Time 3: Schedule cleanup of "eq_tapq:replication_ns_1@10.3.3.181"
      Wed Jan 23 08:30:01.505359 Pacific Standard Time 3: TAP (Producer) eq_tapq:replication_ns_1@10.3.121.171 - Clear the tap queues by force
      Wed Jan 23 08:30:01.520984 Pacific Standard Time 3: Schedule cleanup of "eq_tapq:replication_ns_1@10.3.3.214"
      Wed Jan 23 08:30:01.520984 Pacific Standard Time 3: TAP (Producer) eq_tapq:replication_ns_1@10.3.3.181 - Clear the tap queues by force
      Wed Jan 23 08:30:01.520984 Pacific Standard Time 3: Schedule cleanup of "eq_tapq:replication_ns_1@10.3.3.180"
      Wed Jan 23 08:30:01.536609 Pacific Standard Time 3: Schedule cleanup of "eq_tapq:replication_ns_1@10.3.3.182"
      Wed Jan 23 08:30:01.520984 Pacific Standard Time 3: TAP (Producer) eq_tapq:replication_ns_1@10.3.3.214 - Clear the tap queues by force
      Wed Jan 23 08:30:01.536609 Pacific Standard Time 3: TAP (Producer) eq_tapq:replication_ns_1@10.3.3.180 - Clear the tap queues by force
      Wed Jan 23 08:30:01.536609 Pacific Standard Time 3: TAP (Producer) eq_tapq:replication_ns_1@10.3.3.182 - Clear the tap queues by force
      Wed Jan 23 08:30:16.536609 Pacific Standard Time 3: Had to wait 15 s for shutdown
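
A minimal sketch (assuming the cbstats tool shipped with the server is on PATH) of inspecting the replication TAP connections seen above while the rebalance is running:

import subprocess

node = "10.3.121.169"
# 'cbstats <host>:11210 tap' dumps TAP connection stats for the default bucket;
# sasl-authenticated buckets additionally need the bucket credentials.
out = subprocess.run(
    ["cbstats", f"{node}:11210", "tap"],
    capture_output=True, text=True, check=True,
).stdout

# Show only the replication TAP producers and their counters.
for line in out.splitlines():
    if "replication_ns_1@" in line:
        print(line.strip())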

Link to the manifest file of this build: http://builds.hq.northscale.net/latestbuilds/couchbase-server-enterprise_x86_64_2.0.1-140-rel.setup.exe.manifest.xml

        Issue Links


          Activity

          thuan Thuan Nguyen added a comment -

          Integrated in github-ep-engine-2-0 #492 (See http://qa.hq.northscale.net/job/github-ep-engine-2-0/492/)
          MB-7590: Remove the mutation key log (Revision 0ed6d09a1074f93f5626cfe0c6ab29f2b605009c)

          Result = SUCCESS
Mike Wiederhold:
Files:

          • src/warmup.h
          • docs/stats.org
          • management/cbstats
          • src/ep_engine.cc
          • src/ep.h
          • src/warmup.cc
          • docs/engine-params.org
          • src/iomanager/iomanager.cc
          • src/tasks.cc
          • src/ep.cc
          • src/tasks.h
          • tests/ep_testsuite.cc
          • src/iomanager/iomanager.h
          • configuration.json
          • management/cbepctl
farshid Farshid Ghods (Inactive) added a comment - edited

MB-7658
          farshid Farshid Ghods (Inactive) added a comment -

Per bug scrub with Alk and Siri, Tony is going to set up a 4-node Windows physical cluster to see whether these issues are reproducible in that environment.

thuan Thuan Nguyen added a comment -

Link to the diags of the 2 nodes, 169 and 171: https://s3.amazonaws.com/packages.couchbase/collect_info/2_0_1/201301/2nodes-201-140-reb-failed-resetting-since-not-running-20130128-173700.tgz
thuan Thuan Nguyen added a comment - edited
• At 2013-01-24T10:58:55, the orchestrator on node 171 was terminated due to a timeout:

[error_logger:error,2013-01-24T10:58:55.308,ns_1@10.3.121.171:error_logger<0.6.0>:ale_error_logger_handler:log_msg:76]** State machine mb_master terminating
** Last message in was send_heartbeat
** When State == master
** Data == {state,<0.26266.225>,'ns_1@10.3.121.171',
            ['ns_1@10.3.121.169','ns_1@10.3.121.171',
             'ns_1@10.3.121.243','ns_1@10.3.121.47',
             'ns_1@10.3.3.180','ns_1@10.3.3.181',
             'ns_1@10.3.3.182','ns_1@10.3.3.214'],
            {1359,53929,277889}}
** Reason for termination =
** {timeout,{gen_server,call,[ns_node_disco,nodes_wanted]}}

[couchdb:error,2013-01-24T10:58:55.386,ns_1@10.3.121.171:<0.28624.288>:couch_log:error:42]attempted upload of invalid JSON (set log_level to debug to log it)
[ns_server:error,2013-01-24T10:58:55.386,ns_1@10.3.121.171:<0.12056.0>:ns_memcached:verify_report_long_call:297]call {stats,<<>>} took too long: 4750000 us
[ns_server:error,2013-01-24T10:58:55.402,ns_1@10.3.121.171:ns_memcached-sasl<0.12049.0>:ns_memcached:handle_info:630]handle_info(ensure_bucket,..) took too long: 547000 us
[ns_server:error,2013-01-24T10:58:55.402,ns_1@10.3.121.171:<0.317.1>:ns_memcached:verify_report_long_call:297]call {stats,<<>>} took too long: 547000 us
[error_logger:error,2013-01-24T10:58:55.418,ns_1@10.3.121.171:error_logger<0.6.0>:ale_error_logger_handler:log_report:72]
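
A minimal sketch of locating these master transitions in the extracted collect_info diags (the directory name and log filenames are assumptions based on the link above):

import re
from pathlib import Path

# Assumed local path of the extracted diags tarball.
DIAG_DIR = Path("2nodes-201-140-reb-failed-resetting-since-not-running-20130128-173700")

pattern = re.compile(r"mb_master terminating|taking over|Resetting rebalance status")

# Scan every ns_server log file for the relevant events.
for log in sorted(DIAG_DIR.rglob("ns_server.*")):
    if not log.is_file():
        continue
    for line in log.read_text(errors="replace").splitlines():
        if pattern.search(line):
            print(f"{log.name}: {line}")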

• A few seconds later, at 2013-01-24T10:58:56, node 169 took over as master:

          [ns_server:info,2013-01-24T10:58:56.360,ns_1@10.3.121.169:ns_log<0.32472.76>:ns_log:handle_cast:125]suppressing duplicate log menelaus_web_alerts_srv:1("Metadata overhead warning. Over 56% of RAM allocated to bucket \"sasl\" on node \"10.3.121.169\" is taken up by keys and metadata.") because it's been seen 2 times in the past 6.000008 secs (last seen 3.000008 secs ago
          [user:info,2013-01-24T10:58:58.563,ns_1@10.3.121.169:mb_master<0.18378.202>:mb_master:handle_info:219]Haven't heard from a higher priority node or a master, so I'm taking over.
          [c

• 11 minutes later, the orchestrator on node 169 killed the rebalance process:

          [user:info,2013-01-24T11:10:49.782,ns_1@10.3.121.169:ns_config<0.796.0>:ns_janitor:maybe_stop_rebalance_status:147]Resetting rebalance status since it's not really running
          [ns_server:debug,2013-01-24T11:10:49.782,ns_1@10.3.121.169:capi_set_view_manager-sasl<0.32642.76>:capi_set_view_manager:handle_info:349]doing replicate_newnodes_docs
          [ns_server:debug,2013-01-24T11:10:49.782,ns_1@10.3.121.169:capi_set_view_manager-default<0.32645.76>:capi_set_view_manager:handle_info:349]doing replicate_newnodes_docs
          [ns_server:debug,2013-01-24T11:10:49.782,ns_1@10.3.121.169:ns_config_rep<0.18336.202>:ns_config_rep:do_push_keys:317]Replicating some config keys ([rebalance_status,rebalancer_pid]..)
          [ns_server:debug,2013-01-24T11:10:49.797,ns_1@10.3.121.169:capi_set_view_manager-sasl<0.32642.76>:capi_set_view_manager:handle_info:349]doing replicate_newnodes_docs
          [ns_server:debug,2013-01-24T11:10:49.797,ns_1@10.3.121.169:ns_config_log<0.803.0>:ns_config_log:log_common:111]config change:

• The orchestrator may have stopped the rebalance because the bucket was not yet ready on node 171:

          [ns_server:debug,2013-01-24T11:10:44.594,ns_1@10.3.121.169:<0.1900.259>:janitor_agent:new_style_query_vbucket_states_loop_next_step:121]Waiting for "default" on 'ns_1@10.3.121.171'
          [couchdb:error,2013-01-24T11:10:44.610,ns_1@10.3.121.169:<0.1211.259>:couch_log:error:42]attempted upload of invalid JSON (set log_level to debug to log it)
          [ns_server:error,2013-01-24T11:10:44.641,ns_1@10.3.121.169:<0.1891.259>:ns_janitor:cleanup_with_states:92]Bucket "default" not yet ready on ['ns_1@10.3.121.171']
          [couchdb:error,2013-01-24T11:10:44.672,ns_1@10.3.121.169:<0.2025.259>:couch_log:error:42]attempted upload of invalid JSON (set log_level to debug to log it)
          [ns_server:info,2013-01-24T11:10:44.735,ns_1@10.3.121.169:<0.2436.259>:ns_janitor:do_sanify_chain:258]Setting vbucket 114 in "sasl" on 'ns_1@10.3.121.171' from dead to active.
          [ns_server:info,2013-01-24T11:10:44.735,ns_1@10.3.121.169:<0.2436.259>:ns_janitor:do_sanify_chain:258]Setting vbucket 115 in "sasl" on 'ns_1@10.3.121.171' from dead to active.
          [ns_server:info,2013-01-24T11:10:44.735,ns_1@10.3.121.169:<0.2436.259>:ns_janitor:do_sanify_chain:258]Setting vbucket 116 in "sasl" on 'ns_1@10.3.121.171' from dead to active.
          [ns_server:info,2013-01-24T11:10:44.735,ns_1@10.3.121.169:<0.2436.259>:ns_janitor:do_sanify_chain:258]Setting vbucket 117 in "sasl" on 'ns_1@10.3.121.171' from dead to active.
          [ns_server:info,2013-01-24T11:10:44.735,ns_1@10.3.121.169:<0.2436.259>:ns_janitor:do_sanify_chain:258]Setting vbucket 118 in "sasl" on 'ns_1@10.3.121.171' from dead to active.
          [ns_server:info,2013-01-24T11:10:44.735,ns_1@10.3.121.169:<0.2436.259>:ns_janitor:do_sanify_chain:258]Setting vbucket 119 in "sasl" on 'ns_1@10.3.121.171' from dead to active.
          [ns_server:info,2013-01-24T11:10:44.735,ns_1@10.3.121.169:<0.2436.259>:ns_janitor:do_sanify_chain:258]Setting vbucket 120 in "sasl" on 'ns_1@10.3.121.171' from dead to active.
          [ns_server:info,2013-01-24T11:10:44.735,ns_1@10.3.121.169:<0.2436.259>:ns_janitor:do_sanify_chain:258]Setting vbucket 121 in "sasl" on 'ns_1@10.3.121.171' from dead to active.
          [ns_server:info,2013-01-24T11:10:44.735,ns_1@10.3.121.169:<0.2436.259>:ns_janitor:do_sanify_chain:258]Setting vbucket 122 in "sasl" on 'ns_1@10.3.121.171' from dead to active.
          [ns_server:info,2013-01-24T11:10:44.735,ns_1@10.3.121.169:<0.2436.259>:ns_janitor:do_sanify_chain:258]Setting vbucket 123 in "sasl" on 'ns_1@10.3.121.171' from dead to active.
          [ns_server:info,2013-01-24T11:10:44.735,ns_1@10.3.121.169:<0.2436.259>:ns_janitor:do_sanify_chain:258]Setting vbucket 124 in "sasl" on 'ns_1@10.3.121.171' from dead to active.
          [ns_server:info,2013-01-24T11:10:44.735,ns_1@10.3.121.169:<0.2436.259>:ns_janitor:do_sanify_chain:258]Setting vbucket 125 in "sasl" on 'ns_1@10.3.121.171' from dead to active.
          [ns_server:info,2013-01-24T11:10:44.735,ns_1@10.3.121.169:<0.2436.259>:ns_janitor:do_sanify_chain:258]Setting vbucket 126 in "sasl" on 'ns_1@10.3.121.171' from dead to active.
          [ns_server:info,2013-01-24T11:10:44.735,ns_1@10.3.121.169:<0.2436.259>:ns_janitor:do_sanify_chain:258]Setting vbucket 127 in "sasl" on 'ns_1@10.3.121.171' from dead to active.
          [ns_server:info,2013-01-24T11:10:44.735,ns_1@10.3.121.169:<0.2436.259>:ns_janitor:do_sanify_chain:258]Setting vbucket 128 in "sasl" on 'ns_1@10.3.121.171' from dead to active.
          [ns_server:info,2013-01-24T11:10:44.735,ns_1@10.3.121.169:<0.2436.259>:ns_janitor:do_sanify_chain:258]Setting vbucket 134 in "sasl" on 'ns_1@10.3.121.171' from dead to active.
          [ns_server:info,2013-01-24T11:10:44.751,ns_1@10.3.121.169:<0.2436.259>:ns_janitor:do_sanify_chain:258]Setting vbucket 135 in "sasl" on 'ns_1@10.3.121.171' from dead to active.
          [ns_server:info,2013-01-24T11:10:44.751,ns_1@10.3.121.169:<0.2436.259>:ns_janitor:do_sanify_chain:258]Setting vbucket 136 in "sasl" on 'ns_1@10.3.121.171' from dead to active.
          [ns_server:info,2013-01-24T11:10:44.751,ns_1@10.3.121.169:<0.2436.259>:ns_janitor:do_sanify_chain:258]Setting vbucket 137 in "sasl" on 'ns_1@10.3.121.171' from dead to active.
          [ns_server:info,2013-01-24T11:10:44.751,ns_1@10.3.121.169:<0.2436.259>:ns_janitor:do_sanify_chain:258]Setting vbucket 138 in "sasl" on 'ns_1@10.3.121.171' from dead to active.
          [ns_server:info,2013-01-24T11:10:44.751,ns_1@10.3.121.169:<0.2436.259>:ns_janitor:do_sanify_chain:258]Setting vbucket 139 in "sasl" on 'ns_1@10.3.121.171' from dead to active.
          [ns_server:info,2013-01-24T11:10:44.751,ns_1@10.3.121.169:<0.2436.259>:ns_janitor:do_sanify_chain:258]Setting vbucket 140 in "sasl" on 'ns_1@10.3.121.171' from dead to active.
          [ns_server:info,2013-01-24T11:10
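
A minimal sketch of confirming the vbucket states the janitor is repairing on node 171 (the bucket credentials and cbstats flags are assumptions; adjust for the actual setup):

import subprocess

# 'cbstats <host>:11210 vbucket' lists each vbucket and its state.
out = subprocess.run(
    ["cbstats", "10.3.121.171:11210", "vbucket", "-b", "sasl", "-p", "password"],
    capture_output=True, text=True, check=True,
).stdout

# After the janitor pass these vbuckets should all read 'active'.
dead = [line for line in out.splitlines() if line.strip().endswith("dead")]
print(f"{len(dead)} vbuckets still dead on 10.3.121.171")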


            People

• Assignee: thuan Thuan Nguyen
• Reporter: thuan Thuan Nguyen

