Details
- Type: Bug
- Resolution: Fixed
- Priority: Blocker
- Affects Version: 2.0
- Security Level: Public
- Environment: CentOS 6.2 64-bit, build 2.0.0-1908
Description
Cluster information:
- 8 CentOS 6.2 64-bit servers, each with a 4-core CPU
- Each server has 32 GB RAM and a 400 GB SSD.
- 24.8 GB RAM is allocated to Couchbase Server on each node.
- The SSD is formatted ext4 and mounted on /data.
- Each server has its own SSD; no disk is shared between servers.
- Created a cluster from the 6 nodes listed below, running Couchbase Server 2.0.0-1908.
- Link to manifest file: http://builds.hq.northscale.net/latestbuilds/couchbase-server-enterprise_x86_64_2.0.0-1908-rel.rpm.manifest.xml
- The cluster has 2 buckets: default (12 GB, 2 replicas) and saslbucket (12 GB, 1 replica).
- Each bucket has one design doc with 2 views (d1 on default, d11 on saslbucket).
10.6.2.37
10.6.2.38
10.6.2.44
10.6.2.45
10.6.2.42
10.6.2.43
- Loaded 16 million items into the default bucket and 20 million items into saslbucket; each item is 512 to 1024 bytes.
- After loading completed, waited for initial indexing. Disabled view compaction.
- After initial indexing was done, mutated all items to sizes from 1024 to 1512 bytes.
- Queried all 4 views from the 2 design docs.
- Performed a swap rebalance: removed nodes .39 and .40, added nodes .44 and .45.
- At the end of rebalancing saslbucket, the rebalance exited with a timeout on node .43.
- Then saw many connection resets to mccouch; updated bug MB-7046. Killed all loads pointing to this cluster; node .43 did not return to a stable state.
- Erlang on node .43 appeared to be hung. Killed beam.smp and restarted Couchbase Server on node .43 (will update this bug soon).
- Waited for node .43 to come fully up and for the disk write queue and drain rate to drop to 0.
- Started queries and item mutations on both buckets, waited a few minutes, then started a swap rebalance again.
- Rebalance failed with this error:
2012-10-30 17:52:37.698 ns_rebalancer:0:info:message(ns_1@10.6.2.37) - Started rebalancing bucket saslbucket
2012-10-30 17:52:38.722 ns_vbucket_mover:0:info:message(ns_1@10.6.2.37) - Bucket "saslbucket" rebalance appears to be swap rebalance
2012-10-30 17:52:38.968 ns_rebalancer:0:info:message(ns_1@10.6.2.37) - Started rebalancing bucket default
2012-10-30 17:52:40.592 ns_vbucket_mover:0:info:message(ns_1@10.6.2.37) - Bucket "default" rebalance does not seem to be swap rebalance
2012-10-30 17:52:48.027 ns_vbucket_mover:0:critical:message(ns_1@10.6.2.37) - <0.4774.771> exited with {noproc,
  {gen_server,call,
    [{if_rebalance,<0.4746.771>,
      {update_vbucket_state,61,active,paused,undefined}},
     infinity]}}
Looking at the diag log of node .37, we see the error "Partition 857 not in active nor passive set":
[ns_server:error,2012-10-30T17:52:45.760,ns_1@10.6.2.37:<0.20186.0>:ns_memcached:verify_report_long_call:297]call topkeys took too long: 589040 us
[ns_server:error,2012-10-30T17:52:46.835,ns_1@10.6.2.37:<0.16711.0>:ns_memcached:verify_report_long_call:297]call topkeys took too long: 963763 us
[rebalance:info,2012-10-30T17:52:47.098,ns_1@10.6.2.37:<0.4781.771>:janitor_agent:set_vbucket_state:419]Doing vbucket 530 state change: {'ns_1@10.6.2.40',active,paused,undefined}
[rebalance:info,2012-10-30T17:52:47.153,ns_1@10.6.2.37:<0.4781.771>:janitor_agent:get_replication_persistence_checkpoint_id:470]default: Doing get_replication_persistence_checkpoint_id call for vbucket 530 on ns_1@10.6.2.40
[rebalance:debug,2012-10-30T17:52:47.162,ns_1@10.6.2.37:<0.6121.771>:janitor_agent:handle_call:651]Going to wait for persistence of checkpoint 44 in vbucket 530
[ns_server:debug,2012-10-30T17:52:47.203,ns_1@10.6.2.37:janitor_agent-default<0.16716.0>:janitor_agent:handle_info:682]Got done message from subprocess: <0.6121.771> (ok)
[rebalance:info,2012-10-30T17:52:47.206,ns_1@10.6.2.37:<0.6125.771>:janitor_agent:wait_index_updated:459]default: Doing wait_index_updated call for ns_1@10.6.2.37 (vbucket 530)
[ns_server:debug,2012-10-30T17:52:47.225,ns_1@10.6.2.37:<0.6127.771>:capi_set_view_manager:do_wait_index_updated:608]References to wait: Ref<0.0.3553.75720> ("default", 530)
[ns_server:info,2012-10-30T17:52:47.482,ns_1@10.6.2.37:ns_config_rep<0.758.0>:ns_config_rep:do_pull:341]Pulling config from: 'ns_1@10.6.2.40'
[rebalance:info,2012-10-30T17:52:47.626,ns_1@10.6.2.37:<0.4779.771>:janitor_agent:set_vbucket_state:419]Doing vbucket 857 state change: {'ns_1@10.6.2.43',active,paused,undefined}
[rebalance:info,2012-10-30T17:52:47.709,ns_1@10.6.2.37:<0.4779.771>:janitor_agent:get_replication_persistence_checkpoint_id:470]default: Doing get_replication_persistence_checkpoint_id call for vbucket 857 on ns_1@10.6.2.43
[rebalance:debug,2012-10-30T17:52:47.739,ns_1@10.6.2.37:<0.6284.771>:janitor_agent:handle_call:651]Going to wait for persistence of checkpoint 42 in vbucket 857
[ns_server:debug,2012-10-30T17:52:47.753,ns_1@10.6.2.37:janitor_agent-default<0.16716.0>:janitor_agent:handle_info:682]Got done message from subprocess: <0.6284.771> (ok)
[rebalance:info,2012-10-30T17:52:47.758,ns_1@10.6.2.37:<0.6286.771>:janitor_agent:wait_index_updated:459]default: Doing wait_index_updated call for ns_1@10.6.2.37 (vbucket 857)
[ns_server:debug,2012-10-30T17:52:47.772,ns_1@10.6.2.37:<0.16695.0>:ns_pubsub:do_subscribe_link:132]Parent process of subscription {ns_config_events,<0.16694.0>} exited with reason {{badmatch,
{error,
{error, <<"Partition 857 not in active nor passive set">>}}},
[{capi_set_view_manager, handle_call, 3},
{gen_server, handle_msg, 5},
{gen_server, init_it, 6},
{proc_lib, init_p_do_apply, 3}]}
[ns_server:debug,2012-10-30T17:52:47.772,ns_1@10.6.2.37:<0.6127.771>:capi_set_view_manager:do_wait_index_updated:621]Parent died: {'DOWN',#Ref<0.0.3553.75791>,process,<0.16694.0>,
{{badmatch,
{error,
{error, <<"Partition 857 not in active nor passive set">>}}},
[{capi_set_view_manager,handle_call,3},
{gen_server,handle_msg,5},
{gen_server,init_it,6},
{proc_lib,init_p_do_apply,3}]}}
[error_logger:error,2012-10-30T17:52:47.775,ns_1@10.6.2.37:error_logger<0.5.0>:ale_error_logger_handler:log_report:72]
=========================CRASH REPORT=========================
crasher:
initial call: erlang:apply/2
pid: <0.6297.771>
registered_name: []
exception throw: {error,<<"Partition 857 not in active nor passive set">>}
in function couch_set_view:monitor_partition_update/3
in call from capi_set_view_manager:'
in call from lists:foldl/3
in call from capi_set_view_manager:do_wait_index_updated/4
ancestors: ['capi_set_view_manager-default','single_bucket_sup-default',
<0.16686.0>]
messages: []
links: [<0.16694.0>]
dictionary: []
trap_exit: false
status: running
heap_size: 6765
stack_size: 24
reductions: 229
neighbours:
[ns_server:error,2012-10-30T17:52:47.778,ns_1@10.6.2.37:<0.4781.771>:ns_single_vbucket_mover:spawn_and_wait:81]Got unexpected exit signal {'EXIT',<0.6125.771>,
{{{{badmatch,
{error,
{error, <<"Partition 857 not in active nor passive set">>}}},
[{capi_set_view_manager,handle_call,3},
{gen_server,handle_msg,5},
{gen_server,init_it,6},
{proc_lib,init_p_do_apply,3}]},
{gen_server,call,
['capi_set_view_manager-default',
{wait_index_updated,857},
infinity]}},
{gen_server,call,
[{'janitor_agent-default','ns_1@10.6.2.37'},
{if_rebalance,<0.4746.771>,
{wait_index_updated,530}},
infinity]}}}
[ns_server:debug,2012-10-30T17:52:47.804,ns_1@10.6.2.37:<0.16705.0>:ns_pubsub:do_subscribe_link:132]Parent process of subscription exited with reason {{badmatch,
  {error,
    {error, <<"Partition 857 not in active nor passive set">>}}},
  [{capi_set_view_manager,handle_call,3},
   {gen_server,handle_msg,5},
   {gen_server,init_it,6},
   {proc_lib,init_p_do_apply,3}]}
[ns_server:debug,2012-10-30T17:52:47.804,ns_1@10.6.2.37:<0.16704.0>:ns_pubsub:do_subscribe_link:132]Parent process of subscription exited with reason {{badmatch,
  {error,
    {error, <<"Partition 857 not in active nor passive set">>}}},
  [{capi_set_view_manager,handle_call,3},
   {gen_server,handle_msg,5},
   {gen_server,init_it,6},
   {proc_lib,init_p_do_apply,3}]}
[ns_server:error,2012-10-30T17:52:47.808,ns_1@10.6.2.37:<0.4779.771>:ns_single_vbucket_mover:spawn_and_wait:81]Got unexpected exit signal {'EXIT',<0.6286.771>,
  {{{{badmatch,
      {error,
        {error, <<"Partition 857 not in active nor passive set">>}}},
     [{capi_set_view_manager,handle_call,3},
      {gen_server,handle_msg,5},
      {gen_server,init_it,6},
      {proc_lib,init_p_do_apply,3}]},
    {gen_server,call,
      ['capi_set_view_manager-default',
       {wait_index_updated,857},
       infinity]}},
   {gen_server,call,
     [{'janitor_agent-default','ns_1@10.6.2.37'},
      {if_rebalance,<0.4746.771>,
       {wait_index_updated,857}},
      infinity]}}}
[rebalance:info,2012-10-30T17:52:47.822,ns_1@10.6.2.37:<0.5479.771>:ebucketmigrator_srv:do_confirm_sent_messages:671]Got close ack!
[rebalance:info,2012-10-30T17:52:47.827,ns_1@10.6.2.37:<0.5314.771>:ebucketmigrator_srv:do_confirm_sent_messages:671]Got close ack!
[rebalance:info,2012-10-30T17:52:47.835,ns_1@10.6.2.37:<0.4774.771>:janitor_agent:set_vbucket_state:419]Doing vbucket 61 state change: {'ns_1@10.6.2.37',active,paused,undefined}
[error_logger:error,2012-10-30T17:52:47.811,ns_1@10.6.2.37:error_logger<0.5.0>:ale_error_logger_handler:log_msg:76]** Generic server <0.16694.0> terminating
** Last message in was {wait_index_updated,857}
** When Server state == {state,"default",'capi_ddoc_replication_srv-default',
['ns_1@10.6.2.38','ns_1@10.6.2.42',
'ns_1@10.6.2.43','ns_1@10.6.2.44',
'ns_1@10.6.2.45','ns_1@10.6.2.39',
'ns_1@10.6.2.40'],
[{doc,<<"_design/d1">>,
{1,<<56,131,187,118>>},
<<"{\"views\":{\"v1\":{\"map\":...},\"v2\":{\"map\":...}}}">>,
0,false,[]}],
1024,true,undefined,
[replica,missing,missing,missing,missing,
missing,missing,missing,missing,missing,
missing,missing,missing,missing,missing,
missing,missing,missing,missing,missing,
missing,missing,missing,missing,missing,
missing,missing,missing,missing,missing,
missing,missing,missing,missing,missing,
missing,replica,replica,replica,replica,
replica,replica,replica,replica,replica,
replica,replica,replica,replica,replica,
replica,replica,replica,replica,replica,
replica,replica,replica,replica,replica,
replica,active,active,active,active,active,
active,active,active,active,active,active,
active,active,active,active,active,active,
active,active,active,active,active,active,
active,active,active,active,active,active,
active,active,active,active,active,active,
- This bug is similar to MB-6955.
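For reference, the load pattern described in the steps above (values of 512-1024 bytes on initial load, rewritten to 1024-1512 bytes in the mutation phase) can be sketched as below. This is a hypothetical stand-in for the actual load tool: the `build_load` and `make_value` helpers are illustrative only and do not talk to a cluster.

```python
import random
import string

def make_value(min_size, max_size, rng):
    """Build a random ASCII payload whose length falls in [min_size, max_size]."""
    size = rng.randint(min_size, max_size)
    return "".join(rng.choice(string.ascii_letters) for _ in range(size))

def build_load(num_items, min_size, max_size, seed=0):
    """Generate key -> value pairs mimicking the described workload.

    The real test loaded 16M/20M items; 100 here keeps the sketch fast.
    """
    rng = random.Random(seed)
    return {"key_%d" % i: make_value(min_size, max_size, rng)
            for i in range(num_items)}

# Initial load: values of 512-1024 bytes.
items = build_load(num_items=100, min_size=512, max_size=1024)

# Mutation phase: rewrite every item with a 1024-1512 byte value,
# matching the size bump done after initial indexing completed.
for key in items:
    items[key] = make_value(1024, 1512, random.Random(key))
```

In the actual test these writes would go through the Couchbase client against both buckets while the views are being queried; the sketch only captures the item-size distribution.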