Details
- Type: Bug
- Resolution: Fixed
- Priority: Blocker
- Affects Version: 2.0
- Security Level: Public
- Environment: CentOS 6.2 64-bit, build 2.0.0-1908
Description
Cluster information:
- 8 CentOS 6.2 64-bit servers, each with a 4-core CPU
- Each server has 32 GB RAM and a 400 GB SSD.
- 24.8 GB RAM is allocated to Couchbase Server on each node.
- The SSD is formatted ext4 and mounted on /data.
- Each server has its own SSD; no disk is shared between servers.
- Created a cluster from the 6 nodes listed below, running Couchbase Server 2.0.0-1908.
- Link to manifest file: http://builds.hq.northscale.net/latestbuilds/couchbase-server-enterprise_x86_64_2.0.0-1908-rel.rpm.manifest.xml
- The cluster has 2 buckets: default (12 GB, 2 replicas) and saslbucket (12 GB, 1 replica).
- Each bucket has one design doc with 2 views (d1 on default, d11 on saslbucket).
10.6.2.37
10.6.2.38
10.6.2.44
10.6.2.45
10.6.2.42
10.6.2.43
- Loaded 16 million items into the default bucket and 20 million items into saslbucket; each item is 512 to 1024 bytes.
- After loading completed, waited for initial indexing. Disabled view compaction.
- After initial indexing was done, mutated all items to sizes from 1024 to 1512 bytes.
- Queried all 4 views from the 2 design docs.
- Performed a swap rebalance: removed nodes .39 and .40, added nodes .44 and .45.
- At the end of rebalancing saslbucket, the rebalance exited with a timeout on node .43.
- Then saw many connection resets to mccouch; updated bug MB-7046. Killed all loads pointing to this cluster; node .43 did not return to a stable state.
- Erlang on node .43 appeared to be hung. Killed beam.smp and restarted Couchbase Server on node .43 (will update this bug soon).
- Waited for node .43 to come fully up and for the disk write queue and drain rate to drop to 0.
- Started queries and item mutations on both buckets, waited a few minutes, then started a swap rebalance again.
- Rebalance failed with this error:
2012-10-30 17:52:37.698 ns_rebalancer:0:info:message(ns_1@10.6.2.37) - Started rebalancing bucket saslbucket
2012-10-30 17:52:38.722 ns_vbucket_mover:0:info:message(ns_1@10.6.2.37) - Bucket "saslbucket" rebalance appears to be swap rebalance
2012-10-30 17:52:38.968 ns_rebalancer:0:info:message(ns_1@10.6.2.37) - Started rebalancing bucket default
2012-10-30 17:52:40.592 ns_vbucket_mover:0:info:message(ns_1@10.6.2.37) - Bucket "default" rebalance does not seem to be swap rebalance
2012-10-30 17:52:48.027 ns_vbucket_mover:0:critical:message(ns_1@10.6.2.37) - <0.4774.771> exited with {noproc,
  {gen_server,call,
    [{if_rebalance,<0.4746.771>,
      {update_vbucket_state,61,active,paused,undefined}},
     infinity]}}
Looking at the diag log of node .37, we see the error "Partition 857 not in active nor passive set":
[ns_server:error,2012-10-30T17:52:45.760,ns_1@10.6.2.37:<0.20186.0>:ns_memcached:verify_report_long_call:297]call topkeys took too long: 589040 us
[ns_server:error,2012-10-30T17:52:46.835,ns_1@10.6.2.37:<0.16711.0>:ns_memcached:verify_report_long_call:297]call topkeys took too long: 963763 us
[rebalance:info,2012-10-30T17:52:47.098,ns_1@10.6.2.37:<0.4781.771>:janitor_agent:set_vbucket_state:419]Doing vbucket 530 state change: {'ns_1@10.6.2.40',active,paused,undefined}
[rebalance:info,2012-10-30T17:52:47.153,ns_1@10.6.2.37:<0.4781.771>:janitor_agent:get_replication_persistence_checkpoint_id:470]default: Doing get_replication_persistence_checkpoint_id call for vbucket 530 on ns_1@10.6.2.40
[rebalance:debug,2012-10-30T17:52:47.162,ns_1@10.6.2.37:<0.6121.771>:janitor_agent:handle_call:651]Going to wait for persistence of checkpoint 44 in vbucket 530
[ns_server:debug,2012-10-30T17:52:47.203,ns_1@10.6.2.37:janitor_agent-default<0.16716.0>:janitor_agent:handle_info:682]Got done message from subprocess: <0.6121.771> (ok)
[rebalance:info,2012-10-30T17:52:47.206,ns_1@10.6.2.37:<0.6125.771>:janitor_agent:wait_index_updated:459]default: Doing wait_index_updated call for ns_1@10.6.2.37 (vbucket 530)
[ns_server:debug,2012-10-30T17:52:47.225,ns_1@10.6.2.37:<0.6127.771>:capi_set_view_manager:do_wait_index_updated:608]References to wait: Ref<0.0.3553.75720> ("default", 530)
[ns_server:info,2012-10-30T17:52:47.482,ns_1@10.6.2.37:ns_config_rep<0.758.0>:ns_config_rep:do_pull:341]Pulling config from: 'ns_1@10.6.2.40'
[rebalance:info,2012-10-30T17:52:47.626,ns_1@10.6.2.37:<0.4779.771>:janitor_agent:set_vbucket_state:419]Doing vbucket 857 state change: {'ns_1@10.6.2.43',active,paused,undefined}
[rebalance:info,2012-10-30T17:52:47.709,ns_1@10.6.2.37:<0.4779.771>:janitor_agent:get_replication_persistence_checkpoint_id:470]default: Doing get_replication_persistence_checkpoint_id call for vbucket 857 on ns_1@10.6.2.43
[rebalance:debug,2012-10-30T17:52:47.739,ns_1@10.6.2.37:<0.6284.771>:janitor_agent:handle_call:651]Going to wait for persistence of checkpoint 42 in vbucket 857
[ns_server:debug,2012-10-30T17:52:47.753,ns_1@10.6.2.37:janitor_agent-default<0.16716.0>:janitor_agent:handle_info:682]Got done message from subprocess: <0.6284.771> (ok)
[rebalance:info,2012-10-30T17:52:47.758,ns_1@10.6.2.37:<0.6286.771>:janitor_agent:wait_index_updated:459]default: Doing wait_index_updated call for ns_1@10.6.2.37 (vbucket 857)
[ns_server:debug,2012-10-30T17:52:47.772,ns_1@10.6.2.37:<0.16695.0>:ns_pubsub:do_subscribe_link:132]Parent process of subscription {ns_config_events,<0.16694.0>} exited with reason {{badmatch,
{error,
{error, <<"Partition 857 not in active nor passive set">>}}},
[{capi_set_view_manager, handle_call, 3},
{gen_server, handle_msg, 5},
{gen_server, init_it, 6},
{proc_lib, init_p_do_apply, 3}]}
[ns_server:debug,2012-10-30T17:52:47.772,ns_1@10.6.2.37:<0.6127.771>:capi_set_view_manager:do_wait_index_updated:621]Parent died: {'DOWN',#Ref<0.0.3553.75791>,process,<0.16694.0>,
{{badmatch,
{error,
{error, <<"Partition 857 not in active nor passive set">>}}},
[{capi_set_view_manager,handle_call,3},
{gen_server,handle_msg,5},
{gen_server,init_it,6},
{proc_lib,init_p_do_apply,3}]}}
[error_logger:error,2012-10-30T17:52:47.775,ns_1@10.6.2.37:error_logger<0.5.0>:ale_error_logger_handler:log_report:72]
=========================CRASH REPORT=========================
crasher:
initial call: erlang:apply/2
pid: <0.6297.771>
registered_name: []
exception throw: {error,<<"Partition 857 not in active nor passive set">>}
in function couch_set_view:monitor_partition_update/3
in call from capi_set_view_manager:'
in call from lists:foldl/3
in call from capi_set_view_manager:do_wait_index_updated/4
ancestors: ['capi_set_view_manager-default','single_bucket_sup-default',
<0.16686.0>]
messages: []
links: [<0.16694.0>]
dictionary: []
trap_exit: false
status: running
heap_size: 6765
stack_size: 24
reductions: 229
neighbours:
[ns_server:error,2012-10-30T17:52:47.778,ns_1@10.6.2.37:<0.4781.771>:ns_single_vbucket_mover:spawn_and_wait:81]Got unexpected exit signal {'EXIT',<0.6125.771>,
{{{{badmatch,
{error,
{error, <<"Partition 857 not in active nor passive set">>}}},
[{capi_set_view_manager,handle_call,3},
{gen_server,handle_msg,5},
{gen_server,init_it,6},
{proc_lib,init_p_do_apply,3}]},
{gen_server,call,
['capi_set_view_manager-default',
{wait_index_updated,857},
infinity]}},
{gen_server,call,
[{'janitor_agent-default','ns_1@10.6.2.37'},
{if_rebalance,<0.4746.771>,
{wait_index_updated,530}},
infinity]}}}
[ns_server:debug,2012-10-30T17:52:47.804,ns_1@10.6.2.37:<0.16705.0>:ns_pubsub:do_subscribe_link:132]Parent process of subscription exited with reason {{badmatch,
  {error,
    {error, <<"Partition 857 not in active nor passive set">>}}},
  [{capi_set_view_manager,handle_call,3},
   {gen_server,handle_msg,5},
   {gen_server,init_it,6},
   {proc_lib,init_p_do_apply,3}]}
[ns_server:debug,2012-10-30T17:52:47.804,ns_1@10.6.2.37:<0.16704.0>:ns_pubsub:do_subscribe_link:132]Parent process of subscription exited with reason {{badmatch,
  {error,
    {error, <<"Partition 857 not in active nor passive set">>}}},
  [{capi_set_view_manager,handle_call,3},
   {gen_server,handle_msg,5},
   {gen_server,init_it,6},
   {proc_lib,init_p_do_apply,3}]}
[ns_server:error,2012-10-30T17:52:47.808,ns_1@10.6.2.37:<0.4779.771>:ns_single_vbucket_mover:spawn_and_wait:81]Got unexpected exit signal {'EXIT',<0.6286.771>,
  {{{{badmatch,
      {error,
        {error, <<"Partition 857 not in active nor passive set">>}}},
     [{capi_set_view_manager,handle_call,3},
      {gen_server,handle_msg,5},
      {gen_server,init_it,6},
      {proc_lib,init_p_do_apply,3}]},
    {gen_server,call,
      ['capi_set_view_manager-default',
       {wait_index_updated,857},
       infinity]}},
   {gen_server,call,
     [{'janitor_agent-default','ns_1@10.6.2.37'},
      {if_rebalance,<0.4746.771>,
       {wait_index_updated,857}},
      infinity]}}}
[rebalance:info,2012-10-30T17:52:47.822,ns_1@10.6.2.37:<0.5479.771>:ebucketmigrator_srv:do_confirm_sent_messages:671]Got close ack!
[rebalance:info,2012-10-30T17:52:47.827,ns_1@10.6.2.37:<0.5314.771>:ebucketmigrator_srv:do_confirm_sent_messages:671]Got close ack!
[rebalance:info,2012-10-30T17:52:47.835,ns_1@10.6.2.37:<0.4774.771>:janitor_agent:set_vbucket_state:419]Doing vbucket 61 state change: {'ns_1@10.6.2.37',active,paused,undefined}
[error_logger:error,2012-10-30T17:52:47.811,ns_1@10.6.2.37:error_logger<0.5.0>:ale_error_logger_handler:log_msg:76]** Generic server <0.16694.0> terminating
** Last message in was {wait_index_updated,857}
** When Server state == {state,"default",'capi_ddoc_replication_srv-default',
['ns_1@10.6.2.38','ns_1@10.6.2.42',
'ns_1@10.6.2.43','ns_1@10.6.2.44',
'ns_1@10.6.2.45','ns_1@10.6.2.39',
'ns_1@10.6.2.40'],
[{doc,<<"_design/d1">>,
{1,<<56,131,187,118>>},
<<"{\"views\":{\"v1\":{\"map\":...},\"v2\":{\"map\":...}}}">>,
0,false,[]}],
1024,true,undefined,
[replica,missing,missing,missing,missing,
missing,missing,missing,missing,missing,
missing,missing,missing,missing,missing,
missing,missing,missing,missing,missing,
missing,missing,missing,missing,missing,
missing,missing,missing,missing,missing,
missing,missing,missing,missing,missing,
missing,replica,replica,replica,replica,
replica,replica,replica,replica,replica,
replica,replica,replica,replica,replica,
replica,replica,replica,replica,replica,
replica,replica,replica,replica,replica,
replica,active,active,active,active,active,
active,active,active,active,active,active,
active,active,active,active,active,active,
active,active,active,active,active,active,
active,active,active,active,active,active,
active,active,active,active,active,active,
- This bug is similar to MB-6955.
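For reference, the load pattern described in the steps above (values of 512-1024 bytes on initial load, rewritten to 1024-1512 bytes in the mutation phase) can be sketched as below. This is a hypothetical stand-in for the actual load tool: the `build_load` and `make_value` helpers are illustrative only and do not talk to a cluster.

```python
import random
import string

def make_value(min_size, max_size, rng):
    """Build a random ASCII payload whose length falls in [min_size, max_size]."""
    size = rng.randint(min_size, max_size)
    return "".join(rng.choice(string.ascii_letters) for _ in range(size))

def build_load(num_items, min_size, max_size, seed=0):
    """Generate key -> value pairs mimicking the described workload.

    The real test loaded 16M/20M items; 100 here keeps the sketch fast.
    """
    rng = random.Random(seed)
    return {"key_%d" % i: make_value(min_size, max_size, rng)
            for i in range(num_items)}

# Initial load: values of 512-1024 bytes.
items = build_load(num_items=100, min_size=512, max_size=1024)

# Mutation phase: rewrite every item with a 1024-1512 byte value,
# matching the size bump done after initial indexing completed.
for key in items:
    items[key] = make_value(1024, 1512, random.Random(key))
```

In the actual test these writes would go through the Couchbase client against both buckets while the views are being queried; the sketch only captures the item-size distribution.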