Description
Setup
1. Create a 13 node cluster with 1 bucket, 1024 vBuckets
2. Load 51M items on the cluster [256 bytes - 512 bytes]
3. Enable auto-failover
4. Mutate existing items and create new items [200 - 612 bytes], bringing the total to around 60M items.
5. Each node has high swap usage (20%) [see bug MB-5392]
6. Add 2 nodes (10.3.2.8, 10.3.2.9) and issue rebalance
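For anyone reproducing step 6 without the UI, here is a minimal sketch of the two REST requests involved. It assumes the `/controller/addNode` and `/controller/rebalance` endpoints and their `hostname`/`user`/`password` and `knownNodes`/`ejectedNodes` form fields as described in the Couchbase REST documentation; verify them against the build under test. The `ns_1@<ip>` node naming is taken from the logs in this ticket. The helper names, the credentials, and the choice of 10.3.2.42 as the node to talk to are hypothetical, and the sketch only builds the requests without sending them.

```python
from urllib.parse import urlencode

REST_PORT = 8091  # admin port, as seen in the cluster URL in this ticket


def otp_node(ip):
    """Return the Erlang node name for an IP, e.g. ns_1@10.3.2.42."""
    return "ns_1@" + ip


def add_node_request(cluster_host, new_ip, user, password):
    """Build (url, form_body) for adding one node (assumed endpoint)."""
    url = "http://%s:%d/controller/addNode" % (cluster_host, REST_PORT)
    body = urlencode({"hostname": new_ip, "user": user, "password": password})
    return url, body


def rebalance_request(cluster_host, known_ips, ejected_ips=()):
    """Build (url, form_body) for starting a rebalance (assumed endpoint).

    known_ips must list every node that should be in the cluster after
    the rebalance, including the newly added ones.
    """
    url = "http://%s:%d/controller/rebalance" % (cluster_host, REST_PORT)
    body = urlencode({
        "knownNodes": ",".join(otp_node(ip) for ip in known_ips),
        "ejectedNodes": ",".join(otp_node(ip) for ip in ejected_ips),
    })
    return url, body
```

In a real run, `known_ips` would contain all 13 original nodes plus 10.3.2.8 and 10.3.2.9; the ticket does not list the original nodes, so they have to be filled in from the cluster itself.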
Output
1. Rebalance fails with error "missing_checkpoint_stats"
Stats/Resources/Screenshots
1. Memory stats are attached at https://s3.amazonaws.com/bugdb/jira/MB-rebalanceFail/05-29-rebal.tar
2. A screenshot from the cluster is attached
3. Cluster can be accessed at http://10.3.2.8:8091/index.html#sec=overview
Some errors that could be related to the rebalance failure:
delete_vbucket and stats calls are taking too long.
[ns_server:error] [2012-05-29 10:58:22] [ns_1@10.3.2.42:ns_doctor:ns_doctor:update_status:154] The following buckets became not ready on node 'ns_1@10.3.2.42': ["default"], those of them are active ["default"]
[ns_server:error] [2012-05-29 10:58:31] [ns_1@10.3.2.42:'ns_memcached-default':ns_memcached:handle_call:135] call
[ns_server:error] [2012-05-29 11:09:34] [ns_1@10.3.2.42:'ns_memcached-default':ns_memcached:handle_call:135] call {stats,<<>>} took too long: 577806 us
[ns_server:error] [2012-05-29 11:13:19] [ns_1@10.3.2.42:'ns_memcached-default':ns_memcached:handle_info:277] handle_info(ensure_bucket,..) took too long: 864638 us
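To see how often and how badly calls are stalling across the full logs in the attached tarball, the "took too long" lines can be pulled out with a small script. This is a rough sketch that assumes every such line follows the layout of the examples above (`[level] [timestamp] [node:...] <what> took too long: <N> us`); the function name and the regular expression are illustrative, not part of ns_server.

```python
import re

# Matches lines shaped like the ns_server examples in this ticket:
#   [ns_server:error] [<timestamp>] [<node>:...] <what> took too long: <N> us
SLOW_RE = re.compile(
    r"\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\] "
    r"\[(?P<node>[^:\]]+):.*?\] "
    r"(?P<what>.+?) took too long: (?P<us>\d+) us"
)


def slow_calls(lines):
    """Return (timestamp, node, what, microseconds) for each slow-call line.

    Lines that do not match (including truncated ones) are skipped.
    """
    found = []
    for line in lines:
        m = SLOW_RE.search(line)
        if m:
            found.append((m.group("ts"), m.group("node"),
                          m.group("what"), int(m.group("us"))))
    return found
```

Sorting the result by the microsecond field gives a quick view of the worst stalls per node.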
========================CRASH REPORT=========================
crasher:
initial call: ns_janitor:cleanup/2
pid: <0.21807.93>
registered_name: []
exception exit: {timeout,
{gen_server,call,
[{'ns_memcached-default','ns_1@10.3.2.42'},
{delete_vbucket,874},
30000]}}
in function gen_server:call/3
in call from ns_memcached:do_call/3
in call from lists:foreach/2
in call from ns_janitor:do_sanify_chain/6
in call from ns_janitor:sanify_chain/6
in call from ns_janitor:'sanify/5-lc$^1/1-1'/5
in call from ns_janitor:'sanify/5-lc$^1/1-1'/5
in call from ns_janitor:do_cleanup/3
ancestors: [<0.217.0>,mb_master_sup,mb_master,ns_server_sup,
ns_server_cluster_sup,<0.60.0>]
messages: []
links: [<0.217.0>]
dictionary: []
trap_exit: false
status: running
heap_size: 75025
stack_size: 24
reductions: 1835901
neighbours:
[error_logger:error] [2012-05-29 10:58:07] [ns_1@10.3.2.42:error_logger:ale_error_logger_handler:log_msg:76] ** Generic server 'ns_memcached-default' terminating
** Last message in was {delete_vbucket,874}
** When Server state == {state,{interval,#Ref<0.0.82.161858>}, connected, {1338,314257,216912}, "default",#Port<0.226981>}
** Reason for termination ==
** {{badmatch,{error,timeout}},
[{mc_client_binary,cmd_binary_vocal_recv,5},
{mc_client_binary,delete_vbucket,2},
{ns_memcached,do_handle_call,3},
{ns_memcached,handle_call,3},
{gen_server,handle_msg,5},
{proc_lib,init_p_do_apply,3}]}