Description
From MB-6058. I've found that first rebalance failure happened because during vbucket filter change one of migrator's request for "checkpoint" stat on source node (.25) closed connection without returning anything.
That what happens:
[error_logger:error] [2012-08-13 11:46:34] [ns_1@10.3.121.26:error_logger:ale_error_logger_handler:log_msg:76] ** Generic server <0.20247.25> terminating
-
- Last message in was {start_vbucket_filter_change, [74,92,93,263,264,298,365,366]}
- When Server state == {state,#Port<0.2055414>,#Port<0.2055409>,
#Port<0.2055416>,#Port<0.2055410>,<0.20249.25>,
<<>>,<<>>,
{set,7,16,16,8,80,48,
{[],[],[],[],[],[],[],[],[],[],[],[],[], [],[],[]},
{1344,883254,766949}
{{[263],
[298],
[365],
[],"
",[],[],[],[],[],[],
[264],
[],[],"J","]"}}},
74959,false,false,0,
,
not_started,undefined,
<<"replication_ns_1@10.3.121.26">>,
<0.20247.25>,
Unknown macro: {had_backfill,false,undefined,[]}}
- Reason for termination ==
- badmatch,{error,closed,
[ {mc_binary,quick_stats_recv,3},
{mc_binary,quick_stats_loop,5}
,
{mc_binary,quick_stats,5}
,
{ebucketmigrator_srv,handle_call,3}
,
{gen_server,handle_msg,5}
,
{proc_lib,init_p_do_apply,3}
]}
By analyzing logs I see that this was call on so called 'upstream-aux' connection (non-tap connection we maintain to upstream node (in this case .25) in order to request stats) and that this could be only "checkpoints" request.
P.S. Given that we have those better logging integrated, I guess you can tell us now what log level to enable for memcached log file that will enable us to understand this properly.
Full logs can be obtained at: https://s3.amazonaws.com/packages.couchbase/atop-files/2.0.0/atop-10nodes-1573-swap-reb-reboot-failed-20120813.tgz