Details
- Bug
- Resolution: Fixed
- Test Blocker
- 3.0
- Security Level: Public
- None
- CentOS 6.x, 8*8 clusters, 2 uni-XDCRs; each node has 15 GB RAM, 4 cores
- Untriaged
- Unknown
- June 30 - July 18
Description
Build
--------
3.0.0-786 (xdcr on upr, internal replication on upr)
Clusters
-----------
Source : http://172.23.105.44:8091/
Destination : http://172.23.105.54:8091/
The clusters are available for investigation; there is no urgency to reclaim them. Please let me know if you need me to collect logs.
Steps
--------
1. Load data on both clusters until vb_active_resident_items_ratio < 30.
2. Run an access phase (98% gets, 2% sets) for 3 hours.
3. Rebalance out 1 node at cluster1 with the workload running [high DGM, ~4%].
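For reference, the DGM threshold in step 1 is a plain percentage; a minimal sketch of the check (the helper name is mine, not part of any test framework):

```python
def resident_ratio(items_in_memory: int, total_active_items: int) -> float:
    """Percentage of active items held in memory.

    Mirrors ep-engine's vb_active_resident_items_ratio stat: the load
    phase above runs until this drops below 30 (heavy DGM).
    """
    if total_active_items == 0:
        return 100.0  # an empty bucket is fully resident
    return 100.0 * items_in_memory / total_active_items
```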
Every attempt to rebalance out one node fails; the last attempt left 3 nodes in a pending state.
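For anyone reproducing this, the rebalance-out is driven through ns_server's REST endpoint POST /controller/rebalance; a minimal sketch (credentials and node lists here are illustrative, and the helper names are mine):

```python
import base64
import urllib.parse
import urllib.request

def rebalance_out_body(known_nodes, eject_nodes):
    """Form body for POST /controller/rebalance (otpNode names)."""
    return urllib.parse.urlencode({
        "knownNodes": ",".join(known_nodes),
        "ejectedNodes": ",".join(eject_nodes),
    })

def start_rebalance_out(host, user, password, known_nodes, eject_nodes):
    """Ask the orchestrator to rebalance eject_nodes out of the cluster."""
    auth = base64.b64encode(f"{user}:{password}".encode()).decode()
    req = urllib.request.Request(
        f"http://{host}:8091/controller/rebalance",
        data=rebalance_out_body(known_nodes, eject_nodes).encode(),
        headers={"Authorization": f"Basic {auth}"},
        method="POST",
    )
    # Returns 200 when accepted; progress is then polled separately.
    return urllib.request.urlopen(req)
```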
First rebalance-out failed with error:
-----------------------------------------------
Many messages like the following:
Control connection to memcached on 'ns_1@172.23.105.49' disconnected:
{{badmatch, {error, timeout}},
 [{mc_client_binary, stats_recv, 4, [...]},
  {mc_client_binary, stats, 4,
   [{file, "src/mc_client_binary.erl"}, {line, 411}]},
  {ns_memcached, handle_info, 2, [...]},
  {gen_server, handle_msg, 5, [{file, "gen_server.erl"}, {line, 604}]},
  {ns_memcached, init, 1, [{file, "src/ns_memcached.erl"}, {line, 170}]},
  {gen_server, init_it, 6, [..., {line, 304}]},
  {proc_lib, init_p_do_apply, 3, [..., {line, 239}]}]}
Subsequent rebalance-out attempts
-------------------------------------------------
Control connection to memcached on 'ns_1@172.23.105.52' disconnected: {badmatch, {error, timeout}}   ns_memcached000   ns_1@172.23.105.52   14:20:19 - Fri Jun 6, 2014
Control connection to memcached on 'ns_1@172.23.105.48' disconnected: {badmatch, {error, timeout}}   ns_memcached000   ns_1@172.23.105.48   14:20:19 - Fri Jun 6, 2014
Control connection to memcached on 'ns_1@172.23.105.45' disconnected: {badmatch, {error, timeout}}   ns_memcached000   ns_1@172.23.105.45   14:20:19 - Fri Jun 6, 2014
Rebalance exited with reason [reason truncated]   ns_orchestrator002   ns_1@172.23.105.44   14:17:19 - Fri Jun 6, 2014
Bucket "saslbucket" loaded on node 'ns_1@172.23.105.52' in 0 seconds.   ns_memcached000   ns_1@172.23.105.52   14:16:32 - Fri Jun 6, 2014
Bucket "saslbucket" loaded on node 'ns_1@172.23.105.45' in 0 seconds.   ns_memcached000   ns_1@172.23.105.45   14:16:32 - Fri Jun 6, 2014
Control connection to memcached on 'ns_1@172.23.105.45' disconnected: {badmatch, {error, timeout}}   ns_memcached000   ns_1@172.23.105.45   14:16:32 - Fri Jun 6, 2014
Control connection to memcached on 'ns_1@172.23.105.52' disconnected: {badmatch, {error, timeout}}   ns_memcached000   ns_1@172.23.105.52   14:16:32 - Fri Jun 6, 2014
Started rebalancing bucket standardbucket1
Starting rebalance, KeepNodes = ['ns_1@172.23.105.44','ns_1@172.23.105.45',
'ns_1@172.23.105.48','ns_1@172.23.105.49',
'ns_1@172.23.105.50','ns_1@172.23.105.51',
'ns_1@172.23.105.52'], EjectNodes = ['ns_1@172.23.105.47'], Failed over and being ejected nodes = []; no delta recovery nodes
Rebalance exited with reason {not_all_nodes_are_ready_yet, ['ns_1@172.23.105.50']}
Started rebalancing bucket standardbucket1
Starting rebalance, KeepNodes = ['ns_1@172.23.105.44','ns_1@172.23.105.45',
'ns_1@172.23.105.48','ns_1@172.23.105.49',
'ns_1@172.23.105.50','ns_1@172.23.105.51',
'ns_1@172.23.105.52'], EjectNodes = ['ns_1@172.23.105.47'], Failed over and being ejected nodes = []; no delta recovery nodes
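The {not_all_nodes_are_ready_yet, ['ns_1@172.23.105.50']} exit above means the orchestrator gave up waiting for node .50 to report ready before starting the rebalance. A pre-rebalance readiness check against /pools/default can be sketched like this (the helper is hypothetical; the field names follow the REST response):

```python
def nodes_not_ready(pool_default: dict) -> list:
    """Return otpNode names from /pools/default that look unready to rebalance."""
    return [
        node["otpNode"]
        for node in pool_default.get("nodes", [])
        if node.get("status") != "healthy"
        or node.get("clusterMembership") != "active"
    ]
```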
Please feel free to close this if another similar issue is still open.
Attachments
Issue Links
- relates to MB-11351: "ns_server's ns_heart and janitor_agent may get totally stuck if some upr stuff inside ep-engine gets stuck" (Closed)