Details
Description
Setup:
1. Set up a 16-node cluster. Enable auto-failover.
2. Load data on all 3 buckets (around 3M, 2M, and 367k items).
3. Continue loading data.
4. Reboot the orchestrator node [84].
5. Issue a rebalance on this cluster.
Output
1. Node 84 goes down and is auto-failed over. This is expected.
2. The rebalance params show no ejected nodes. [Note: no explicit remove-server action was performed.]
3. The rebalance operation fails with the error "Wait for memcached". The rebalance params (KeepNodes, EjectNodes) appear in the web-log output below.
4. Node 107 is failed over.
Was the rebalance command itself likely a faulty one? Should we disallow issuing a rebalance in this manner?
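To make the question above concrete, here is a minimal sketch of the kind of pre-check that could reject such a request. It is purely illustrative and is not how ns_server validates rebalance requests; the function name `check_rebalance_request` and the `unhealthy_nodes` input are hypothetical and assumed only for this example.

```python
def check_rebalance_request(keep_nodes, eject_nodes, unhealthy_nodes):
    """Return a list of reasons to reject the request.

    Hypothetical helper: an empty list means no problem was found.
    `unhealthy_nodes` is assumed to hold nodes that are down or were
    failed over at the time the rebalance is requested.
    """
    problems = []
    keep = set(keep_nodes)
    eject = set(eject_nodes)
    unhealthy = set(unhealthy_nodes)

    # A node cannot be both kept and ejected.
    overlap = keep & eject
    if overlap:
        problems.append(f"nodes both kept and ejected: {sorted(overlap)}")

    # Keeping a node that is down or failed over is suspicious.
    not_ready = keep & unhealthy
    if not_ready:
        problems.append(f"kept nodes not healthy: {sorted(not_ready)}")

    return problems


if __name__ == "__main__":
    # Mirrors the report: node 84 was failed over but is still in KeepNodes.
    print(check_rebalance_request(
        keep_nodes=["ns_1@10.3.2.84", "ns_1@10.3.2.85"],
        eject_nodes=[],
        unhealthy_nodes=["ns_1@10.3.2.84"],
    ))
```

Whether such a request should be rejected outright or only flagged with a warning is a design decision for the cluster manager; the sketch only shows what the check would look at.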
The live cluster is accessible at http://10.3.2.89:8091/index.html#sec=log
Output from the web logs:
2012-06-15 11:35:46.448 ns_orchestrator:4:info:message(ns_1@10.3.121.126) - Starting rebalance, KeepNodes = ['ns_1@10.3.2.84','ns_1@10.3.2.85',
'ns_1@10.3.2.86','ns_1@10.3.2.87',
'ns_1@10.3.2.88','ns_1@10.3.2.89',
'ns_1@10.3.2.104','ns_1@10.3.2.105',
'ns_1@10.3.2.106','ns_1@10.3.2.109',
'ns_1@10.3.2.110','ns_1@10.3.2.112',
'ns_1@10.3.2.113','ns_1@10.3.2.114',
'ns_1@10.3.121.126','ns_1@10.3.121.127',
'ns_1@10.3.2.107'], EjectNodes = []
2012-06-15 11:35:46.668 ns_rebalancer:0:info:message(ns_1@10.3.121.126) - Started rebalancing bucket bucket3
2012-06-15 11:35:49.335 ns_memcached:1:info:message(ns_1@10.3.2.107) - Bucket "bucket3" loaded on node 'ns_1@10.3.2.107' in 0 seconds.
2012-06-15 11:35:59.030 ns_orchestrator:2:info:message(ns_1@10.3.121.126) - Rebalance exited with reason
2012-06-15 11:36:03.901 ns_memcached:1:info:message(ns_1@10.3.2.84) - Bucket "bucket3" loaded on node 'ns_1@10.3.2.84' in 25 seconds.
2012-06-15 11:36:19.566 auto_failover:3:info:message(ns_1@10.3.121.126) - Could not auto-failover node ('ns_1@10.3.2.107'). There was at least another node down.
2012-06-15 11:36:19.566 auto_failover:3:info:message(ns_1@10.3.121.126) - Could not auto-failover node ('ns_1@10.3.2.84'). There was at least another node down.
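The rebalance params can be pulled out of a "Starting rebalance" message like the one above with a short script. This is a sketch that assumes the message format shown in this log excerpt; the function name `parse_rebalance_params` is made up for this example, and the sample string is a shortened copy of the logged message.

```python
import re


def parse_rebalance_params(log_text):
    """Extract (keep_nodes, eject_nodes) from a 'Starting rebalance' message.

    Returns None if no such message is found. The node lists may span
    several lines, so the pattern is matched with re.DOTALL.
    """
    match = re.search(
        r"KeepNodes = \[(.*?)\], EjectNodes = \[(.*?)\]",
        log_text,
        re.DOTALL,
    )
    if match is None:
        return None
    # Node names are single-quoted Erlang atoms, e.g. 'ns_1@10.3.2.84'.
    keep_nodes = re.findall(r"'([^']+)'", match.group(1))
    eject_nodes = re.findall(r"'([^']+)'", match.group(2))
    return keep_nodes, eject_nodes


if __name__ == "__main__":
    # Shortened sample in the same format as the log line above.
    sample = (
        "Starting rebalance, KeepNodes = ['ns_1@10.3.2.84','ns_1@10.3.2.85',\n"
        "'ns_1@10.3.2.107'], EjectNodes = []"
    )
    keep, eject = parse_rebalance_params(sample)
    print(len(keep), "kept,", len(eject), "ejected")
```

Run against the full message logged at 11:35:46.448, this returns 17 entries in KeepNodes, including 'ns_1@10.3.2.84' and 'ns_1@10.3.2.107', and an empty EjectNodes list.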
The log files are attached at: https://s3.amazonaws.com/bugdb/jira/bug-rebalance-largecluster/bug1.tar