Details

Type: Bug
Resolution: Won't Fix
Priority: Major
Fix Version/s: 2.0-beta
Affects Version/s: 1.8.1
Component/s: ns_server
Security Level: Public
Labels:
None
Environment:
Large cluster - CentOS, 16-node cluster
Build 181-916rel
3 buckets - bucket1(3G), bucket2(2.8G), bucket3(200M)

Description

Setup:
1. Set up a 16-node cluster. Enable auto-failover.
2. Load data on all 3 buckets [around 3M, 2M, 367k items].
3. Continue loading data.
4. Reboot the orchestrator node [84].
5. Issue a rebalance on this cluster.
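
For reference, step 5 can be sketched as the form body of a rebalance request. This is only an illustration: it assumes the rebalance is issued through the REST endpoint POST /controller/rebalance with the knownNodes and ejectedNodes form parameters, and the helper name build_rebalance_body is made up for this example. The ticket does not say how the rebalance was actually triggered. The sketch only builds the body and sends nothing.

```python
from urllib.parse import urlencode


def build_rebalance_body(known_nodes, ejected_nodes=()):
    """Build the form-encoded body for a rebalance request.

    Assumption: the cluster accepts comma-separated node names in the
    knownNodes and ejectedNodes form fields.
    """
    return urlencode({
        "knownNodes": ",".join(known_nodes),
        "ejectedNodes": ",".join(ejected_nodes),
    })


# Two of the nodes from this cluster, with nothing ejected,
# which mirrors the "EjectNodes = []" seen in the web log.
body = build_rebalance_body(["ns_1@10.3.2.84", "ns_1@10.3.2.85"])
print(body)
```

An empty ejectedNodes value corresponds to the "no ejected nodes" rebalance described in the output section.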

Output
1. Node 84 goes down and is auto-failed over - Expected.
2. The rebalance operation shows no ejected nodes in the rebalance params. [Note: No explicit remove-server action was performed on this node.]
3. The rebalance operation fails with the error "wait_for_memcached_failed". The rebalance params are in the web-log output below.
4. Node 107 is failed over.

Was the rebalance command likely a faulty one? Should we disallow issuing a rebalance in this manner?
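
One way to picture the kind of guard being asked about is a pre-check that refuses a rebalance when any node in the keep list is not reported healthy. This is a hypothetical sketch, not ns_server behavior: the function names, the status strings, and the shape of node_status are all assumptions made for illustration.

```python
from typing import Dict, List


def unhealthy_keep_nodes(keep_nodes: List[str],
                         node_status: Dict[str, str]) -> List[str]:
    """Return the keep nodes whose reported status is not 'healthy'.

    Nodes missing from node_status are treated as 'unknown', so they
    are reported as unhealthy too.
    """
    return [n for n in keep_nodes
            if node_status.get(n, "unknown") != "healthy"]


def check_rebalance_request(keep_nodes: List[str],
                            node_status: Dict[str, str]) -> None:
    """Raise ValueError if the rebalance would keep an unhealthy node."""
    bad = unhealthy_keep_nodes(keep_nodes, node_status)
    if bad:
        raise ValueError("refusing rebalance, unhealthy keep nodes: "
                         + ", ".join(bad))


# Hypothetical status snapshot: node 84 is still warming up after reboot.
status = {
    "ns_1@10.3.2.84": "warmup",
    "ns_1@10.3.2.85": "healthy",
}
blocked = unhealthy_keep_nodes(["ns_1@10.3.2.84", "ns_1@10.3.2.85"], status)
print(blocked)
```

With a check like this, a rebalance that lists a just-rebooted node in KeepNodes would be rejected up front instead of failing later while waiting for memcached.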

The live cluster is accessible at http://10.3.2.89:8091/index.html#sec=log
Output from the web logs:
2012-06-15 11:35:46.448 ns_orchestrator:4:info:message(ns_1@10.3.121.126) - Starting rebalance, KeepNodes = ['ns_1@10.3.2.84','ns_1@10.3.2.85',
'ns_1@10.3.2.86','ns_1@10.3.2.87',
'ns_1@10.3.2.88','ns_1@10.3.2.89',
'ns_1@10.3.2.104','ns_1@10.3.2.105',
'ns_1@10.3.2.106','ns_1@10.3.2.109',
'ns_1@10.3.2.110','ns_1@10.3.2.112',
'ns_1@10.3.2.113','ns_1@10.3.2.114',
'ns_1@10.3.121.126','ns_1@10.3.121.127',
'ns_1@10.3.2.107'], EjectNodes = []

2012-06-15 11:35:46.668 ns_rebalancer:0:info:message(ns_1@10.3.121.126) - Started rebalancing bucket bucket3
2012-06-15 11:35:49.335 ns_memcached:1:info:message(ns_1@10.3.2.107) - Bucket "bucket3" loaded on node 'ns_1@10.3.2.107' in 0 seconds.
2012-06-15 11:35:59.030 ns_orchestrator:2:info:message(ns_1@10.3.121.126) - Rebalance exited with reason {wait_for_memcached_failed,"bucket3", ['ns_1@10.3.2.84']}

2012-06-15 11:36:03.901 ns_memcached:1:info:message(ns_1@10.3.2.84) - Bucket "bucket3" loaded on node 'ns_1@10.3.2.84' in 25 seconds.
2012-06-15 11:36:19.566 auto_failover:3:info:message(ns_1@10.3.121.126) - Could not auto-failover node ('ns_1@10.3.2.107'). There was at least another node down.

2012-06-15 11:36:19.566 auto_failover:3:info:message(ns_1@10.3.121.126) - Could not auto-failover node ('ns_1@10.3.2.84'). There was at least another node down.
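
When triaging logs like the ones above, it can help to pull the failure reason, bucket, and node list out of the "Rebalance exited with reason" entry. The following is a minimal sketch that assumes the reason tuple has the exact shape shown in this ticket; the regular expression and the function name parse_rebalance_failure are illustrative and not part of any Couchbase tooling.

```python
import re

# Matches a tuple such as:
#   {wait_for_memcached_failed,"bucket3", ['ns_1@10.3.2.84']}
REASON_RE = re.compile(
    r'\{(?P<reason>\w+),\s*"(?P<bucket>[^"]+)",\s*\[(?P<nodes>[^\]]*)\]\}'
)


def parse_rebalance_failure(text):
    """Extract reason, bucket and node names from a failure tuple.

    Returns None when no tuple of the expected shape is found.
    """
    match = REASON_RE.search(text)
    if match is None:
        return None
    # Node names are single-quoted Erlang atoms inside the list.
    nodes = re.findall(r"'([^']+)'", match.group("nodes"))
    return {
        "reason": match.group("reason"),
        "bucket": match.group("bucket"),
        "nodes": nodes,
    }


line = ('Rebalance exited with reason '
        '{wait_for_memcached_failed,"bucket3", [\'ns_1@10.3.2.84\']}')
result = parse_rebalance_failure(line)
print(result)
```

For the entry in this ticket, the parsed result names bucket3 and node ns_1@10.3.2.84, the rebooted node that was still loading the bucket when the rebalance gave up.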

The log files are attached at: https://s3.amazonaws.com/bugdb/jira/bug-rebalance-largecluster/bug1.tar

People

Assignee: Dipti Borkar (Inactive)

Reporter: Ketaki Gangal (Inactive)

Votes: 0

Watchers: 2

Dates

Created: 15/Jun/12 12:12 PM

Updated: 09/Jan/13 8:59 PM

Resolved: 03/Aug/12 4:40 PM


Rebalance fails with error {wait_for_memcached_failed,"bucket3"} when a rebalance is issued after a reboot of the master node.
