Description
[forum report]
We are using beta4 running on an EC2 CentOS 5.4 x64 server.
The test scenario is that we have four Membase servers running correctly in a cluster, and we manually terminate one to three of them via the AWS console.
We are now stuck in a state where the remaining server(s) have correctly identified that some of the other servers are down, but we are unable to do anything about it. Clicking Fail Over shows a loading spinner for approximately 5 seconds before the page reloads. The log shows the following error:
Server error during processing: ["web request failed",
{path,"/controller/failOver"},
{type,exit},
{what,
{{
,
{gen_server,call,
[
,
{set_vbucket,1,pending},
30000]}},
{gen_fsm,sync_send_event,
[
,
{failover,'ns_1@10.223.62.106'},
20000]}}},
{trace,
[
,
{ns_cluster_membership,failover,1},
{menelaus_web,handle_failover,1},
{menelaus_web,loop,3},
{mochiweb_http,headers,5},
{proc_lib,init_p_do_apply,3}]}] (repeated 1 times)
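To check whether the failure depends on the web console, the same request could be built and sent by hand. The following is only a minimal sketch under stated assumptions: the path /controller/failOver and the node name come from the log above, and the base URL is the one that appears in the client-side error report, but the form field name otpNode and the helper build_failover_request are assumptions for illustration. The request is built but not sent, and authentication is left out.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_failover_request(base_url: str, otp_node: str) -> Request:
    """Build (but do not send) a POST to the failover endpoint seen in the log.

    build_failover_request is a hypothetical helper, not part of Membase.
    """
    # Assumption: the endpoint takes the node as a form field named "otpNode".
    body = urlencode({"otpNode": otp_node}).encode("ascii")
    return Request(
        url=base_url.rstrip("/") + "/controller/failOver",
        data=body,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


# Base URL taken from the client-side report; node name taken from the server log.
req = build_failover_request("http://174.129.106.43:8080", "ns_1@10.223.62.106")
print(req.get_method(), req.full_url)
print(req.data.decode("ascii"))
```

Sending it would additionally need the administrator credentials and a call such as urllib.request.urlopen(req). If the same error shows up in the server log, the web console can be ruled out as the cause.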
Clicking remove servers puts them in pending rebalance, but clicking Rebalance also fails, with the following error in the logs:
Rebalance exited with reason noconnection
(repeated 1 times) ns_orchestrator002 20:47:05 - Tue Oct 5, 2010
Client-side error-report for user "Administrator" on node 'ns_1@10.122.10.120':
User-Agent:Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)
Got unhandled error: 'undefined' is null or not an object
At: http://174.129.106.43:8080/js/all.js:6238
Backtrace:
Function: collectBacktraceViaCaller
Args:
I have tried with replication both enabled and disabled for the bucket, but cannot recover from this state. This is obviously a fairly serious problem for us, as we cannot have the entire cluster fail because of a single machine failure.
Any ideas? Are we doing something completely wrong here? Thanks in advance.