Details

Type: Bug
Resolution: Fixed
Fix Version/s: 1.6.0 GA
Affects Version/s: 1.6.0 beta4
Component/s: ns_server
Labels: None
Environment:
Operating System: All
Platform: All

Description

[forum report]

We are using beta4 running on an EC2 CentOS 5.4 x64 server.

The test scenario: four Membase servers are running correctly in a cluster, and we manually terminate one to three of them via the AWS console.

We are now stuck in a state where the remaining server(s) have correctly identified that some of the other servers are down, but we are unable to do anything about it. Clicking Fail Over shows a loading spinner for approximately 5 seconds before the page reloads. The log shows the following error:

Server error during processing: ["web request failed",
 {path,"/controller/failOver"},
 {type,exit},
 {what,
  {{{nodedown,'ns_1@10.223.62.182'},
    {gen_server,call,
     [{'ns_memcached-default','ns_1@10.223.62.182'},
      {set_vbucket,1,pending},
      30000]}},
   {gen_fsm,sync_send_event,
    [{global,ns_orchestrator},
     {failover,'ns_1@10.223.62.106'},
     20000]}}},
 {trace,
  [{gen_fsm,sync_send_event,3},
   {ns_cluster_membership,failover,1},
   {menelaus_web,handle_failover,1},
   {menelaus_web,loop,3},
   {mochiweb_http,headers,5},
   {proc_lib,init_p_do_apply,3}]}] (repeated 1 times)
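
Reading the error term: the request to fail over 'ns_1@10.223.62.106' went through ns_orchestrator, which issued a set_vbucket call to 'ns_memcached-default' on 'ns_1@10.223.62.182', a second node that was also down. That call exited with nodedown and the exit propagated up to the web request. The Python sketch below only illustrates this failure pattern and one tolerant alternative; it is not the ns_server code or the actual fix, and every name in it (NodeDown, set_vbucket, failover_strict, failover_tolerant) is hypothetical.

```python
class NodeDown(Exception):
    """Stand-in for the Erlang nodedown exit seen in the log."""


def set_vbucket(node, vbucket, state, live_nodes):
    # Hypothetical remote call: fails if the target node is unreachable.
    if node not in live_nodes:
        raise NodeDown(node)
    return (node, vbucket, state)


def failover_strict(targets, live_nodes):
    # One unreachable target aborts the whole operation,
    # which matches the behaviour described in the report.
    return [set_vbucket(n, 1, "pending", live_nodes) for n in targets]


def failover_tolerant(targets, live_nodes):
    # Assumed alternative: skip unreachable targets and report them.
    done, skipped = [], []
    for n in targets:
        try:
            done.append(set_vbucket(n, 1, "pending", live_nodes))
        except NodeDown:
            skipped.append(n)
    return done, skipped
```

With one live node and one dead node in the target list, failover_strict raises NodeDown, while failover_tolerant completes the call on the live node and returns the dead node in its skipped list.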

Clicking remove servers puts them in pending rebalance, but clicking rebalance also fails, with the following error in the logs:

Rebalance exited with reason noconnection
(repeated 1 times) ns_orchestrator002 20:47:05 - Tue Oct 5, 2010
Client-side error-report for user "Administrator" on node 'ns_1@10.122.10.120':
User-Agent:Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)
Got unhandled error: 'undefined' is null or not an object
At: http://174.129.106.43:8080/js/all.js:6238
Backtrace:
Function: collectBacktraceViaCaller
Args:

I have tried with replication both enabled and disabled for the bucket, but cannot recover from this state. This is a fairly serious problem for us, as we cannot have the entire cluster fail because of a single machine failure.

Any ideas? Are we doing something completely wrong here? Thanks in advance.

People

Assignee: Sean Lynch (Inactive)

Reporter: Dustin Sallings (Inactive)

Votes: 0

Watchers: 1

Dates

Created: 06/Oct/10 5:04 AM

Updated: 11/Oct/10 3:45 AM

Resolved: 06/Oct/10 6:25 PM

Cluster stuck in unrecoverable state after server failures