  1. Couchbase Server
  2. MB-2450

Cluster stuck in unrecoverable state after server failures


Details

    • Type: Bug
    • Resolution: Fixed
    • Fix Version: 1.6.0 GA
    • Affects Version: 1.6.0 beta4
    • Component: ns_server
    • None
    • Operating System: All
      Platform: All

    Description

      [forum report]

      We are using beta4 running on an EC2 CentOS 5.4 x64 server.

      The test scenario is that we have four Membase servers running correctly in a cluster, and we manually terminate one to three of them via the AWS console.

      We are now stuck in a state where the remaining servers have correctly identified that some of the other servers are down, but we are unable to do anything about it. Clicking Fail Over shows a loading spinner for approximately 5 seconds before the page reloads. The log shows the following error:

      Server error during processing: ["web request failed",
       {path,"/controller/failOver"},
       {type,exit},
       {what,
        {{{nodedown,'ns_1@10.223.62.182'},
          {gen_server,call,
           [{'ns_memcached-default','ns_1@10.223.62.182'},
            {set_vbucket,1,pending},
            30000]}},
         {gen_fsm,sync_send_event,
          [{global,ns_orchestrator},
           {failover,'ns_1@10.223.62.106'},
           20000]}}},
       {trace,
        [{gen_fsm,sync_send_event,3},
         {ns_cluster_membership,failover,1},
         {menelaus_web,handle_failover,1},
         {menelaus_web,loop,3},
         {mochiweb_http,headers,5},
         {proc_lib,init_p_do_apply,3}]}] (repeated 1 times)
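
Reading the term above: the failover request for 'ns_1@10.223.62.106' reached ns_orchestrator, and while handling it a synchronous set_vbucket call to 'ns_1@10.223.62.182' (also down) exited with nodedown, which propagated back to the web handler. The following is a minimal Python sketch of that failure pattern, not ns_server code and not necessarily how the issue was fixed. All function names, the first step in `steps`, and the "tolerant" variant are hypothetical illustrations.

```python
class NodeDown(Exception):
    """Stands in for the Erlang exit reason {nodedown, Node}."""


def call_node(node, request, live_nodes):
    # Models a synchronous call to a peer: it fails if the peer is unreachable.
    if node not in live_nodes:
        raise NodeDown(node)
    return "ok"


def failover_strict(steps, live_nodes):
    # One unreachable node aborts the whole operation, as in the trace.
    for node, request in steps:
        call_node(node, request, live_nodes)
    return "ok"


def failover_tolerant(steps, live_nodes):
    # Hypothetical alternative: skip unreachable nodes and report them.
    skipped = []
    for node, request in steps:
        try:
            call_node(node, request, live_nodes)
        except NodeDown as exc:
            skipped.append(str(exc))
    return skipped


# The node that served the web UI in the report is assumed to be alive.
live = {"ns_1@10.122.10.120"}
steps = [
    ("ns_1@10.122.10.120", ("set_vbucket", 0, "active")),   # hypothetical step
    ("ns_1@10.223.62.182", ("set_vbucket", 1, "pending")),  # step from the trace
]

try:
    failover_strict(steps, live)
except NodeDown as exc:
    print("strict failover aborted, node down:", exc)

print("tolerant failover skipped:", failover_tolerant(steps, live))
```

The point of the sketch is only that a single call to an unreachable node is enough to fail the entire failover when nothing catches the exit.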

      Clicking remove on the failed servers puts them in pending rebalance, but clicking Rebalance also fails, with the following error in the logs:

      Rebalance exited with reason noconnection (repeated 1 times)    ns_orchestrator002    20:47:05 - Tue Oct 5, 2010
      Client-side error-report for user "Administrator" on node 'ns_1@10.122.10.120':
      User-Agent:Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)
      Got unhandled error: 'undefined' is null or not an object
      At: http://174.129.106.43:8080/js/all.js:6238
      Backtrace:
      Function: collectBacktraceViaCaller
      Args:

      I have tried with replication both enabled and disabled for the bucket, but cannot recover from this state. This is a fairly serious problem for us, as we cannot have the entire cluster fail because of a single machine failure.

      Any ideas? Are we doing something completely wrong here? Thanks in advance.
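
Because the log also contains a client-side JavaScript error, one way to rule out the web UI is to send the failover request directly to the `/controller/failOver` path seen in the server error. The sketch below only builds the request and does not send it. The `otpNode` form parameter, the use of HTTP Basic authentication, and the password are assumptions not confirmed by this report; the host, port, and node name are taken from the logs above.

```python
import base64
from urllib.parse import urlencode
from urllib.request import Request


def build_failover_request(base_url, otp_node, user, password):
    # Form-encode the node to fail over (parameter name is an assumption).
    body = urlencode({"otpNode": otp_node}).encode("ascii")
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    req = Request(
        base_url.rstrip("/") + "/controller/failOver",
        data=body,
        method="POST",
    )
    req.add_header("Authorization", "Basic " + token)
    req.add_header("Content-Type", "application/x-www-form-urlencoded")
    return req


req = build_failover_request(
    "http://174.129.106.43:8080",   # address seen in the client-side error report
    "ns_1@10.223.62.106",           # node named in the failover request
    "Administrator",
    "password",                     # placeholder credential
)
print(req.get_method(), req.full_url)

# To actually send it:
#   import urllib.request
#   urllib.request.urlopen(req, timeout=10)
```

If the direct request fails with the same server-side error, the problem lies in the failover handling rather than in the browser.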

          People

            sean@northscale.com Sean Lynch (Inactive)
            dustin@sallings.org Dustin Sallings (Inactive)