Couchbase Server / MB-4285

Node marked unhealthy but appears to be healthy; each node reports the other as unhealthy.


Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: 1.7.1
    • Component/s: couchbase-bucket, ns_server
    • Security Level: Public

    Description

      My issue is similar to http://www.couchbase.org/issues/browse/MB-3965?focusedCommentId=22163#comment-22163, but I was told in the #couchbase IRC channel to open a new issue.

      In a 2-node cluster, one node reports as active/healthy and the other as active/unhealthy. Which node is which depends on whose web console I log in to or run CLI commands against: the node I query always reports itself as healthy and the other node as unhealthy.

      '/opt/membase/bin/membase server-info -c <ip of node1>:8091' reports active/healthy for node1 and active/unhealthy for node2.
      '/opt/membase/bin/membase server-info -c <ip of node2>:8091' reports active/healthy for node2 and active/unhealthy for node1.

      'server-list' behaves the same way: the node I run the command against is always reported as healthy, and the other one as active/unhealthy.
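
      The two conflicting views can also be collected side by side. The following is a minimal sketch, not an official tool. It assumes that each node's REST endpoint `/pools/default` on port 8091 returns JSON with a `nodes` list whose entries carry `hostname`, `clusterMembership` and `status` fields, and that HTTP basic auth with the admin credentials is accepted. The function names are hypothetical.

```python
import base64
import json
import urllib.request


def extract_statuses(pool_json):
    """Map hostname -> (clusterMembership, status) from a /pools/default body.

    The field names are assumptions about the REST response shape.
    """
    return {
        node.get("hostname"): (node.get("clusterMembership"), node.get("status"))
        for node in pool_json.get("nodes", [])
    }


def fetch_node_statuses(host, user, password, port=8091, timeout=5.0):
    """Ask one node how it sees every node in the cluster."""
    url = "http://%s:%d/pools/default" % (host, port)
    request = urllib.request.Request(url)
    token = base64.b64encode(("%s:%s" % (user, password)).encode()).decode()
    request.add_header("Authorization", "Basic " + token)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return extract_statuses(json.load(response))


def disagreements(views):
    """Return the hostnames whose status differs between observers.

    views: {observer: {hostname: (membership, status)}}
    result: {hostname: {observer: status}} for conflicting hostnames only.
    """
    hostnames = set()
    for view in views.values():
        hostnames.update(view)
    result = {}
    for hostname in hostnames:
        seen = {obs: view[hostname][1] for obs, view in views.items() if hostname in view}
        if len(set(seen.values())) > 1:
            result[hostname] = seen
    return result
```

      Calling `fetch_node_statuses` once per node and passing both results to `disagreements` would list every node on which the two observers do not agree, which in the situation described above should be both nodes.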

      I can get into the web console of both nodes. Node1 has much higher network I/O than node2, so I am assuming that node2 is the one that is really unhealthy. Node1's network I/O is not especially high; node2's is at the level of another of my EC2 instances that is not doing anything.

      My log files are too large to attach, so I have put them on Dropbox:
      http://dl.dropbox.com/u/1374786/logs.tar.gz

      membase_web_consoel_diaognostic_log.txt: was generated by clicking 'Generate Diagnostic Report' from Node1 (the good node).
      nslogs_from_bad_node.txt: was generated by running /mbbrowse_logs from Node2 (the bad node)

      I have replaced personal/sensitive information in the log files with descriptive strings.

      goodNode = Node1
      badNode = Node2
      nodePendingRebalanceGoingToReplaceNode = a node that I have brought up to replace node2. It is pending rebalance.
      domU-12-31-38-07-4E-E9.compute-1.internal, domU-12-31-39-0B-05-08.compute-1.internal, domU-12-31-39-10-8A-A5.compute-1.internal and domU-12-31-39-09-29-13.compute-1.internal = instances that are no longer in my EC2 account. I cannot telnet to any of them on port 11211, nor can I ping them. I have been doing a lot of testing (bringing servers up and down), so my guess is that these are remnants of that.

      This cluster had been running fine for about a week before the problem happened, so I do not think it is a permission problem. My guess is that it is some sort of network problem, but I do not know how to diagnose it. Is there a membase CLI command I can run that verifies network connectivity on all the membase ports?

      My EC2 security group allows any node in the cluster to talk to any other node in the cluster over any TCP port (0 - 65535).
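
      In the absence of a known CLI command for the connectivity question above, a plain TCP connect test run from each node toward the other is one way to approximate it. The sketch below is an illustration only. The port list is an assumption based on the commonly documented Membase ports (4369 for the Erlang port mapper, 8091 for the web console/REST, 11210 and 11211 for data access), and `check_port` and `check_node` are hypothetical helper names. Inter-node Erlang traffic also uses a dynamic range (commonly 21100 - 21199) where only some ports are open at any time, so that range is left out.

```python
import socket
import sys

# Assumed fixed ports for a Membase node; adjust to match the deployment.
ASSUMED_PORTS = [4369, 8091, 11210, 11211]


def check_port(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def check_node(host, ports=ASSUMED_PORTS, timeout=2.0):
    """Return {port: reachable} for every port in `ports` on `host`."""
    return {port: check_port(host, port, timeout) for port in ports}


if __name__ == "__main__":
    # Usage: python check_ports.py <ip of node1> <ip of node2>
    for target in sys.argv[1:]:
        for port, reachable in sorted(check_node(target).items()):
            state = "open" if reachable else "unreachable"
            print("%s:%d %s" % (target, port, state))
```

      A successful connect only shows that the port is reachable from the machine running the script, so the check is most useful when run on node1 against node2 and again on node2 against node1.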

      NOTE: I am using memcached buckets, NOT membase buckets.

          People

            Assignee: Unassigned
            Reporter: rynop