Details
- Bug
- Resolution: Cannot Reproduce
- Major
- None
- 1.7.1
- Security Level: Public
EC2, 2 nodes in cluster (each m1.small), Ubuntu 10.10 32-bit
cat VERSION.txt
1.7.1
/opt/membase/bin$ ./erl --version
Erlang R14B02 (erts-5.8.3) [source] [rq:1] [async-threads:0] [hipe] [kernel-poll:false]
4 buckets. I use only memcached buckets (not membase ones). I do not use port-specific buckets. All buckets use the standard port (11211). I access the buckets via a client-side standalone moxi.
Description
My issue is similar to http://www.couchbase.org/issues/browse/MB-3965?focusedCommentId=22163#comment-22163, but I was told in the #couchbase IRC channel to open a new issue.
In a 2-node cluster, one node reports healthy and the other reports unhealthy/active, but which is which depends on which node's web console I log in to (or run the CLI against). The node I connect to always reports itself as healthy and the other node as unhealthy.
'/opt/membase/bin/membase server-info -c <ip of node1>:8091' says active/healthy for node1 and active/unhealthy for node2.
'/opt/membase/bin/membase server-info -c <ip of node2>:8091' says active/healthy for node2 and active/unhealthy for node1.
When running 'server-list', the node I run the command against is always reported healthy, and the other one unhealthy/active.
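The same per-node disagreement can be observed without the CLI by asking each node's REST endpoint on port 8091 for its view of the cluster. The sketch below is a hedged illustration: the `/pools/default` endpoint and its `nodes` array (with `hostname`, `clusterMembership`, and `status` fields) exist in Membase of this era, but the exact payload should be verified against your build, and the `fetch_statuses` helper plus the `node1`/`node2` hostnames are assumptions for illustration.

```python
import json
from urllib.request import urlopen  # only needed for querying a live node

def node_statuses(pools_default):
    """Extract (hostname, clusterMembership, status) for each node
    from a parsed /pools/default REST payload."""
    return [(n["hostname"], n["clusterMembership"], n["status"])
            for n in pools_default["nodes"]]

def fetch_statuses(host, port=8091):
    """Ask one node for its view of the cluster (hypothetical helper;
    your install may additionally require HTTP basic auth)."""
    with urlopen("http://%s:%d/pools/default" % (host, port)) as resp:
        return node_statuses(json.load(resp))

# Sample payload mimicking the symptom: the node you ask calls itself
# healthy and reports its peer as unhealthy.
sample = {"nodes": [
    {"hostname": "node1:8091", "clusterMembership": "active", "status": "healthy"},
    {"hostname": "node2:8091", "clusterMembership": "active", "status": "unhealthy"},
]}
print(node_statuses(sample))
```

Running `fetch_statuses` against each node in turn and diffing the results would show concretely that each node only distrusts its peer.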
I can get into the web console of both nodes. Node1 has much higher network I/O than node2, so I am deeming node2 the one that is really unhealthy. Node1's network I/O is not especially high, and node2's is at the level of another of my EC2 instances that is doing nothing at all.
My logfiles are too large to attach, so I have dropboxed them:
http://dl.dropbox.com/u/1374786/logs.tar.gz
membase_web_consoel_diaognostic_log.txt: was generated by clicking 'Generate Diagnostic Report' from Node1 (the good node).
nslogs_from_bad_node.txt: was generated by running /mbbrowse_logs from Node2 (the bad node)
I have replaced personal/sensitive information in the log files with descriptive strings.
goodNode = Node1
badNode = Node2
nodePendingRebalanceGoingToReplaceNode = a node I have brought up to replace node2 with. It is pending rebalance.
domU-12-31-38-07-4E-E9.compute-1.internal, domU-12-31-39-0B-05-08.compute-1.internal, domU-12-31-39-10-8A-A5.compute-1.internal, and domU-12-31-39-09-29-13.compute-1.internal = instances that are no longer in my EC2 account. I cannot telnet to any of them on 11211, nor can I ping them. I have been doing a lot of testing (bringing servers up and down), so my guess is these are remnants of that.
This cluster had been running fine for about a week before the problem happened, so I do not think it is a permissions problem. My guess is it's some sort of network problem, but I do not know how to diagnose it. Is there some membase CLI command I can run that verifies network connectivity on all the membase ports?
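I am not aware of a built-in CLI for this in 1.7, but raw TCP reachability on the usual Membase ports can be probed with a short script like the one below. This is only a sketch: the port list (8091 for REST/web console, 11211 for moxi/memcached, 11210 for direct memcached, 4369 for the Erlang port mapper) is an assumption that should be checked against your configuration, and the Erlang distribution ports use a further dynamic range not covered here.

```python
import socket

# Ports typically used by Membase 1.7 (assumption -- verify against your install):
# 8091 = REST/web console, 11211 = moxi/memcached, 11210 = direct memcached,
# 4369 = Erlang port mapper (epmd).
MEMBASE_PORTS = [8091, 11210, 11211, 4369]

def port_open(host, port, timeout=3.0):
    """Return True if a plain TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused, timed out, and DNS failures
        return False

def check_node(host, ports=MEMBASE_PORTS):
    """Probe each port on one node and return {port: reachable}."""
    return {port: port_open(host, port) for port in ports}
```

Run from each node against every other node's private address; a port that is open in one direction but not the other would point at the kind of network problem suspected above.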
My EC2 security group allows any node in the cluster to talk to any other node in the cluster over any TCP port (0-65535).
NOTE: I am using memcached buckets NOT membase buckets.