Details
- Bug
- Resolution: Cannot Reproduce
- Major
- None
- 1.7.1
- Security Level: Public
EC2, 2 nodes in cluster (each m1.small), Ubuntu 10.10 32-bit
cat VERSION.txt
1.7.1
/opt/membase/bin$ ./erl --version
Erlang R14B02 (erts-5.8.3) [source] [rq:1] [async-threads:0] [hipe] [kernel-poll:false]
4 buckets. I use only memcached buckets (not membase ones). I do not use port-specific buckets. All buckets use the standard port (11211). I access the buckets via a client-side standalone moxi.
Description
My issue is similar to http://www.couchbase.org/issues/browse/MB-3965?focusedCommentId=22163#comment-22163, but I was told in the #couchbase IRC channel to open a new issue.
In a 2-node cluster, one node reports healthy and the other reports unhealthy/active, but which is which depends on which node's web console I log in to (or run the CLI against). The node I connect to always reports itself as healthy and the other node as unhealthy.
'/opt/membase/bin/membase server-info -c <ip of node1>:8091' says active/healthy for node1 and active/unhealthy for node2.
'/opt/membase/bin/membase server-info -c <ip of node2>:8091' says active/healthy for node2 and active/unhealthy for node1.
When running 'server-list', the node I run the command against is always reported healthy, and the other one unhealthy/active.
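The same per-node disagreement can be observed without the CLI by asking each node's REST endpoint on port 8091 for its view of the cluster. The sketch below is a hedged illustration: the `/pools/default` endpoint and its `nodes` array (with `hostname`, `clusterMembership`, and `status` fields) exist in Membase of this era, but the exact payload should be verified against your build, and the `fetch_statuses` helper plus the `node1`/`node2` hostnames are assumptions for illustration.

```python
import json
from urllib.request import urlopen  # only needed for querying a live node

def node_statuses(pools_default):
    """Extract (hostname, clusterMembership, status) for each node
    from a parsed /pools/default REST payload."""
    return [(n["hostname"], n["clusterMembership"], n["status"])
            for n in pools_default["nodes"]]

def fetch_statuses(host, port=8091):
    """Ask one node for its view of the cluster (hypothetical helper;
    your install may additionally require HTTP basic auth)."""
    with urlopen("http://%s:%d/pools/default" % (host, port)) as resp:
        return node_statuses(json.load(resp))

# Sample payload mimicking the symptom: the node you ask calls itself
# healthy and reports its peer as unhealthy.
sample = {"nodes": [
    {"hostname": "node1:8091", "clusterMembership": "active", "status": "healthy"},
    {"hostname": "node2:8091", "clusterMembership": "active", "status": "unhealthy"},
]}
print(node_statuses(sample))
```

Running `fetch_statuses` against each node in turn and diffing the results would show concretely that each node only distrusts its peer.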
I can get into the web console of both nodes. Node1 has much higher network I/O than node2, so I am deeming node2 the one that is really unhealthy. Node1's network I/O is not especially high, and node2's is at the level of another of my EC2 instances that is doing nothing at all.
My logfiles are too large to attach, so I have dropboxed them:
http://dl.dropbox.com/u/1374786/logs.tar.gz
membase_web_consoel_diaognostic_log.txt: was generated by clicking 'Generate Diagnostic Report' from Node1 (the good node).
nslogs_from_bad_node.txt: was generated by running /mbbrowse_logs from Node2 (the bad node)
I have replaced personal/sensitive information in the log files with descriptive strings.
goodNode = Node1
badNode = Node2
nodePendingRebalanceGoingToReplaceNode = a node I have brought up to replace node2 with. It is pending rebalance.
domU-12-31-38-07-4E-E9.compute-1.internal, domU-12-31-39-0B-05-08.compute-1.internal, domU-12-31-39-10-8A-A5.compute-1.internal, and domU-12-31-39-09-29-13.compute-1.internal = instances that are no longer in my EC2 account. I cannot telnet to any of them on 11211, nor can I ping them. I have been doing a lot of testing (bringing servers up and down), so my guess is these are remnants of that.
This cluster had been running fine for about a week before the problem happened, so I do not think it is a permissions problem. My guess is it's some sort of network problem, but I do not know how to diagnose it. Is there some membase CLI command I can run that verifies network connectivity on all the membase ports?
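I am not aware of a built-in CLI for this in 1.7, but raw TCP reachability on the usual Membase ports can be probed with a short script like the one below. This is only a sketch: the port list (8091 for REST/web console, 11211 for moxi/memcached, 11210 for direct memcached, 4369 for the Erlang port mapper) is an assumption that should be checked against your configuration, and the Erlang distribution ports use a further dynamic range not covered here.

```python
import socket

# Ports typically used by Membase 1.7 (assumption -- verify against your install):
# 8091 = REST/web console, 11211 = moxi/memcached, 11210 = direct memcached,
# 4369 = Erlang port mapper (epmd).
MEMBASE_PORTS = [8091, 11210, 11211, 4369]

def port_open(host, port, timeout=3.0):
    """Return True if a plain TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused, timed out, and DNS failures
        return False

def check_node(host, ports=MEMBASE_PORTS):
    """Probe each port on one node and return {port: reachable}."""
    return {port: port_open(host, port) for port in ports}
```

Run from each node against every other node's private address; a port that is open in one direction but not the other would point at the kind of network problem suspected above.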
My EC2 security group allows any node in the cluster to talk to any other node in the cluster over any TCP port (0-65535).
NOTE: I am using memcached buckets NOT membase buckets.