Details
Description
Upgrade Mango cluster from 2.0.0 EE to 2.0.1 EE
Cluster Info:
3 nodes (labeled A, B, and C below); each node has 4 cores and 6 GB of RAM
A - 10.1.3.248 (currently down)
B - 10.1.3.251
C - 10.1.4.6
Sequence of events
1. The original plan was to do a swap-rebalance online upgrade. However, our spare node lacked sufficient RAM to join the cluster.
2. Next, the plan was to upgrade one node at a time, rebalancing nodes out and back in, keeping the cluster online.
3. Node A was rebalanced out of the cluster. This took about 2-2.5 hours. Occasionally the UI would indicate a failed XHR, but the request always succeeded when retried. The operation was successful.
4. Node A was upgraded as follows.
a. /etc/init.d/couchbase-server stop
b. verified all couch processes stopped
c. rpm -e couchbase-server
d. rm -rf /opt/couchbase
e. rpm -ivh <couchbase 2.0.1 rpm file>
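For what it's worth, the per-node upgrade in steps a-e amounts to the sketch below. This is a hypothetical consolidation, not a script we actually ran: the RPM filename is a placeholder, and using pgrep for step b is an assumption. The DRY_RUN guard lets you review the commands before executing anything destructive.

```shell
#!/bin/sh
# Hypothetical consolidation of steps a-e; the RPM filename is a placeholder
# and the pgrep check is an assumed way to implement step b.
upgrade_node() {
  rpm_file="$1"
  run() {
    # In dry-run mode, print each command instead of executing it.
    if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi
  }
  run /etc/init.d/couchbase-server stop                       # step a
  # Step b: refuse to continue while any couchbase process is still alive.
  if [ "$DRY_RUN" != "1" ] && pgrep -f couchbase >/dev/null 2>&1; then
    echo "couchbase processes still running; aborting" >&2
    return 1
  fi
  run rpm -e couchbase-server                                 # step c
  run rm -rf /opt/couchbase                                   # step d
  run rpm -ivh "$rpm_file"                                    # step e
}

# Review the command sequence without executing anything:
DRY_RUN=1 upgrade_node couchbase-server-enterprise_x86_64_2.0.1.rpm
```

Run it without DRY_RUN=1 (as root, with the real RPM path) to perform the upgrade on a node that has already been rebalanced out.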
5. Node A was added to the cluster
6. Rebalance was started
7. About an hour later I noticed that the console said "rebalance failed". I wasn't watching the console the whole time, so I can't describe how far it got or whether timeouts were occurring.
8. The error log in the UI indicated:
Control connection to memcached on 'ns_1@10.1.3.248' disconnected:
  {{badmatch, {error, closed}},
   [...,
    {mc_client_binary, select_bucket, 2},
    {ns_memcached, ensure_bucket, 2},
    {ns_memcached, handle_info, 2},
    {gen_server, handle_msg, 5},
    {proc_lib, init_p_do_apply, 3}]} (repeated 2 times)
I did not dig deeper into the logs at this point.
9. To my surprise, the node was still part of the cluster. Three of our four buckets indicated they had 3 servers; one bucket indicated it only had 2. I wasn't aware rebalance could fail in this way, and I assumed my only option was to remove the node, rebalance, and try again.
10. Node A was removed from the cluster.
11. Rebalance was started.
12. Rebalance was successful; all buckets report 2 servers.
13. At this point I decided to try again.
14. Node A was added to the cluster.
15. Rebalance was started. This time I sat and watched the whole thing.
16. Rebalance made it to 50% (on all nodes) without any apparent issues.
17. Sometime after this, XHR requests to node A appeared to be timing out.
18. Shortly after that, node A appeared to have crashed. The other nodes in the cluster consider it down; it responds to pings, but I cannot ssh into the server.
19. At this point the cluster is still operational. The console shows 3 servers, 1 of which is down. I see 4 buckets: one shows 2/2 servers, and the other three show 2/3 servers.
The primary question is what to do at this point...
1. Normally I would expect to fail over the crashed node. However, the console warns me that replica copies don't exist for all data on this node. How could this happen? Even if node A had already become master for some of the vbuckets, shouldn't the replicas be on node B or C?
2. Node A will be rebooted shortly; does that improve the situation or make things worse?
What should we do at this point?