Details
Description
Upgrade Mango cluster from 2.0.0 EE to 2.0.1 EE
Cluster Info:
3 nodes (labeled A, B, and C below); each node has 4 cores and 6 GB of RAM
A - 10.1.3.248 (currently down)
B - 10.1.3.251
C - 10.1.4.6
Sequence of events
1. The original plan was to do a swap-rebalance online upgrade. However, our spare node lacked sufficient RAM to join the cluster.
2. Next, the plan was to upgrade one node at a time, rebalancing nodes out and back in, keeping the cluster online.
3. Node A was rebalanced out of the cluster. This took about 2-2.5 hours. Occasionally the UI would indicate a failed XHR, but the request always succeeded when retried. The operation was successful.
4. Node A was upgraded as follows.
a. /etc/init.d/couchbase-server stop
b. verified all couch processes stopped
c. rpm -e couchbase-server
d. rm -rf /opt/couchbase
e. rpm -ivh <couchbase 2.0.1 rpm file>
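For what it's worth, the per-node upgrade in steps a-e amounts to the sketch below. This is a hypothetical consolidation, not a script we actually ran: the RPM filename is a placeholder, and using pgrep for step b is an assumption. The DRY_RUN guard lets you review the commands before executing anything destructive.

```shell
#!/bin/sh
# Hypothetical consolidation of steps a-e; the RPM filename is a placeholder
# and the pgrep check is an assumed way to implement step b.
upgrade_node() {
  rpm_file="$1"
  run() {
    # In dry-run mode, print each command instead of executing it.
    if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi
  }
  run /etc/init.d/couchbase-server stop                       # step a
  # Step b: refuse to continue while any couchbase process is still alive.
  if [ "$DRY_RUN" != "1" ] && pgrep -f couchbase >/dev/null 2>&1; then
    echo "couchbase processes still running; aborting" >&2
    return 1
  fi
  run rpm -e couchbase-server                                 # step c
  run rm -rf /opt/couchbase                                   # step d
  run rpm -ivh "$rpm_file"                                    # step e
}

# Review the command sequence without executing anything:
DRY_RUN=1 upgrade_node couchbase-server-enterprise_x86_64_2.0.1.rpm
```

Run it without DRY_RUN=1 (as root, with the real RPM path) to perform the upgrade on a node that has already been rebalanced out.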
5. Node A was added to the cluster
6. Rebalance was started
7. About an hour later I noticed that the console said "rebalance failed". I wasn't watching the console the whole time, so I can't describe how far it got or whether timeouts were occurring.
8. The error log in the UI indicated:
Control connection to memcached on 'ns_1@10.1.3.248' disconnected:
  {{badmatch, {error, closed}},
   [...,
    {mc_client_binary, select_bucket, 2},
    {ns_memcached, ensure_bucket, 2},
    {ns_memcached, handle_info, 2},
    {gen_server, handle_msg, 5},
    {proc_lib, init_p_do_apply, 3}]} (repeated 2 times)
I did not dig deeper into the logs at this point.
9. To my surprise, the node was still part of the cluster. Three of our four buckets indicated they had 3 servers; one bucket indicated it only had 2. I wasn't aware rebalance could fail in this way, and I assumed my only option was to remove the node, rebalance, and try again.
10. Node A was removed from the cluster.
11. Rebalance was started.
12. Rebalance was successful; all buckets report 2 servers.
13. At this point I decided to try again.
14. Node A was added to the cluster.
15. Rebalance was started. This time I sat and watched the whole thing.
16. Rebalance made it to 50% (on all nodes) without any apparent issues.
17. Sometime after this, XHR requests to node A appeared to be timing out.
18. Shortly after that, node A appeared to have crashed. The other nodes in the cluster consider it down; it responds to pings, but I cannot ssh into the server.
19. At this point the cluster is still operational. The console shows 3 servers, 1 of which is down. I see 4 buckets: one shows 2/2 servers, and the other three show 2/3 servers.
The primary question is what to do at this point...
1. Normally I would expect to fail over the crashed node. However, the console warns me that replica copies don't exist for all data on this node. How could this happen? Even if node A had already become master for some of the vbuckets, shouldn't the replicas be on node B or C?
2. Node A will be rebooted shortly; does that improve the situation or make things worse?
What should we do at this point?