MB-7926: Unsuccessful upgrade from 2.0.0 to 2.0.1


Details

    • Type: Bug
    • Resolution: Won't Fix
    • Priority: Major
    • Fix Version/s: 3.0
    • Affects Version/s: 2.0
    • Component/s: tools
    • Security Level: Public
    • Labels: None
    • Environment: 3-node cluster:

      Linux mango-010 2.6.18-274.3.1.el5xen #1 SMP Tue Sep 6 20:57:11 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux
      CentOS release 5.7 (Final)
      each node has 4 cores and 6 GB of RAM
      4 buckets

    Description

      Upgrade the Mango cluster from 2.0.0 EE to 2.0.1 EE

      Cluster Info:

      3 nodes (labeled A, B, and C below); each node has 4 cores and 6 GB of RAM

      A - 10.1.3.248 (currently down)
      B - 10.1.3.251
      C - 10.1.4.6

      Sequence of events

      1. The original plan was to do a swap-rebalance online upgrade. However, our spare node lacked sufficient RAM to join the cluster.

      2. The fallback plan was to upgrade one node at a time, rebalancing nodes out and back in to keep the cluster online.

      3. Node A was rebalanced out of the cluster. This took about 2-2.5 hours. Occasionally the UI would report a failed XHR, but it always succeeded when retried. The operation was successful.

      4. Node A was upgraded as follows:

      a. /etc/init.d/couchbase-server stop
      b. verified all couch processes stopped
      c. rpm -e couchbase-server
      d. rm -rf /opt/couchbase
      e. rpm -ivh <couchbase 2.0.1 rpm file>
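
      For reference, the same procedure as a single shell sketch run directly on node A (the RPM placeholder stands for the actual 2.0.1 EE package filename, and the process check is just one way of verifying step b):

        /etc/init.d/couchbase-server stop
        # confirm no couchbase/memcached/beam processes are still running
        ps aux | grep -iE '[c]ouchbase|[m]emcached|[b]eam'
        rpm -e couchbase-server
        rm -rf /opt/couchbase
        rpm -ivh <couchbase 2.0.1 rpm file>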

      5. Node A was added to the cluster

      6. Rebalance was started
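
      For reference, a rough couchbase-cli sketch of the rebalance-out (step 3) and add-back/rebalance (steps 5-6) operations, assuming the 2.0-era couchbase-cli options; the cluster is addressed through node B, and the credentials are placeholders:

        # rebalance node A out of the cluster (step 3)
        /opt/couchbase/bin/couchbase-cli rebalance -c 10.1.3.251:8091 \
            --server-remove=10.1.3.248:8091 -u Administrator -p <password>

        # add the upgraded node A back and rebalance (steps 5-6)
        /opt/couchbase/bin/couchbase-cli rebalance -c 10.1.3.251:8091 \
            --server-add=10.1.3.248:8091 \
            --server-add-username=Administrator --server-add-password=<password> \
            -u Administrator -p <password>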

      7. About an hour later I noticed that the console said "rebalance failed". I wasn't watching the console the whole time, so I can't describe how far it got or whether there were timeouts occurring...

      8. The error log in the UI indicated:

      Control connection to memcached on 'ns_1@10.1.3.248' disconnected: {{badmatch, {error, closed}},
       [{mc_client_binary, cmd_binary_vocal_recv, 5},
        {mc_client_binary, select_bucket, 2},
        {ns_memcached, ensure_bucket, 2},
        {ns_memcached, handle_info, 2},
        {gen_server, handle_msg, 5},
        {proc_lib, init_p_do_apply, 3}]} (repeated 2 times)

      I did not dig deeper into the logs at this point.

      9. To my surprise, the node was part of the cluster. Three of our four buckets indicated they had 3 servers; one bucket indicated it only had 2 servers. I wasn't aware rebalance could fail in this way, and I assumed my only option was to remove the node, rebalance, and try again.

      10. Node A was removed from the cluster.

      11. Rebalance was started.

      12. Rebalance was successful; all buckets report 2 servers.

      13. At this point I decided to try again.

      14. Node A was added to the cluster.

      15. Rebalance was started. This time I sat and watched the whole thing.

      16. Rebalance made it to 50% (on all nodes) without any apparent issues.

      17. Sometime after this, it appeared that XHR requests to node A were timing out.

      18. Shortly after that it appeared that node A had crashed. Other nodes in the cluster consider it down. It responds to pings, but I cannot ssh into the server.

      19. At this point the cluster is still operational. The console shows 3 servers, but 1 is down. I see 4 buckets; one shows 2/2 servers, and the other three buckets show 2/3 servers.
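
      For reference, the per-bucket server counts above can also be checked against the REST API rather than the console; a rough sketch, assuming the standard /pools/default/buckets endpoint (credentials are placeholders):

        # each bucket object in the response includes a "nodes" array; its length
        # should match the server count the console shows for that bucket
        curl -s -u Administrator:<password> http://10.1.3.251:8091/pools/default/buckets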

      The primary question is what to do at this point...

      1. Normally I would have expected to fail over the crashed node. However, the console warns me that replica copies don't exist for all data on this node. How could this happen? Even if node A had already become master for some of the vbuckets, shouldn't the replicas be on node B or C?

      2. Node A will be rebooted shortly. Does that improve the situation, or make things worse?

      What should we do at this point?
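
      For reference on question 1: if failing over node A does turn out to be the right call, I assume it would be something like the following (a rough sketch assuming the 2.0-era couchbase-cli failover options; credentials are placeholders):

        # fail over the down node, once it is confirmed safe to lose its active vbuckets
        /opt/couchbase/bin/couchbase-cli failover -c 10.1.3.251:8091 \
            --server-failover=10.1.3.248:8091 -u Administrator -p <password>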


      People

        Assignee: Marty Schoch (mschoch) (Inactive)
        Reporter: Marty Schoch (mschoch) (Inactive)
        Votes: 0
        Watchers: 4
