Couchbase Server
MB-4476

Adding a node that has the same otpCookie as the cluster causes issues (happens when the VM is cloned or created from an AMI/VM template where Couchbase was already installed)

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 1.7.2
    • Fix Version/s: 2.0-beta
    • Component/s: ns_server
    • Security Level: Public
    • Environment:
      Membase 1.7.2 installed via deb on Ubuntu 11.10 64 bit running on an Amazon EC2 m2.xlarge instance.

      Description

      I'm seeing intermittent problems when building a cluster of Membase servers. Sometimes adding an additional server to a cluster will cause another server to be marked as unhealthy. Other times the additional server will function correctly but all the others will appear to be in setup mode. This only happens sometimes, so it's tricky to nail down properly.

      Steps to reproduce this are in this forum thread: http://www.couchbase.org/forums/thread/server-marked-unhealthy-after-adding-additional-server-cluster

      In the most recent test run, I spun up 3 m2.xlarge instances as described in the forum post. On adding the third server to the cluster (from the admin web interface on the first server), the admin interface refreshed and showed the setup dialog. When I browsed to the second server it was also in setup mode. The third server was configured correctly and appeared to be running in its own cluster of one machine. I've attached the output from /logs on all three machines.

      (Forgive me if I have the wrong component selected, I'm not quite sure which one is causing me to see this issue.)

      Attachments

      1. membase_a_log.json
        6 kB
        Conor
      2. membase_a_ns-diag-20111130185447.txt
        985 kB
        Conor
      3. membase_b_log.json
        5 kB
        Conor
      4. membase_b_ns-diag-20111130185430.txt
        725 kB
        Conor
      5. membase_c_log.json
        6 kB
        Conor

        Activity

        Farshid Ghods (Inactive) added a comment -

        If possible, please attach the diags from the existing node and the node you added to the cluster.

        Conor McDermottroe added a comment -

        Sorry, I should have described the attachments a little better.

        membase_a_log.json is the output from /log on the machine where I ran the setup via the web console.

        membase_b_log.json is the output from /log on the first machine which I added to the cluster. (After adding it, the cluster was a functioning two-machine cluster.)

        membase_c_log.json is the output from /log on the second machine which I added to the cluster. (After adding it, the cluster was broken.)

        Are there other diagnostics which would be useful? I can re-run the test if necessary.

        Conor McDermottroe added a comment -

        I've replicated the issue again, this time with two machines.

        I ran setup on A, which resulted in an apparently OK 1 node "cluster".

        I then added B to A which resulted in A being in setup mode and B being in a 1 node "cluster".

        I've attached the output of /diag and mbcollect_info from both machines.

        Conor McDermottroe added a comment -

        Oh, and sorry for the naming. Just in case it's not 100% clear, the A and B in the second test with the output of /diag and mbcollect_info are not the same as the A and B in the first test.

        Conor McDermottroe added a comment -

        Was the information I attached above any use? I can re-run and gather additional information if you need.

        Aleksey Kondratenko (Inactive) added a comment -

        It is useful, thanks a lot for taking the time to get it and create the ticket.

        Matt Ingenthron added a comment -

        Adding to the 1.8.0 fix-for list. Per discussion today, QE will look into this, determine whether it still belongs on 1.8, and triage priority/severity.

        Aleksey Kondratenko (Inactive) added a comment - edited

        Found what happened. Thanks again, Conor, very much for reporting it.

        Something interesting happened: both nodes have the same initial Erlang cookie, and that causes them to communicate too early in the join process. The node that is joining gets the config from the node being joined, sees a config conflict, and picks the 'wrong' version. That produces a nodes_wanted containing only the new (just-joined) node, which causes the original cluster node to leave the cluster.

        This is very interesting and we haven't seen it before. Have you cloned a VM? It looks like this is possible in EC2 via custom images, and the logs kind of confirm it: there is a time jump of 8 days before the last start of the node.

        If not, it could be due to the nodes being launched at the same time combined with the low clock resolution of EC2: the initial cookie is generated by an RNG, and the RNG is seeded with the clock. Erlang itself has microsecond clock precision, but the underlying kernel (and, in the case of Xen, the underlying hypervisor or Dom0 kernel) does not necessarily support that. That seems very unlikely, though, so I bet on cloning.
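
        For illustration, here is a minimal sketch of how a clock-seeded cookie generator behaves. This is not ns_server's actual cookie_gen implementation, just an assumption-based demo of why cloned images or VMs started with identical clock state can end up with the same cookie:

            %% Illustrative only: not the real ns_node_disco:cookie_gen/0.
            %% Two VMs that boot with identical clock state (a cloned image,
            %% or instances started in the same clock tick on a coarse-grained
            %% hypervisor clock) seed the RNG identically and therefore
            %% generate identical cookies.
            -module(cookie_demo).
            -export([gen/0]).

            gen() ->
                {A, B, C} = erlang:now(),          % timestamp used as the RNG seed
                random:seed(A, B, C),
                list_to_atom([random:uniform(26) + $a - 1
                              || _ <- lists:seq(1, 16)]).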

        In order to fix this issue correctly, I need you to confirm whether or not you cloned your VM.

        Meanwhile, the following command can be used to re-init the cookie of a node (don't do that on nodes that are already joined to a cluster):

            wget -O- --post-data='NewCookie = ns_node_disco:cookie_gen(), ns_config:set(otp, [{cookie, NewCookie}]).' --user=Administrator --password=asdasd http://lh:9000/diag/eval

        Replace the password with your admin password, and host:port with your REST host:port (8091 is the default port). Doing it on any of the nodes prior to joining will likely fix your problem.
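
        To confirm the cookie actually changed, a similar /diag/eval call can be used to read the otp config back. This assumes ns_config:search/1 is available and its return shape may differ between versions:

            %% Post this expression to /diag/eval the same way as the command
            %% above; it should return the otp config, including the new cookie.
            %% ns_config:search/1 is an assumption and may vary by version.
            ns_config:search(otp).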

        Conor McDermottroe added a comment -

        Thanks all for chasing this down.

        I created the AMI from a running instance after installing Membase from the deb, so I guess that's the issue.

        I'm going to test re-initializing the cookie after launch but before adding it to the cluster and see if that fixes the issue. I'll report back here with the results.

        Conor McDermottroe added a comment -

        If I re-initialize the Erlang cookie before adding a machine to the cluster I can't replicate the error.

        Looks good so far, thanks!

        Aleksey Kondratenko (Inactive) added a comment -

        A good one for learning our clustering.

        Here we need to change the node's uuid and cookie prior to joining; /engage.. seems like the right place.

        This is against 1.8.
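
        A rough sketch of that idea (an assumption-based illustration, not the actual ns_cluster.erl change): regenerate this node's cookie immediately before it engages/joins another cluster, using the same calls shown in the workaround above.

            %% Hypothetical helper, for illustration only. It shows the intent
            %% of the fix: make sure a cookie inherited from a cloned image is
            %% replaced before the node joins a cluster.
            reset_cookie_before_join() ->
                NewCookie = ns_node_disco:cookie_gen(),
                ns_config:set(otp, [{cookie, NewCookie}]),
                %% Also update the distribution-layer cookie; the real code
                %% may handle this elsewhere.
                erlang:set_cookie(node(), NewCookie).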

        Aleksey Kondratenko (Inactive) added a comment -

        http://review.couchbase.org/14062

        Thuan Nguyen added a comment -

        Integrated in github-ns-server-2-0 #321 (See http://qa.hq.northscale.net/job/github-ns-server-2-0/321/)
        reset node's cookie before joining cluster. MB-4476 (Revision 9fbf7871720192010e65e1b5610dece9383a0f30)

        Result = SUCCESS
        Aliaksey Kandratsenka :
        Files :

        • src/ns_cluster.erl
        Aleksey Kondratenko (Inactive) added a comment -

        Actually fixed for 1.8.1 here: http://review.couchbase.org/14186

        Aleksey Kondratenko (Inactive) added a comment -

        done


          People

          • Assignee:
            Aleksey Kondratenko (Inactive)
            Reporter:
            Conor McDermottroe
          • Votes:
            0 Vote for this issue
            Watchers:
            6 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Gerrit Reviews

              There are no open Gerrit changes