Couchbase C client library libcouchbase / CCBC-192

Failure to handle host that has been removed from a cluster

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 2.0.4
    • Fix Version/s: 2.1.1
    • Component/s: library
    • Security Level: Public
    • Labels:
      None
    • Environment:
      Linux, Ubuntu 12.04 Precise

      Description

      We're encountering an issue where the get() method of both the Ruby and Perl clients consistently returns an error. On the Perl client it is "Temporary error. try again later", and on Ruby it is "Bucket not found".

      This occurs when we have several couchbase servers, and have connected using the node list option to supply all of them to the library.

      We remove the first server in the node list via the Couchbase GUI, trigger a rebalance, and wait for it to complete.
      Once the rebalance has completed, we start a script that simply attempts to connect and retrieve some keys we previously inserted. Instead of success, we find that it repeatedly gives the errors mentioned above.

      Note that if we remain connected to the cluster, everything seems fine. The errors only start occurring when a client makes a fresh connection.

      Note also that the issue only occurs if the first server (or servers) in the list have been removed from the cluster; if the first server listed is still actively in the cluster, then we're OK.

      It seems the problem is that the now-removed server is still accepting connections on port 8091, and the client library therefore thinks it is a valid server. However, because it has been removed from the pool, it causes confusion and errors for the client.

      Obviously, the correct/desired behaviour would be for the client library to behave as if the removed server were not accepting connections, and to move on to the next server in the node list.
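
      For reference, a minimal sketch of the kind of multi-host bootstrap that happens through the underlying C library (hostnames and bucket name are placeholders; the v0 create-options layout is the one used with libcouchbase 2.0.x):

      #include <stdio.h>
      #include <string.h>
      #include <libcouchbase/couchbase.h>

      int main(void)
      {
          struct lcb_create_st options;
          lcb_t instance;
          lcb_error_t err;

          memset(&options, 0, sizeof(options));
          /* Semicolon-separated bootstrap list; cb1 is the node that later
           * gets removed from the cluster. Placeholder hostnames/bucket. */
          options.v.v0.host = "cb1.example.com:8091;cb2.example.com:8091;cb3.example.com:8091";
          options.v.v0.bucket = "default";

          if ((err = lcb_create(&instance, &options)) != LCB_SUCCESS) {
              fprintf(stderr, "create failed: %s\n", lcb_strerror(NULL, err));
              return 1;
          }

          /* Bootstrap: the library walks the host list to obtain the cluster map.
           * The failure described above shows up here when cb1 has been removed
           * from the cluster but still answers on port 8091. */
          if ((err = lcb_connect(instance)) != LCB_SUCCESS) {
              fprintf(stderr, "connect failed: %s\n", lcb_strerror(instance, err));
              return 1;
          }
          lcb_wait(instance);

          /* ... schedule lcb_get() for the previously stored keys here ... */

          lcb_destroy(instance);
          return 0;
      }

      A process that bootstrapped before the rebalance keeps working; a process started from this code afterwards hits the errors above.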


          Activity

          ingenthr Matt Ingenthron added a comment -

          Others should comment as well, but our current expectation is that if you're removing a server from the cluster, you're also removing it from the bootstrap list. Thus, the best practice would be to remove it from the list first, then remove the server from the cluster.

          We can probably make this better though.

          wintrmute Toby Corkindale added a comment -

          That is our current work-around. However, it is rather annoying and makes rolling updates/reboots an extremely time-consuming process.
          i.e. update Puppet with the new node list, wait an hour for that to be deployed to every server, then rebalance and remove server1; apply system updates and reboot, then add it back to the cluster and rebalance. Then update Puppet's node list to include that server but not the next one, wait an hour for it to apply everywhere, and so on.

          There's also the danger that someone will forget to update the managed node list one day and remove a server from the cluster, expecting it to Just Work.

          I also have to point out that the current behaviour is unexpected, and I don't recall seeing any warnings about it in the documentation or tutorials; thus it's likely other users may hit this in the future and have some accidental production downtime, as we did.

          ingenthr Matt Ingenthron added a comment -

          Yes, all good points-- we'll definitely look to make it better.

          avsej Sergey Avseyev added a comment -

          http://review.couchbase.org/28444

          avsej Sergey Avseyev added a comment -

          So the fix was actually applied to libcouchbase: there is another setting which allows libcouchbase to skip misconfigured nodes. It will be accessible in 2.1.1, and then I will expose this setting to the Ruby client (RCBC-138).
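
          For illustration, a minimal sketch of how such a setting might be switched on from C once 2.1.1 is available. The control name below is an assumption based on the linked change and should be verified against <libcouchbase/cntl.h> in the release; lcb_cntl() itself is the standard tuning entry point:

          #include <libcouchbase/couchbase.h>

          /* Hypothetical constant name; check cntl.h in 2.1.1 before relying on it. */
          static lcb_error_t enable_skip_config_errors(lcb_t instance)
          {
              int enable = 1;
              return lcb_cntl(instance, LCB_CNTL_SET,
                              LCB_CNTL_SKIP_CONFIGURATION_ERRORS_ON_CONNECT, &enable);
          }

          This would be called between lcb_create() and lcb_connect(), so that bootstrap tolerates a node that still answers on 8091 but no longer serves the bucket.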

          ingenthr Matt Ingenthron added a comment -

          Why would we not skip misconfigured nodes by default?

          avsej Sergey Avseyev added a comment -

          Because when the user gives us a list of nodes, we trust the user and treat the list as just an ordered set of nodes belonging to the same cluster; we cannot say "hey, node1 and node3 are not from the cluster for application 'foobar'". I think we should be secure by default and report the unexpected input argument.


            People

            • Assignee:
              avsej Sergey Avseyev
            • Reporter:
              wintrmute Toby Corkindale
            • Votes:
              2
            • Watchers:
              4

              Dates

              • Created:
              • Updated:
              • Resolved:

                Gerrit Reviews

                There are no open Gerrit changes