Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Won't Fix
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Security Level: Public
    • Labels:
    • Environment:
      PHP 5.4.9-4ubuntu2.2
      Couchbase 2.1
      Couchbase PHP Extension 1.1.5
      libcouchbase2 2.0.7-2282

      Description

      On our test environment I've got three nodes on the local network in one cluster with one web server. I've created a virtual machine remotely to set up XDCR which worked perfectly.

      I then tried to imitate a datacentre going offline by unplugging the network cable from each of the three nodes in the local cluster.

      I'd already changed my PHP to be something along the lines of:

      define('COUCHBASE_SERVERS', 'couchbase1.office:8091;couchbase2.office:8091;couchbase3.office:8091;remote.vnode.com:8091');
      define('COUCHBASE_USER', 'username');
      define('COUCHBASE_PASS', 'pass');
      define('COUCHBASE_DATA_BUCKET', 'data');
      define('COUCHBASE_WEB_BUCKET', 'web');
      define('COUCHBASE_SESSION_BUCKET', 'session');

      I have three classes, one to connect to the data bucket and perform view queries amongst others, one to connect to the session bucket and store sessions in JSON format and another to cache the generated HTML in the web bucket.

      Each class initiates the connection via:

      $couchbase = new Couchbase(COUCHBASE_SERVERS, COUCHBASE_USER, COUCHBASE_PASS, COUCHBASE_WEB_BUCKET);

      Obviously with the respective bucket.
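
      As a minimal sketch (the key name and fallback behaviour here are my assumptions, not part of the extension's documented API), the constructor and read could be wrapped so the libcouchbase timeout surfaces as a handled error instead of a fatal uncaught exception:

      ```php
      <?php
      // Hypothetical sketch: catch the timeout rather than letting it
      // bubble up as a fatal error. Constant names match the defines above.
      try {
          $couchbase = new Couchbase(
              COUCHBASE_SERVERS,
              COUCHBASE_USER,
              COUCHBASE_PASS,
              COUCHBASE_WEB_BUCKET
          );
          $html = $couchbase->get('page:/index'); // assumed cache key
      } catch (CouchbaseLibcouchbaseException $e) {
          // Reached when no node in the list can serve the request in time,
          // e.g. "Failed to get a value from server: Operation timed out".
          error_log('Couchbase unavailable: ' . $e->getMessage());
          $html = null; // fall back to regenerating the page
      }
      ```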

      When I unplugged the three network cables and refreshed, I get this error:

      Fatal error: Uncaught exception 'CouchbaseLibcouchbaseException' with message 'Failed to get a value from server: Operation timed out'

      The PHP extension documentation states "The connection obtains the cluster configuration from the first host to which it has connected.", which to me implies there is some sort of loop that goes through each node to find an active one. It doesn't even seem to be using the first node in the array: I moved the only active node to the top of the list and it still failed.


        Activity

        glambert Graeme Lambert added a comment -

        On a side note, using Nginx and the "ngx_http_enhanced_memcached_module" to retrieve the cached HTML from the bucket works fine with the nodes down; it still hits the remote virtual machine. It's only when the request goes through to PHP that it fails.

        ingenthr Matt Ingenthron added a comment -

        I think there must be a missing portion of the description here. If you've unplugged the servers and leave them unplugged, timeouts are to be expected. Can you update the description to indicate the specific steps?

        glambert Graeme Lambert added a comment -

        Hi Matt,

        Not sure if I can edit the original description?

        Nothing really is missing: there is an array of four nodes being passed to the PHP extension, and I've unplugged the network cables for the three in my office and left the remote one connected.

        I left the nodes disconnected from the network for a good half an hour and still had the fatal error returning, so I don't think it's a brief timeout issue.

        ingenthr Matt Ingenthron added a comment -

        I'm still missing it, as your latest comment indicates four nodes, but the initial description talks about three nodes. Perhaps you can put it in these terms...

        Scenario:
        Four node cluster. Autofailover enabled with the 30s default failure detection time. Three buckets, two couchbase type, one memcached type. Each Couchbase bucket is configured with one replica. One application server on a separate machine running blah-de-blah workload. Unplug three of four nodes and...

        Expected behavior:
        After 30s, autofailover kicks in and all is well.

        Observed behavior:
        Regular timeouts received at the application level from the client library while the three nodes are unplugged.


        Note that what I describe above isn't actually the way it works, as autofailover works only for one node, so multiple node failures would have the cluster take no action and thus you'd have timeouts. You can get back into a working state by administratively failing over the three nodes, but at the cost of lost data.

        glambert Graeme Lambert added a comment -

        Hi Matt,

        There are three nodes in the cluster, plus one node in a remote cloud in its own cluster, and I've set up XDCR for the three-node cluster to replicate onto the remote cloud cluster with the single node in it. I've got the XDCR working perfectly.

        I believe that you are supposed to be able to hit both datacentres concurrently with read/write requests and both will stay replicated.

        Therefore I want to make sure that when a datacentre goes down, the site will operate fine by connecting to the remote cloud cluster, and that when the three-node cluster comes back up it will start to receive the writes that the remote cluster took during that time.

        Please bear in mind this is by no means a production use case, but I'm trying to get as close to one as I can with servers in the office and remote cloud nodes for XDCR. I wouldn't have just a single node in that cluster, but for testing it out it's all I need.

        mschoch Marty Schoch added a comment -

        Saw this in IRC and decided to take a look. I believe the problem is that the Couchbase client is being initialized with the address of server nodes that actually belong to 2 different clusters. My understanding is that this is not supported. You would need to configure 2 separate Couchbase clients, each configured to talk to one of the clusters. Managing which interactions go to which cluster is then up to your application.
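
        What Marty describes could be sketched roughly as below. This is a hypothetical illustration, not the extension's documented behaviour: the server strings come from the environment above, while the `connectOrNull` helper and the fall-back logic are assumptions the application would own.

        ```php
        <?php
        // One client per cluster; the application decides which to use.
        $localServers  = 'couchbase1.office:8091;couchbase2.office:8091;couchbase3.office:8091';
        $remoteServers = 'remote.vnode.com:8091';

        // Hypothetical helper: return a connected client, or null if the
        // whole cluster is unreachable.
        function connectOrNull($servers)
        {
            try {
                return new Couchbase($servers, COUCHBASE_USER, COUCHBASE_PASS, COUCHBASE_WEB_BUCKET);
            } catch (CouchbaseLibcouchbaseException $e) {
                return null; // cluster down or unreachable
            }
        }

        // Prefer the local cluster; fall back to the XDCR remote if it is down.
        $client = connectOrNull($localServers) ?: connectOrNull($remoteServers);
        ```

        Note that routing which reads and writes go to which cluster, and reconciling them when the local cluster returns, remains the application's responsibility.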


          People

          • Assignee:
            ingenthr Matt Ingenthron
            Reporter:
            glambert Graeme Lambert
          • Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Gerrit Reviews

              There are no open Gerrit changes