  1. Couchbase C client library libcouchbase
  2. CCBC-64

PHP Client reports Time Out errors on simple SET and GET operations when Rebalance is in Progress

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.0.2
    • Fix Version/s: 1.0.4
    • Component/s: library
    • Security Level: Public
    • Labels:
    • Environment:
      CentOS 5.x or Ubuntu 11.x running Couchbase 1.8.0. 3-node cluster, plenty of available RAM and disk. PHP 5.3. libcouchbase version 1.0.3-1

      Description

      This bug was reported by a customer. Whenever they issue a rebalance (after removing a node from the cluster or adding a new node), client operations (PHP; sample code attached) time out with the error below:

      PHP Warning: Couchbase::set(): Failed to store a value to server: Operation timed out in /root/test_worker.php on line <n>

      This issue is consistent for both the customer and our in-house reproduction in Support. There are no errors on the server side. The rebalance does complete, and once it is complete, the same script runs without any errors.

      1. test_worker.php
        0.2 kB
        Hari Subramaniam
      2. report.pdf
        75 kB
        Sergey Avseyev

        Activity

        ingenthr Matt Ingenthron added a comment - - edited

        PHP Client Library Version installed?
        libcouchbase version installed?

        CentOS Version?

        Also needed is a description of what kind of rebalance is happening here. A "failover" click followed by a "rebalance" is different from a "remove node" followed by a "rebalance". Please outline the specific steps you used to reproduce.

        ingenthr Matt Ingenthron added a comment -

        I do see the PHP client version info, sorry I'd missed that. I do still need the libcouchbase version info though.

        Initial investigation shows this is likely an issue in libcouchbase, but we need more about the type of rebalance being done. Please comment.

        hari Hari Subramaniam (Inactive) added a comment -

        Updated Environment with OS versions and libcouchbase version.

        ingenthr Matt Ingenthron added a comment -

        After some investigation with Sergey, it seems that during the 2.5-second window we allow for retrying an operation after a not-my-vbucket response, we retry constantly, but the operation still times out, so we pass the timeout up to the PHP client.

        To verify this, Sergey set the timeout to 10 seconds and confirmed that the issue no longer appeared. That is abnormally high, though, and would indicate that something is wrong on the server side. There is no reason a vbucket should take more than 2.5 seconds to transfer from one node to another during rebalance.

        For another level of verification, we'll check that the client is trying both the node where the config states the vbucket is active and the node where the fast-forward (ffwd) map states the vbucket is going. If this shows libcouchbase is behaving correctly, then the problem is on the server side, not the client side. It's possible there's a bug in the server here that moxi would mask with its less sophisticated retry algorithm.

        It's equally possible that there is a libcouchbase configuration update problem. This next level of verification should tell us for sure.

        Note, PHP does not currently give the user the ability to raise the timeout to 10 seconds, but I don't think that's the right solution here. If a single vbucket transfer is taking longer than that, it needs to be addressed at the server side.
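
The retry window described in this comment can be illustrated with a small simulation. This is only a sketch under stated assumptions, not libcouchbase code: the names `retry_until_deadline`, `FakeClock`, `make_server`, `node_a`, and `node_b` are hypothetical, the 0.25-second retry delay is invented for the example, and the "server" is a stub that starts accepting the vbucket on the new node at a chosen time.

```python
import itertools

NOT_MY_VBUCKET = "NOT_MY_VBUCKET"
SUCCESS = "SUCCESS"
TIMED_OUT = "TIMED_OUT"


class FakeClock:
    """Deterministic clock so the simulation does not really sleep."""

    def __init__(self):
        self.t = 0.0

    def now(self):
        return self.t

    def sleep(self, seconds):
        self.t += seconds


def retry_until_deadline(send, candidates, timeout, clock, retry_delay=0.25):
    """Retry across candidate nodes until success or until the deadline passes.

    `candidates` stands in for the two places mentioned in the comment:
    the node the current config names and the node the ffwd map names.
    Returns (status, attempts).
    """
    deadline = clock.now() + timeout
    attempts = 0
    for node in itertools.cycle(candidates):
        if clock.now() >= deadline:
            return TIMED_OUT, attempts
        attempts += 1
        status = send(node)
        if status != NOT_MY_VBUCKET:
            return status, attempts
        clock.sleep(retry_delay)


def make_server(switch_time, clock, new_owner="node_b"):
    """Stub cluster: the vbucket is served by `new_owner` only from `switch_time` on."""

    def send(node):
        if node == new_owner and clock.now() >= switch_time:
            return SUCCESS
        return NOT_MY_VBUCKET

    return send


if __name__ == "__main__":
    for timeout in (2.5, 10.0):
        clock = FakeClock()
        send = make_server(switch_time=4.0, clock=clock)
        print(timeout, retry_until_deadline(send, ["node_a", "node_b"], timeout, clock))
```

In this stub, a handover that takes 4 seconds makes the 2.5-second window expire while a 10-second window succeeds, which is the behavior the experiment above was probing. The stub says nothing about why the handover appeared slow in the real cluster.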

        avsej Sergey Avseyev added a comment -

        I've analyzed the packets sent to the server and found that libcouchbase doesn't send the corrected packet to the network in time after a NOT_MY_VBUCKET error. It copies the packet into the internal output buffer, but the packet doesn't hit the network. With an increased timeout, though, it works. So the issue is in the libcouchbase layer and isn't PHP-specific. The complete rebalance log can be found here: http://files.avsej.net/add-node-rebalance.dump (~200M)
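
The symptom described here, a packet that sits in an output buffer without reaching the network, can be shown with a toy model. The ticket does not say what the actual defect or fix in libcouchbase was, so this is purely an illustrative assumption: `BufferedConnection`, its `wire` attribute, and the `schedule_write` flag are hypothetical names, and the "event loop" is a single method call.

```python
class BufferedConnection:
    """Toy connection: queued bytes reach the 'wire' only when a write is scheduled."""

    def __init__(self):
        self.output_buffer = bytearray()
        self.wire = bytearray()  # stands in for bytes actually written to the socket
        self.write_pending = False

    def queue(self, packet, schedule_write=True):
        # Copy the packet into the internal output buffer.
        self.output_buffer += packet
        # Without this flag the event loop has no reason to flush the buffer.
        if schedule_write:
            self.write_pending = True

    def run_event_loop_once(self):
        # Flush only if a write was requested.
        if self.write_pending:
            self.wire += self.output_buffer
            self.output_buffer.clear()
            self.write_pending = False


if __name__ == "__main__":
    conn = BufferedConnection()
    conn.queue(b"retry-packet", schedule_write=False)
    conn.run_event_loop_once()
    print(bytes(conn.output_buffer), bytes(conn.wire))
```

In this model a retried packet queued without a scheduled write stays in `output_buffer` until some later event triggers a flush, so a short timeout can expire first while a longer one does not.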

        ingenthr Matt Ingenthron added a comment -

        A fix has been produced and reviewed: http://review.couchbase.org/#change,15882

        I will send the fix to support so that either support or the customer can verify it, and will then include it in the next patch update.

        avsej Sergey Avseyev added a comment -

        The patch was merged

        hari Hari Subramaniam (Inactive) added a comment -

        Fix has been verified by the customer. Considering this bug as addressed.


          People

          • Assignee:
            jan Jan Lehnardt (Inactive)
            Reporter:
            hari Hari Subramaniam (Inactive)