Couchbase .NET client library
NCBC-257

During rebalance client tries to connect to the primary node only

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 1.2.6
    • Fix Version/s: 1.2.8
    • Component/s: library
    • Labels:
      None

      Description

      I'm adding this bug to track the performance issue raised in CBSE-521 and CBSE-528.

      It was observed during the sdkd scenario tests that, while a rebalance is happening, the client tries to connect only to the primary node and does not connect to the other (secondary) nodes in the cluster. During the rebalance the topology changes, and hence many errors occur, such as socket reset, no response received, and operation timeout.
      These errors go away when the rebalance is over; during the rebound phase, no errors are observed.
      Please see some sample reports:
      http://sdk-testresults.couchbase.com.s3.amazonaws.com/sdkd/HWIN-335SPEPOCGT-IHYBRID_fo-ept-rb-Sdotnet-1.2-release-T2013-04-02-00.11.35-LV_MC_BASIC.txt
      http://sdk-testresults.couchbase.com.s3.amazonaws.com/sdkd/HWIN-335SPEPOCGT-IHYBRID_rb-2-in-Sdotnet-1.2-release-T2013-04-02-00.21.03-LV_HTTP_BASIC.txt
      http://sdk-testresults.couchbase.com.s3.amazonaws.com/sdkd/HWIN-335SPEPOCGT-IHYBRID_fo-ept-eject-Sdotnet-1.2-release-T2013-04-02-00.17.30-LV_HTTP_BASIC.txt

      Mark - need your input here too: do you think these errors during rebalance can impact performance or stability at a customer site?


        Activity

        saakshi.manocha Saakshi Manocha added a comment -

        Also, as per the documentation and our understanding, we can expect errors during the CHANGE phase, and ideally they should go away in the REBOUND phase.

        CHANGE: Here we see that errors start happening. This is because a cluster topology change started around this time. We can expect errors until the topology change is completed. In this case, the topology change was adding a single node to the cluster.
        REBOUND: Here we see that the errors stop. This is because the topology change has been completed. Since we added an extra node to the cluster, the rate of operations has actually gone up from before; there are more nodes to handle requests now.

        ingenthr Matt Ingenthron added a comment -

        This appears to be a critical issue. Marking as blocker for 1.2.7 until we have a better understanding.

        john John Zablocki (Inactive) added a comment -

        When you say "connect to the primary node only", are you referring to the streaming connection, or are all ops going to the primary node?

        saakshi.manocha Saakshi Manocha added a comment -

        I ran the sdkd tests on a 4-node cluster. During the fail-over/rebalance phase, the client automatically considers one node to be the primary node, and throughout the logs the error is:
        System.IO.IOException: Failed to read from the socket '10.3.3.206:11210'. Error: SocketError value was Success, but 0 bytes were received

        It only tries to connect to the primary node and never tries to connect to the other nodes; once the primary node is up and the rebalance is over, the error rate slows down.
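        For context, this specific message usually points to a graceful close by the remote side rather than a transport fault: in .NET, Socket.Receive returns 0 bytes with SocketError.Success once the peer has shut down the connection, which is consistent with a node going away mid-rebalance. A minimal sketch of that semantic (the ReadOrThrow helper is hypothetical, not library code):

        using System.IO;
        using System.Net.Sockets;

        static class SocketReads
        {
            // Socket.Receive returns 0 (with no SocketError) when the remote
            // host has closed the connection gracefully, e.g. a node leaving
            // the cluster during rebalance; the client surfaces this as an
            // IOException like the one quoted above.
            public static int ReadOrThrow(Socket socket, byte[] buffer)
            {
                int read = socket.Receive(buffer, 0, buffer.Length, SocketFlags.None);
                if (read == 0)
                {
                    throw new IOException(
                        "Failed to read from the socket: 0 bytes were received, " +
                        "remote host closed the connection.");
                }
                return read;
            }
        }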

        mcatanzariti Michael Catanzariti added a comment -

        Any news about this one?
        We are experiencing the same issue when adding or removing a node into/from a cluster of 3 nodes during a load test (10,000 concurrent users).

        It seems that the client library returns null from CouchbaseClient.GetWithCas only during the rebalance operation (a few minutes), as if it could not find existing documents.
        Once the rebalance operation is over, the driver correctly returns the existing documents.
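        As an application-side stopgap, a null result during a known rebalance window can be treated as transient and retried with a short backoff. A minimal sketch, assuming the 1.2-era CasResult<object> return type of GetWithCas (the GetWithRetry helper and its retry parameters are ours, not SDK API):

        using System.Threading;
        using Couchbase;

        static class RebalanceWorkaround
        {
            // Hypothetical helper: retry GetWithCas with a short backoff so
            // that transient nulls during a rebalance are not mistaken for
            // genuinely missing documents.
            public static object GetWithRetry(
                CouchbaseClient client, string key, int attempts = 5, int delayMs = 200)
            {
                for (var i = 0; i < attempts; i++)
                {
                    var result = client.GetWithCas(key);
                    if (result.Result != null)
                        return result.Result;

                    Thread.Sleep(delayMs); // back off and let the topology settle
                }

                return null; // still null after retrying: treat as missing
            }
        }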

        mcatanzariti Michael Catanzariti added a comment -

        Hi, we just traced the library code, and it seems the problem occurs in the method CouchbaseClient.ExecuteWithRedirect, in the following section:

        if (iows.State == OperationState.InvalidVBucket)
        {
            // The vBucket map is stale; retry the operation against the
            // nodes the pool currently considers working.
            var nodes = this.Pool.GetWorkingNodes();

            foreach (var node in nodes)
            {
                opResult = node.Execute(op);
                // ...

        When the cluster is rebalancing, the nodes could all have been disposed, in which case the node.Execute method returns an error for ALL the nodes.

        Our further investigations lead us to think that there is a race condition between the disposal of nodes when the driver receives a new config from the cluster and the execution of requests by the client.
        Indeed, in the method CouchbasePool.ReconfigurePool, the statement "Interlocked.Exchange(ref this.state, state);" does not prevent the method CouchbaseClient.ExecuteWithRedirect from getting the old nodes.
        The nodes could still be the current ones at the statement "var nodes = this.Pool.GetWorkingNodes();", and one line later they could already be disposed by the listener thread.

        I hope this is clear enough.
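        To make the suspected interleaving concrete, here is a toy model (all names are simplified and hypothetical, not the actual library code): the listener thread publishes a new node set and disposes the old one while a client thread still holds a snapshot of the old array.

        using System;
        using System.Threading;

        class Node : IDisposable
        {
            private volatile bool disposed;

            public void Dispose()
            {
                disposed = true;
            }

            public void Execute()
            {
                if (disposed)
                    throw new ObjectDisposedException("Node");
                // ... perform the operation ...
            }
        }

        class Pool
        {
            private Node[] nodes = { new Node() };

            // Listener thread: atomically publish the new node set, then
            // dispose the old one. Nothing prevents a reader from still
            // holding the old array at this point.
            public void Reconfigure(Node[] newNodes)
            {
                var old = Interlocked.Exchange(ref this.nodes, newNodes);
                foreach (var n in old)
                    n.Dispose(); // a concurrent reader may be about to Execute()
            }

            // Client thread: the snapshot taken on the first line can be
            // disposed by Reconfigure before (or while) the loop runs,
            // which yields errors for ALL the nodes.
            public void ExecuteWithRedirect()
            {
                var snapshot = this.nodes;
                foreach (var node in snapshot)
                    node.Execute(); // may throw ObjectDisposedException
            }
        }

        Any fix presumably needs node disposal to wait for, or be tolerated by, in-flight readers of the old snapshot.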

        ingenthr Matt Ingenthron added a comment -

        Thanks for the investigation information, Michael; this should help us get to the bottom of it more quickly.

        jmorris Jeff Morris added a comment -

        I am looking into the issue, and I see a couple of places in the CouchbasePool class that are suspect. I'll dig deeper into this and follow up with a resolution ASAP.

        jmorris Jeff Morris added a comment -

        http://review.couchbase.org/#/c/29197/

          People

          • Assignee:
            jmorris Jeff Morris
          • Reporter:
            saakshi.manocha Saakshi Manocha
          • Votes:
            2
          • Watchers:
            7
