  1. Couchbase Java Client
  2. JCBC-148

Issue with Observe API Persist.TWO and 1 dead node: Time Out when doing set operation

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 1.1-dp4
    • Fix Version/s: 1.1-beta
    • Component/s: Core
    • Security Level: Public
    • Labels:
      None
    • Environment:
      2-node cluster on couchbase-server-community_x86_2.0.0-1947-rel
        1 node on Ubuntu (VM)
        1 node on OS X

      Bucket configured with 1 replica

      Description

      I have a very simple Java program that connects to the 2 nodes and does a set with the following code:

      1. First I connect to multiple nodes

                  List<URI> couchbaseServerUris = new ArrayList<URI>();
                  couchbaseServerUris.add( new URI("http://192.168.0.108:8091/pools") );
                  couchbaseServerUris.add( new URI("http://192.168.0.104:8091/pools") );
                  CouchbaseClient client = new CouchbaseClient( couchbaseServerUris , "default" , "" );
      

      2. Then I call the set operation

       
              OperationFuture<Boolean> stored = client.set( "my-dummy-key",0, "{\"name\" : \"foo\", \"title\" : \"bar-test\"}", PersistTo.TWO);
      


      Everything works as expected when both nodes are up.

      When I kill 1 node (for example by disconnecting, stopping, or pausing the Ubuntu VM), I see the following behavior when I execute this program:

      1- I get an exception saying that 1 node is down. This is expected behavior (although the long stack trace could be avoided).

      2012-11-18 08:14:55.830 WARN com.couchbase.client.vbucket.ConfigurationProviderHTTP:  Connection problems with URI http://192.168.0.108:8091/pools ...skipping
      java.net.ConnectException: Host is down
      

      2- When I do the set, the program blocks until it reaches a network timeout:
      2012-11-18 08:20:13.462 INFO com.couchbase.client.CouchbaseConnection: Shut down Couchbase client

      Error while storing : Observe Timeout - Polled Unsuccessfully for at least 40 seconds.
      2012-11-18 08:20:13.466 INFO 		 done : true
      		 done : {OperationStatus success=false:  Observe Timeout - Polled Unsuccessfully for at least 40 seconds.}
      com.couchbase.client.ViewNode:  Couchbase I/O reactor terminated
      2012-11-18 08:20:13.467 INFO com.couchbase.client.ViewNode:  Couchbase I/O reactor terminated
      

      Note that this only happens with PersistTo.TWO.
      If I use PersistTo.MASTER or PersistTo.ONE, the program runs with no error and without blocking.
      If I use PersistTo.THREE (or more), the program runs without blocking and prints the expected observe message: "Error while storing : Requested persistence to 3 node(s), but only 2 are available."


        Activity

        tgrall Tug Grall (Inactive) created issue -
        tgrall Tug Grall (Inactive) made changes -
        Field Original Value New Value
        Attachment CouchbaseSamples.zip [ 15836 ]
        tgrall Tug Grall (Inactive) made changes -
        Comment [ Sample program ]
        tgrall Tug Grall (Inactive) added a comment -

        Sample program

        tgrall Tug Grall (Inactive) made changes -
        Attachment CouchbaseSamples.zip [ 15837 ]
        tgrall Tug Grall (Inactive) made changes -
        Attachment CouchbaseSamples.zip [ 15837 ]
        tgrall Tug Grall (Inactive) made changes -
        Attachment CouchbaseSamples.zip [ 15836 ]
        tgrall Tug Grall (Inactive) made changes -
        Attachment CouchbaseSamples.zip [ 15838 ]
        ingenthr Matt Ingenthron added a comment -

        I do believe that's actually expected behavior, but let's talk through it to get your opinion.

        We have a couple of options in the event of an unexpected failure. One is that we try our hardest to complete the requested operation and rely on timeouts to keep from blocking forever. The second is that we keep tabs on our connections, and if a connection is down, we fail operations immediately so that the application code is not left waiting for something that may or may not succeed.

        Had you gone in and removed the second node (click 'remove' and 'rebalance'), then the client should have done something similar to when you requested three nodes. The failure you describe above is unexpected. Further, the client library doesn't really know if it's temporary or permanent.

        Finally, I do want to note, and I think this is well documented, that many operations built on the Observe protocol end in timeouts. This is not the only one. Generally speaking, application code should be ready to do something in the case of a timeout.
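
        As an illustration of application code that is "ready to do something" on a timeout, here is a minimal sketch. It is not the client's API: the class TimeoutHandling, the Outcome enum, and the await helper are hypothetical names, and a plain java.util.concurrent.Future stands in for the client's OperationFuture. The output in the description suggests this client version reports the observe timeout through an unsuccessful operation status rather than an exception, so a real application would probably also need to inspect that status.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not part of the Couchbase client.
final class TimeoutHandling {

    public enum Outcome { STORED, FAILED, TIMED_OUT }

    // Waits for the result for at most the given time and classifies what happened,
    // so the caller can decide whether to retry, degrade, or report an error.
    public static Outcome await(Future<Boolean> future, long timeout, TimeUnit unit) {
        try {
            Boolean ok = future.get(timeout, unit);
            return Boolean.TRUE.equals(ok) ? Outcome.STORED : Outcome.FAILED;
        } catch (TimeoutException e) {
            future.cancel(true); // stop waiting; the write may or may not have landed
            return Outcome.TIMED_OUT;
        } catch (ExecutionException e) {
            return Outcome.FAILED;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            return Outcome.FAILED;
        }
    }

    public static void main(String[] args) {
        // A future that has already completed successfully.
        System.out.println(await(CompletableFuture.completedFuture(true), 1, TimeUnit.SECONDS));
        // A future that never completes, standing in for a write to a dead node.
        System.out.println(await(new CompletableFuture<Boolean>(), 50, TimeUnit.MILLISECONDS));
    }
}
```

        The point of returning a three-way outcome is that a timeout is not the same as a failure: the set may have gone through on the master even though the requested persistence was never confirmed.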

        ingenthr Matt Ingenthron added a comment -

        Tug explained this further. The PersistTo.THREE check must be happening after some operations have already been performed, which is a bit late considering this operation can never succeed. The failure should be the same with a cluster that has a down node as it is with a cluster that just doesn't have a primary and two replica locations.
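
        A fail-fast check of the kind discussed here could look roughly like the sketch below. Everything in it is an assumption for illustration: PersistPrecheck, maxPersistable, and check are hypothetical names, and the model is simplified to "one master plus one copy per configured replica, capped by the number of reachable nodes", ignoring the per-key vbucket mapping a real client would consult. The error text mirrors the message quoted in the description.

```java
// Hypothetical pre-check, not the client's actual implementation.
final class PersistPrecheck {

    // Upper bound on how many nodes can persist a key: the master plus each
    // configured replica, but never more than the nodes that are reachable.
    public static int maxPersistable(int reachableNodes, int configuredReplicas) {
        if (reachableNodes <= 0) {
            return 0;
        }
        return Math.min(reachableNodes, configuredReplicas + 1);
    }

    // Rejects the request before any set is sent if it can never be satisfied.
    public static void check(int requested, int reachableNodes, int configuredReplicas) {
        int available = maxPersistable(reachableNodes, configuredReplicas);
        if (requested > available) {
            throw new IllegalStateException("Requested persistence to " + requested
                    + " node(s), but only " + available + " are available.");
        }
    }

    public static void main(String[] args) {
        check(2, 2, 1); // both nodes up, 1 replica: PersistTo.TWO is satisfiable
        try {
            check(2, 1, 1); // one node down: PersistTo.TWO cannot succeed
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

        With a check like this run before the set, a down node and a cluster that is simply too small would produce the same immediate failure instead of a timeout.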

        ingenthr Matt Ingenthron made changes -
        Priority Major [ 3 ] Critical [ 2 ]
        mikew Mike Wiederhold added a comment -

        The way Rags wrote this code originally was to do the set and then the observe. The observe part is the part that does all of the checking, so the set will actually go through and then you will get the error. Similarly, there is no checking for downed nodes, and I don't think we actually have the ability to do this at the moment, but I may be wrong.

        On another note, one other thing I think is wrong is that all of the observe functions return an OperationFuture even though they aren't actually asynchronous.
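
        To make the "set, then observe" flow concrete, here is a rough sketch of a blocking poll loop with a deadline. It is not the client's code: ObservePollSketch and pollUntilPersisted are hypothetical names, and the IntSupplier stands in for whatever reports how many nodes have persisted the key. It shows why a dead replica turns into a timeout when nothing checks node availability up front.

```java
import java.util.function.IntSupplier;

// Hypothetical model of a blocking observe loop, not the client's implementation.
final class ObservePollSketch {

    // Polls until the persisted count reaches the required number or the
    // deadline passes. Returns true on success, false on timeout.
    public static boolean pollUntilPersisted(IntSupplier persistedCount, int required,
                                             long timeoutMillis, long intervalMillis)
            throws InterruptedException {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            if (persistedCount.getAsInt() >= required) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false; // the caller sees this as an observe timeout
            }
            Thread.sleep(intervalMillis);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // Replica is down: only the master ever persists, so the loop times out.
        System.out.println(pollUntilPersisted(() -> 1, 2, 200, 20));

        // Replica catches up on the third poll, so the loop succeeds.
        int[] polls = {0};
        System.out.println(pollUntilPersisted(() -> ++polls[0] >= 3 ? 2 : 1, 2, 1000, 10));
    }
}
```

        Because the loop blocks the calling thread until it finishes, wrapping its result in a future does not make the call asynchronous, which is the concern raised in this comment.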

        daschl Michael Nitschinger added a comment - http://review.couchbase.org/#/c/22936/
        daschl Michael Nitschinger made changes -
        Status Open [ 1 ] In Progress [ 3 ]
        daschl Michael Nitschinger added a comment -

        Fixed; this will be available in the beta release.

        daschl Michael Nitschinger made changes -
        Status In Progress [ 3 ] Resolved [ 5 ]
        Fix Version/s 1.1beta [ 10370 ]
        Resolution Fixed [ 1 ]
        ingenthr Matt Ingenthron made changes -
        Workflow jira [ 21820 ] Couchbase SDK Workflow [ 38438 ]

          People

          • Assignee:
            daschl Michael Nitschinger
            Reporter:
            tgrall Tug Grall (Inactive)
