Couchbase Java Client / JCBC-207

Incorrect logic in reconnection threshold leads to never actually reconnecting

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.1.0
    • Component/s: None
    • Security Level: Public
    • Labels:
      None

      Description

      In the CouchbaseConnectionFactory, the pastReconnThreshold() method doesn't check the threshold time correctly: it mixes milliseconds with nanoseconds.
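
      To illustrate the unit problem, here is a minimal sketch of a threshold check that keeps everything in nanoseconds. It is not the actual CouchbaseConnectionFactory code; the class name ReconnectThreshold and its methods are hypothetical, and the caller is assumed to pass in readings from System.nanoTime().

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch; not the real CouchbaseConnectionFactory logic. */
final class ReconnectThreshold {
    private final long thresholdNanos;
    private final long windowStartNanos;

    ReconnectThreshold(long threshold, TimeUnit unit, long nowNanos) {
        // Convert once, up front, so every later comparison is nanos vs. nanos.
        this.thresholdNanos = unit.toNanos(threshold);
        this.windowStartNanos = nowNanos;
    }

    /** True once at least the threshold has elapsed since the window started. */
    boolean pastThreshold(long nowNanos) {
        // nanoTime readings are only meaningful as differences, so subtract first.
        return nowNanos - windowStartNanos >= thresholdNanos;
    }
}
```

      The same number means very different things in the two units: 10 seconds is 10,000 milliseconds but 10,000,000,000 nanoseconds, so comparing one against the other is off by a factor of a million.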


            Activity

            ingenthr Matt Ingenthron added a comment -

            This is a regression of JCBC-19.

            ingenthr Matt Ingenthron added a comment -

            It turns out this is not a regression. The way the test is being carried out is different in this case.

            So, I worked out why this Java failover isn't working: it's related to using kill -STOP.

            Here's the current behavior:
            there's a per-node continuous operation timeout threshold
            after a given node times out a bunch, the client will drop the connection to that node
            then it'll try to reestablish it
            meanwhile, there's another counter for how often we can't find an established connection to a node the config says we should be using
            for that second one, the algorithm is: 10 failures to find the node in a 10 second window means re-bootstrap
            so, the problem...
            is that when we kill -STOP (instead of an actual cable pull)
            you can still establish new connections to port 11210
            so, we drop and reestablish, send a bunch of stuff, then drop and reestablish quickly

            this algorithm, which I'd tested with actual cable pulls, works with actual cable pulls, but it won't work (without big changes) in the SIGSTOP case
            because we consider the connection "good" at the time it is established, not at the time of sending data
            maybe that's incorrect to do
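
            The "10 failures in a 10 second window" rule described above could be sketched roughly as follows. This is an illustration only, not the client's implementation: the class name MissingNodeWindow is hypothetical, and a sliding window is an assumption, since the comment does not say whether the window slides or is fixed.

```java
import java.util.ArrayDeque;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch of a "N failures within a window" trigger. */
final class MissingNodeWindow {
    private final int maxFailures;
    private final long windowNanos;
    // Timestamps (nanoseconds) of recent failures, oldest first.
    private final ArrayDeque<Long> failures = new ArrayDeque<>();

    MissingNodeWindow(int maxFailures, long window, TimeUnit unit) {
        this.maxFailures = maxFailures;
        this.windowNanos = unit.toNanos(window);
    }

    /** Records one failure; returns true when a re-bootstrap should be triggered. */
    boolean recordFailure(long nowNanos) {
        // Forget failures that have fallen out of the window.
        while (!failures.isEmpty() && nowNanos - failures.peekFirst() > windowNanos) {
            failures.pollFirst();
        }
        failures.addLast(nowNanos);
        return failures.size() >= maxFailures;
    }
}
```

            With new MissingNodeWindow(10, 10, TimeUnit.SECONDS), ten failures one second apart trigger on the tenth, while failures two seconds apart never do. A rule of this kind only fires while failures keep arriving inside the window.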

            ingenthr Matt Ingenthron added a comment -

            I think I've worked out an approach with Mark Nunberg's help.

            We'll need to change spymemcached to verify the connection is actually good with a noop before calling it good. If that check fails, the connection goes back to be reconnected. We may need backoff for this as well.
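
            A rough sketch of that idea, under stated assumptions: it does not use the spymemcached API. The class VerifiedReconnect, the connect and noop callbacks, and the exponential backoff with a cap are all hypothetical stand-ins chosen for illustration.

```java
import java.util.function.BooleanSupplier;
import java.util.function.LongConsumer;

/** Hypothetical sketch: only call a connection good after a noop round trip. */
final class VerifiedReconnect {
    /** Exponential backoff: base, 2x base, 4x base, ... capped at maxMillis. */
    static long backoffMillis(int attempt, long baseMillis, long maxMillis) {
        long delay = baseMillis << Math.min(attempt, 20); // bound the shift
        return Math.min(delay, maxMillis);
    }

    /**
     * Tries to connect and then verify with a noop. Returns the zero-based
     * attempt that succeeded, or -1 if every attempt failed. The sleeper is
     * injected (for example a wrapper around Thread.sleep) so the waiting
     * strategy stays out of this logic.
     */
    static int connectVerified(BooleanSupplier connect, BooleanSupplier noop,
                               int maxAttempts, long baseMillis, long maxMillis,
                               LongConsumer sleeper) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            // A successful connect alone is not enough; the noop must also answer.
            if (connect.getAsBoolean() && noop.getAsBoolean()) {
                return attempt;
            }
            sleeper.accept(backoffMillis(attempt, baseMillis, maxMillis));
        }
        return -1;
    }
}
```

            In the kill -STOP case, connect would keep succeeding while the noop gets no answer, so the connection would never be treated as good and each retry would wait longer than the last.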

            daschl Michael Nitschinger added a comment -

            Just as a note, the changesets I've pushed were tested against "freezing" a VM.


              People

              • Assignee: ingenthr Matt Ingenthron
              • Reporter: ingenthr Matt Ingenthron