  1. Couchbase Java Client
  2. JCBC-207

incorrect logic in reconnection threshold leads to never actually reconnecting

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.1.0
    • Component/s: None
    • Security Level: Public
    • Labels:
      None

      Description

      In CouchbaseConnectionFactory, the pastReconnThreshold() method doesn't check the threshold time correctly: it mixes milliseconds with nanoseconds.
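      The actual method body is not shown in this issue, so the following is only a minimal sketch of a threshold check that keeps every value in one unit. The class name ReconnectThreshold, its fields, and its method signatures are hypothetical; only the name pastReconnThreshold comes from the description above.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical sketch, NOT the real CouchbaseConnectionFactory code.
// All timestamps and the threshold are held in nanoseconds so that
// no comparison ever mixes units.
class ReconnectThreshold {
    private final long thresholdNanos;
    private long firstFailureNanos;
    private boolean tracking = false;

    ReconnectThreshold(long threshold, TimeUnit unit) {
        // Convert once, at the boundary, into the single internal unit.
        this.thresholdNanos = unit.toNanos(threshold);
    }

    // Remember the time of the first failure in the current streak.
    // nowNanos is expected to come from System.nanoTime().
    void recordFailure(long nowNanos) {
        if (!tracking) {
            firstFailureNanos = nowNanos;
            tracking = true;
        }
    }

    // Forget the current streak, for example after a successful operation.
    void reset() {
        tracking = false;
    }

    // True once the failure streak has lasted at least the threshold.
    // Subtracting two nanoTime readings is safe even if they are negative.
    boolean pastReconnThreshold(long nowNanos) {
        return tracking && (nowNanos - firstFailureNanos) >= thresholdNanos;
    }
}
```

      Converting the configured value with TimeUnit at construction time is one way to make a unit mismatch hard to reintroduce later.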

          Activity

          daschl Michael Nitschinger added a comment -

          Just as a note, the changesets I've pushed were tested against "freezing" a VM.

          ingenthr Matt Ingenthron added a comment -

          I think I've worked out an approach with Mark Nunberg's help.

          We'll need to change spymemcached to verify the connection is actually good with a noop before calling it good. If it fails that, it'll go back to be reconnected. We may need backoff for this as well.
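The approach described above (probe a fresh connection with a noop before treating it as good, and back off between attempts) could be sketched roughly as follows. This is not spymemcached code: ConnectionVerifier, verifyWithBackoff, and the BooleanSupplier probe are hypothetical stand-ins, and the probe is assumed to return true only when the server answers the noop.

```java
import java.util.function.BooleanSupplier;

// Hypothetical sketch, NOT the spymemcached implementation.
class ConnectionVerifier {
    // Runs the noop probe up to maxAttempts times, doubling the delay
    // between attempts up to maxDelayMillis.
    // Returns the attempt number that succeeded, or -1 if none did,
    // in which case the caller would queue the node for reconnection.
    static int verifyWithBackoff(BooleanSupplier noopProbe,
                                 int maxAttempts,
                                 long initialDelayMillis,
                                 long maxDelayMillis) {
        long delay = initialDelayMillis;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (noopProbe.getAsBoolean()) {
                return attempt; // only now is the connection considered good
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(delay);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // preserve interrupt status
                    return -1;
                }
                delay = Math.min(delay * 2, maxDelayMillis);
            }
        }
        return -1;
    }
}
```

A real client would more likely schedule the retries on its I/O thread than sleep; the blocking loop here only keeps the sketch short.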

          ingenthr Matt Ingenthron added a comment -

          It turns out this is not a regression. The way the test is being carried out is different in this case.

          So, I worked out why this Java failover isn't working. It's related to using kill -STOP.

          Here's the current behavior:
          • There's a per-node continuous operation timeout threshold.
          • After a given node times out repeatedly, the client drops the connection to that node, then tries to reestablish it.
          • Meanwhile, there's another counter for how often we can't find an established connection to a node the config says we should be using.
          • For that second counter, the algorithm is: 10 failures to find the node in a 10 second window means re-bootstrap.

          The problem is that when we kill -STOP (instead of an actual cable pull), you can still establish new connections to port 11210. So we drop and reestablish, send a bunch of operations, then quickly drop and reestablish again.

          This algorithm, which I'd tested with actual cable pulls, works in that case, but it won't work (without big changes) in the SIGSTOP case, because we consider the connection "good" at the time it is established, not at the time of sending data. Maybe that's incorrect to do.
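
The "10 failures in a 10 second window" rule from the comment above can be illustrated with a small sliding-window counter. This is a sketch under assumptions, not the client's code: MissingNodeWindow and recordFailure are hypothetical names, and clearing the window after it trips is a design choice made here for simplicity.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch, NOT the Couchbase client implementation.
class MissingNodeWindow {
    private final int maxFailures;
    private final long windowNanos;
    private final Deque<Long> failures = new ArrayDeque<>();

    MissingNodeWindow(int maxFailures, long windowNanos) {
        this.maxFailures = maxFailures;
        this.windowNanos = windowNanos;
    }

    // Records one "no established connection for this node" failure.
    // Returns true when the number of failures inside the window reaches
    // maxFailures, which is the signal to re-bootstrap.
    boolean recordFailure(long nowNanos) {
        failures.addLast(nowNanos);
        // Drop failures that have fallen out of the window.
        while (!failures.isEmpty() && nowNanos - failures.peekFirst() > windowNanos) {
            failures.removeFirst();
        }
        if (failures.size() >= maxFailures) {
            failures.clear(); // start counting afresh after tripping
            return true;
        }
        return false;
    }
}
```

With a SIGSTOPped node the connection can always be reestablished, so a counter like this would rarely see a failure to record.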

          ingenthr Matt Ingenthron added a comment -

          This is a regression of JCBC-19.


            People

            • Assignee:
              ingenthr Matt Ingenthron
              Reporter:
              ingenthr Matt Ingenthron

              Dates

              • Created:
                Updated:
                Resolved:

                Gerrit Reviews

                There are no open Gerrit changes