Java Couchbase JVM Core / JVMCBC-435

Issue with number of Java client connections increasing rapidly after failover on single node in cluster


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.5.0, 1.4.8
    • Component/s: Core
    • Labels: None
    • Environment: Client version: 2.4.6 (Java), Server version: 4.5.0

    Description

      I'm creating this bug from the following Couchbase forum post, which has yet to receive any real attention:

      https://forums.couchbase.com/t/issue-with-number-of-java-client-connections-increasing-rapidly-after-fail-over-on-single-node-in-cluster/13133

      This seems like a race condition somewhere in the code that handles connecting/reconnecting to nodes. This is currently a blocker for us, and we need a solution, or at least confirmation that this is indeed an issue.

      Attachments


        Activity

          ingenthr Matt Ingenthron added a comment - I looked a bit deeper, but couldn't find anything. Michael Nitschinger, can you have a look?
          daschl Michael Nitschinger added a comment (edited):

          Upon initial investigation of the posted server.log, authentication exceptions are very prevalent (appearing 1099 times); this might be the reason for the many reconnects and the increasing number of connections. I'll try to reproduce it locally on virtual machines and report back what I find.

          Yes, both from the log and from a reproduced sample I could see that the auth errors come from the SDK trying to connect/reconnect to the node that has been failed over and is no longer part of the cluster at that point. Interestingly, as mentioned, the auth errors start when the rebalance starts, not right after the hard failover.

          daschl Michael Nitschinger added a comment:

          I think I know what's going on: this is a regression caused by https://issues.couchbase.com/browse/JVMCBC-415, which was merged for 1.4.5. So going back to 1.4.4 jvm-core (2.4.4 java-client) should be a temporary workaround; I'll have the proper fix for 2.5.0.

          The carrier refresher poll flooring was "global" rather than tracked on a per-bucket basis. As a consequence, when N bucket refreshes come in (one for each bucket), only the first ever goes through, leaving the others behind with stale configs. This in turn causes the "old" node to hang around longer than it should, leading to reconnect attempts and subsequent auth failures (and repeatedly opened sockets). The sockets should close at some point, but of course it's not intended for them to be opened like this in the first place.

          I think I'll fix this by introducing a poll floor on a per-bucket basis; then we should be good.
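          For illustration, here is a minimal, hypothetical sketch of the flooring change described in the comment above. The class, method names, and floor value are assumptions for illustration only and do not reflect the actual core-io CarrierRefresher internals: a single shared timestamp lets the first bucket's poll suppress all others within the floor window, while a per-bucket floor lets each bucket refresh its own config independently.

            import java.util.Map;
            import java.util.concurrent.ConcurrentHashMap;

            // Hypothetical sketch only; names and the floor value are assumptions,
            // not the real CarrierRefresher implementation.
            final class PollFloorSketch {

                private static final long FLOOR_MILLIS = 10_000; // assumed minimum interval between polls

                // Pre-fix behaviour as described: one shared timestamp for all buckets,
                // so the first bucket's poll suppresses every other bucket's refresh.
                private volatile long lastGlobalPoll;

                boolean allowedGlobal(long now) {
                    if (now - lastGlobalPoll < FLOOR_MILLIS) {
                        return false; // remaining buckets are skipped and keep stale configs
                    }
                    lastGlobalPoll = now;
                    return true;
                }

                // Fixed behaviour: the floor is tracked independently per bucket,
                // so a refresh for bucket A never starves a refresh for bucket B.
                private final Map<String, Long> lastPollPerBucket = new ConcurrentHashMap<>();

                boolean allowedPerBucket(String bucket, long now) {
                    Long last = lastPollPerBucket.get(bucket);
                    if (last != null && now - last < FLOOR_MILLIS) {
                        return false;
                    }
                    lastPollPerBucket.put(bucket, now);
                    return true;
                }
            }

          Under the global floor, refreshes triggered for buckets A, B, and C within the same window only update A; under the per-bucket floor, each bucket's refresh is evaluated against its own last poll time.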
          daschl Michael Nitschinger added a comment - http://review.couchbase.org/#/c/82917

          daschl Michael Nitschinger added a comment - Fixed on master and will be available in 2.5.0 - thanks for reporting it, and sorry for taking a bit longer than expected!

          People

            Assignee: daschl Michael Nitschinger
            Reporter: cbax007
            Votes: 0
            Watchers: 3


