  1. Java Couchbase JVM Core
  2. JVMCBC-534

PooledService creates excessive endpoints on sending to downed node

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.4.2, 1.5.5, 1.5.6, 1.5.7, 1.5.8
    • Fix Version/s: 1.6.0, 1.5.9
    • Component/s: Core
    • Labels: None
    • 1

    Description

      I found that when QueryService tries to send a query to a server that has been shut down, a huge number of endpoints are created, saturating system resources.

      In detail, here's what happens:

      The N1QL query enters into QueryService::send.

      QueryService::send checks its 'endpoints'.size() + pendingRequests, both of which are 0, so it decides to open an endpoint and calls maybeOpenAndSend.

      maybeOpenAndSend creates an endpoint, but doesn't add it to 'endpoints'; it only does that once the endpoint has connected successfully. It also increments pendingRequests, which is what keeps track of endpoints that are not yet connected.

      The created endpoint then times out after 32 msecs, as the node is down. In AbstractEndpoint::doConnect it logs "Could not connect to remote socket", sets the state to disconnected, and calls the observable's onError. It then goes on to try again with an exponential backoff. So each individual endpoint works as expected.
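
      As a rough illustration of the per-endpoint retry behaviour described above, the sketch below computes retry delays that double on each attempt up to a cap. The class and method names (BackoffSketch, delayMillis) and the 4096 ms cap are assumptions made for this example; the ticket only mentions the 32 msec figure and does not show the library's actual delay code.

```java
/** Illustrative only: a generic capped exponential delay, not the core-io implementation. */
final class BackoffSketch {

    /**
     * Returns the delay before the given retry attempt (1-based),
     * doubling from baseMillis and never exceeding maxMillis.
     */
    static long delayMillis(int attempt, long baseMillis, long maxMillis) {
        if (attempt < 1) {
            throw new IllegalArgumentException("attempt must be >= 1");
        }
        // Clamp the shift so that small base values cannot overflow a long.
        int shift = Math.min(attempt - 1, 30);
        return Math.min(maxMillis, baseMillis * (1L << shift));
    }

    public static void main(String[] args) {
        // Hypothetical parameters: 32 ms first delay, 4096 ms cap.
        for (int attempt = 1; attempt <= 10; attempt++) {
            System.out.println("attempt " + attempt + " -> "
                    + delayMillis(attempt, 32, 4096) + " ms");
        }
    }
}
```

      The point relevant to this bug is that an endpoint following such a schedule keeps itself alive between attempts, whether or not the service that created it still counts it.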

      The trouble is, the endpoint's onError is set up to call QueryService::unsubscribeAndRetry.  This decrements pendingRequests back to 0.  So now we have a problem where QueryService thinks it has no endpoints in progress - but in fact the endpoint still exists and is still trying to make the request.

      The N1QL query comes into QueryService again and we go through the loop once more.  So we end up spawning many endpoints and not tracking them in either 'endpoints' or pendingRequests.

      Eventually another 300-second timeout fires, which stops the endpoints from firing indefinitely and cleans up the existing endpoints.
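
      The bookkeeping problem walked through above can be reproduced with a toy model. This is a minimal sketch, not the real QueryService or AbstractEndpoint code: the class LeakyServiceModel, the FakeEndpoint type, the allCreated list (which exists only so the leak can be observed) and the maxEndpoints limit of 1 are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the bookkeeping in the description; not the real core-io classes. */
final class LeakyServiceModel {

    /** Hypothetical stand-in for an endpoint that retries its own connect. */
    static final class FakeEndpoint {
        boolean stillRetrying;
    }

    /** Connected endpoints only, as in the description. */
    final List<FakeEndpoint> endpoints = new ArrayList<>();
    /** Endpoints that were opened but have not connected yet. */
    int pendingRequests = 0;
    /** Observation aid for this sketch only: every endpoint ever created. */
    final List<FakeEndpoint> allCreated = new ArrayList<>();
    /** Assumed pool limit; the ticket does not state the real value. */
    final int maxEndpoints;

    LeakyServiceModel(int maxEndpoints) {
        this.maxEndpoints = maxEndpoints;
    }

    /** Models send(): open a new endpoint if the service believes it has room. */
    void send() {
        if (endpoints.size() + pendingRequests < maxEndpoints) {
            maybeOpenAndSend();
        }
    }

    /** Models maybeOpenAndSend() against a node that is down. */
    private void maybeOpenAndSend() {
        FakeEndpoint endpoint = new FakeEndpoint();
        allCreated.add(endpoint);
        pendingRequests++;
        // The connect attempt fails; the endpoint schedules its own retry...
        endpoint.stillRetrying = true;
        // ...and its onError path decrements the counter anyway.
        unsubscribeAndRetry();
    }

    /** Models the decrement performed by unsubscribeAndRetry. */
    private void unsubscribeAndRetry() {
        pendingRequests--;
    }

    public static void main(String[] args) {
        LeakyServiceModel service = new LeakyServiceModel(1);
        for (int i = 0; i < 100; i++) {
            service.send(); // the same query coming back into the service
        }
        System.out.println("created=" + service.allCreated.size()
                + " tracked=" + (service.endpoints.size() + service.pendingRequests));
    }
}
```

      Running main prints created=100 tracked=0: every pass through send() sees a count of zero and opens another endpoint, while none of the still-retrying endpoints is visible in either 'endpoints' or pendingRequests.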

      From code inspection, it appears 1.4.2 through 1.5.8 (current as of this writing) are affected.

        Activity

          graham.pople Graham Pople added a comment -

          I have a fix for the issue, passing all automated tests plus manual testing. I'll take another look at it with a fresh brain next Tuesday (Monday is a national holiday) before submitting it for code review, as this is in a delicate and crucial code path.

          graham.pople Graham Pople added a comment -

          Changes in for review now.

          graham.pople Graham Pople added a comment -

          Changes merged.


          ingenthr Matt Ingenthron added a comment -

          Missing fix version.

          ingenthr Matt Ingenthron added a comment -

          As I commented when I reopened it, it didn't have a fix version.

          daschl Michael Nitschinger added a comment -

          Ah ok, then I'm closing this now - thanks!

          People

            daschl Michael Nitschinger
            graham.pople Graham Pople
