Couchbase Java Client / JCBC-276

Client does not detect silently dying Streaming Node

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Won't Fix
    • Affects Version/s: 1.1.0, 1.1.1, 1.1.2, 1.1.3, 1.1.4
    • Fix Version/s: 1.1.7
    • Component/s: Core
    • Security Level: Public
    • Labels:
      None

      Description

      When connected to the EPT/streaming node and the node is "frozen" or dies silently otherwise (doesn't force the closing of the chunked socket), the connection stays established.

      This can easily be reproduced outside of the client by connecting the browser to the streaming URL and then freezing a VM. The browser will still "spin" and wait for new chunks to come up.

      The proposed solution is to have a netty handler in place that raises an exception when there is no traffic over the streaming connection for N seconds (for example, 30). After this is detected, we have two possibilities:

      • reconnect completely, but this involves a lot of overhead every 30 seconds.
      • send an HTTP HEAD request and only reconnect if it doesn't succeed. In the normal case this means just one HTTP HEAD request every 30 seconds, which is not much overhead.
        If the HEAD request fails, we then trigger the reconfigure.

      Netty has a ReadTimeoutHandler to help with this. My POC already kinda works, I just need to find a way to properly distinguish the HEAD response on the ResponseHandler from regular chunks that arrive from the same channel.
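The idle-detection logic described above can be sketched without any networking code. This is only a rough illustration, not the client's actual implementation: the class `StreamIdleWatchdog`, its `Action` values and the caller-supplied clock are hypothetical names, and the sketch deliberately does not use Netty's `ReadTimeoutHandler`. It assumes the caller reports every read on the streaming channel and polls the watchdog periodically. After one idle window it asks for a probe (the HEAD request from the second option), and after a second idle window with no read it asks for a reconnect.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch: decides when an idle streaming connection should be probed or replaced. */
final class StreamIdleWatchdog {
    enum Action { NONE, PROBE, RECONNECT }

    private final long idleNanos;   // the "N seconds" idle window
    private long lastReadNanos;     // time of the last read, or of the last probe request
    private boolean probePending;   // true while a probe is still unanswered

    StreamIdleWatchdog(long idleTimeout, TimeUnit unit, long nowNanos) {
        this.idleNanos = unit.toNanos(idleTimeout);
        this.lastReadNanos = nowNanos;
    }

    /** Call whenever any data (a regular chunk or a probe response) is read. */
    void onRead(long nowNanos) {
        lastReadNanos = nowNanos;
        probePending = false;
    }

    /** Call periodically; tells the caller what to do next. */
    Action check(long nowNanos) {
        if (nowNanos - lastReadNanos < idleNanos) {
            return Action.NONE;
        }
        if (!probePending) {
            // First idle window elapsed: ask for a cheap probe and wait one more window.
            probePending = true;
            lastReadNanos = nowNanos;
            return Action.PROBE;
        }
        // The probe went unanswered for a full window: give up on this connection.
        probePending = false;
        lastReadNanos = nowNanos;
        return Action.RECONNECT;
    }
}
```

Passing the clock in as a parameter keeps the sketch deterministic. A real handler would more likely read `System.nanoTime()` and run `check` from a scheduled task.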


        Activity

        ingenthr Matt Ingenthron added a comment -

        I had fixed this with couchbase buckets and we test for this in SDKQE. Is this possibly isolated to memcached buckets?

        daschl Michael Nitschinger added a comment -

        Actually, this has been discovered while using Couchbase buckets. Let's chat about this, but I think that's a different issue and not related to this one.

        ingenthr Matt Ingenthron added a comment -

        Sure, the solution previously was to have that threshold if we were getting unexpected failures. Once we pass that threshold, we'd try to re-subscribe.

        I'm not opposed to a heartbeat, but I don't think the HTTP HEAD is a good choice, since the mochiweb Erlang implementation handles it effectively the same as a GET. I'll reach out to you and chat through it.
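The threshold approach mentioned above could look roughly like the following. This is a hedged sketch only: `FailureThreshold`, `recordFailure` and `recordSuccess` are hypothetical names, not the client's API. Which events count as an "unexpected failure", and whether a success resets the counter, are assumptions made for illustration.

```java
/** Hypothetical sketch of a failure threshold that triggers a re-subscribe. */
final class FailureThreshold {
    private final int limit;
    private int failures;

    FailureThreshold(int limit) {
        if (limit <= 0) {
            throw new IllegalArgumentException("limit must be positive");
        }
        this.limit = limit;
    }

    /** Records one unexpected failure; returns true once the threshold has been passed. */
    boolean recordFailure() {
        failures++;
        if (failures > limit) {
            failures = 0; // start counting afresh after asking for a re-subscribe
            return true;
        }
        return false;
    }

    /** Assumption: any successful operation clears the failure count. */
    void recordSuccess() {
        failures = 0;
    }
}
```

A caller would invoke `recordFailure()` from its error path and re-subscribe to the streaming connection whenever it returns `true`.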

        daschl Michael Nitschinger added a comment -

        Okay, after talking this through with Matt, I reran the script to see if the proposed solution (increasing ops/s) fixed the problem.

        Interestingly, it turns out that when doing sets/gets it never hits our anticipated codepath. Instead, the get operations just time out, the threshold never gets increased, and nothing happens. This is something we need to reinvestigate.

        daschl Michael Nitschinger added a comment -

        Closing this for now, because our fallback algorithm works as expected. The bug where the threshold was never triggered has been resolved in a different ticket.


          People

          • Assignee:
            daschl Michael Nitschinger
            Reporter:
            daschl Michael Nitschinger