Java DCP Client / JDCP-97

Java DCP client could not reconnect to Couchbase hosted on Docker Container

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 0.17.0
    • Fix Version/s: 0.20.0
    • Labels:
      None
    • Environment:
      Couchbase Docker image on MacOS and Couchbase Docker image on Openshift platform
    • Story Points:
      1

      Description

      We use the Couchbase Docker image both for development on laptops and in our DEV and QA environments.

      We develop on macOS, and our DEV and QA environments run on OpenShift v3.7.1 (Kubernetes v1.7.6).

      In the preproduction and production environments, Couchbase is deployed on VMs.

      • When using Java DCP client 0.17.0 or an older version hosted in Docker, the Java DCP client always fails to reconnect to the Couchbase container once we restart the container.
      • When using Java DCP client 0.17.0 or an older version hosted on a VM (or directly on a laptop), the Java DCP client always reconnects to the Couchbase container once we restart Couchbase.

      Steps to reproduce

      This issue occurs with both Couchbase 4.6.5 and Couchbase 5.1.0 (and probably all other versions).

      1. Run Couchbase in a Docker container as described in the configuration guide (https://developer.couchbase.com/documentation/server/current/install/getting-started-docker.html)
      2. Initialize the Couchbase server and install the sample bucket "travel-sample"
      3. Run the example described in the Readme on GitHub https://github.com/couchbase/java-dcp-client after increasing the sleep timeout to several minutes (see Main.java in the attachment)
      4. Make a mutation on a document in the "travel-sample" bucket through the Couchbase administrator interface
      5. The mutation is printed by the Java program
      6. Stop your Couchbase Docker container
      7. Start your Couchbase Docker container a few seconds later
      8. Make a mutation on a document in the "travel-sample" bucket through the Couchbase administrator interface
      9. The mutation is not printed by the Java program

      When running the same test with Couchbase not in a Docker container, step 9 prints the mutation.

      Thanks,

       

       


            Activity

            llagatie Ludovic LAGATIE added a comment -

            The second bullet should read "

            • When using Java DCP client 0.17.0 or an older version hosted on a VM (or directly on a laptop), the Java DCP client always reconnects to Couchbase once we restart Couchbase.

            " and not "

            • When using Java DCP client 0.17.0 or an older version hosted on a VM (or directly on a laptop), the Java DCP client always reconnects to the Couchbase container once we restart Couchbase."

            Sorry

            david.nault David Nault added a comment - - edited

            Current theory is that Docker is proxying the connection to the server inside the container, but is failing to propagate connection resets to clients outside the container.

            The DCP client supports dead connection detection, but this feature must be enabled by calling:

            Client.configure()
                .controlParam(DcpControl.Names.ENABLE_NOOP, "true")
                // other options...
            

            The default noop interval is 2 minutes. Dead connections will be detected after twice this interval (so 4 minutes in the default case). For quicker detection, set a noop interval as low as 20 seconds:

            Client.configure()
                .controlParam(DcpControl.Names.ENABLE_NOOP, "true")
                .controlParam(DcpControl.Names.SET_NOOP_INTERVAL, 20) // min recommended
                // other options...
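
To make the timing above concrete, here is a small sketch of the arithmetic only. It assumes nothing beyond what the comment states (detection after twice the noop interval); the class and method names are hypothetical and are not part of the DCP client API.

```java
import java.time.Duration;

/** Illustrative arithmetic only; hypothetical names, not DCP client API. */
final class NoopDetectionSketch {

    /** Per the comment above: a dead connection is noticed after twice the noop interval. */
    static Duration worstCaseDetection(Duration noopInterval) {
        return noopInterval.multipliedBy(2);
    }

    public static void main(String[] args) {
        Duration defaultInterval = Duration.ofMinutes(2); // default noop interval
        Duration minRecommended = Duration.ofSeconds(20); // lowest recommended interval

        System.out.println("default interval: detected after "
                + worstCaseDetection(defaultInterval).toMinutes() + " min");
        System.out.println("20 s interval: detected after "
                + worstCaseDetection(minRecommended).getSeconds() + " s");
    }
}
```

Any sleep timeout used while reproducing the issue would need to exceed this detection time for a noop-triggered reconnect to be observable.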
            

            ascione.d Daniele Ascione added a comment - - edited

            Ludovic LAGATIE I tried to reproduce the issue by following the steps you described, and I automated the process a bit for the test; you can find the code in this repo.
            Even without activating the noop (as David Nault suggested), I was able to retrieve the update from Couchbase in a container after it had been restarted, so I cannot reproduce the bug.
            You said that you're using OpenShift, so maybe you're using oc services (and endpoints) to connect the containers, which are probably running as pods.
            Can you please provide some information about that?

            llagatie Ludovic LAGATIE added a comment - - edited

            Hello Daniele Ascione,

            I got your test project, and I was not able to reproduce the problem with it either.

            I then ran your Java client example outside the Docker container, and I was able to reproduce the problem again.

            ➜  dcp-client-example git(master) export CB_HOSTNAME=localhost; export BUCKET=travel-sample; export CB_USER=Administrator; export CB_PASSWORD=password; export IS_NOOP=false; export MINUTES_WAITING=5

            ➜  dcp-client-example git(master) java -jar target/dcp-client-example-jar-with-dependencies.jar

            Starting the client with NOOP desabled, with a sleep time of: 300000 ms 

            Jul 23, 2018 4:30:48 PM com.couchbase.client.dcp.Client <init>

            INFOS: Environment Configuration Used: ClientEnvironment{clusterAt=[localhost/127.0.0.1:8091], connectionNameGenerator=DefaultConnectionNameGenerator, bucket='travel-sample', passwordSet=true, dcpControl=DcpControl{{connection_buffer_size=20480}}, eventLoopGroup=NioEventLoopGroup, eventLoopGroupIsPrivate=true, poolBuffers=true, bufferAckWatermark=60, connectTimeout=1, bootstrapTimeout=5000, sslEnabled=false, sslKeystoreFile='null', sslKeystorePassword=false, sslKeystore=null}

            Jul 23, 2018 4:30:48 PM com.couchbase.client.dcp.Client connect

            INFOS: Connecting to seed nodes and bootstrapping bucket travel-sample.

            Jul 23, 2018 4:30:48 PM com.couchbase.client.dcp.conductor.DcpChannel$2$1 operationComplete

            INFOS: Connected to Node localhost/127.0.0.1:11210

            Jul 23, 2018 4:30:49 PM com.couchbase.client.dcp.Client startStreaming

            INFOS: Starting to Stream for 1024 partitions

            [ Perform an update on couchbase via web-ui ]

            Mutation: MutationMessage [key: "airline_10", vbid: 361, cas: 1532356269559709696, bySeqno: 43, revSeqno: 10, flags: 33554438, expiry: 0, lockTime: 0, clength: 150]

            [ Restart the couchbase container in another terminal: "docker stop couchbase-server; sleep 10s; docker start couchbase-server" ]

            Jul 23, 2018 4:31:33 PM com.couchbase.client.dcp.conductor.DcpChannel dispatchReconnect

            INFOS: Node localhost/127.0.0.1:11210 socket closed, initiating reconnect.

            [ Perform an update on couchbase via web-ui ]

            [ No more output ]

             

            As you said, we are using OpenShift with oc services (and endpoints) to connect to the Couchbase container, which is running as a pod. We see the same behavior there as described above.

             

            david.nault David Nault added a comment -

            Still see this as important. Moving the fix version to 0.20.0 because it's not going to make the 0.19.0 train.

            david.nault David Nault added a comment -

            Experimentation shows the DCP client is immediately aware of the container shutdown, but the reconnect attempt hangs (the connect future listener is never invoked – not even with a failure). The connect timeout of 1 second doesn't appear to have any effect.

            The Couchbase Java SDK has some code that implements a "safeguard timeout"... might have to borrow that.
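
For illustration, the general idea of a safeguard timeout can be sketched with plain JDK futures. This is not the Couchbase Java SDK's code and not Netty; `SafeguardTimeoutSketch` and `withSafeguard` are hypothetical names, and a `CompletableFuture` stands in for the connect future whose listener is never invoked.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical sketch: force a hung future to fail after a deadline. */
final class SafeguardTimeoutSketch {

    /** Fails the future with a TimeoutException unless something else completes it first. */
    static <T> CompletableFuture<T> withSafeguard(CompletableFuture<T> future,
                                                  ScheduledExecutorService scheduler,
                                                  long timeout, TimeUnit unit) {
        // completeExceptionally is a no-op if the future has already completed.
        scheduler.schedule(
                () -> future.completeExceptionally(new TimeoutException("connect attempt hung")),
                timeout, unit);
        return future;
    }

    public static void main(String[] args) throws Exception {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        try {
            // Stands in for a connect attempt that never succeeds or fails on its own.
            CompletableFuture<Void> hungConnect = new CompletableFuture<>();
            withSafeguard(hungConnect, scheduler, 200, TimeUnit.MILLISECONDS);

            hungConnect.exceptionally(t -> {
                System.out.println("connect failed, a retry can be scheduled: " + t);
                return null;
            }).get(2, TimeUnit.SECONDS);
        } finally {
            scheduler.shutdownNow();
        }
    }
}
```

A safeguard like this only bounds how long a caller waits; it does not explain why the original future was never completed.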

            david.nault David Nault added a comment - - edited

            Note to self: investigate `ConnectInterceptingHandler` as a possible culprit; it does some fancy promise manipulation; is it possible that the underlying promise failure is not being propagated to the replacement promise?

            david.nault David Nault added a comment -

            ConnectInterceptingHandler is indeed the culprit. Netty is deactivating the channel before the handshake completes (or even begins, really). In this case, the original connect promise is left unfulfilled.

            The fix is to respond to channel deactivation by failing the connect promise.
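
The shape of that fix can be modeled with plain JDK types. This is a simplified sketch, not the actual patch and not Netty code: `InterceptingHandler` and its methods are hypothetical stand-ins, and a `CompletableFuture` plays the role of the connect promise.

```java
import java.nio.channels.ClosedChannelException;
import java.util.concurrent.CompletableFuture;

/** Hypothetical model of "fail the connect promise when the channel is deactivated". */
final class ConnectPromiseSketch {

    /** Stand-in for a handler that holds the caller's connect promise until the handshake is done. */
    static final class InterceptingHandler {
        private final CompletableFuture<Void> connectPromise = new CompletableFuture<>();

        CompletableFuture<Void> connectPromise() {
            return connectPromise;
        }

        /** Called when the handshake finishes successfully. */
        void handshakeComplete() {
            connectPromise.complete(null);
        }

        /** Called when the channel goes inactive; fails the promise only if it is still pending. */
        void channelInactive() {
            connectPromise.completeExceptionally(new ClosedChannelException());
        }
    }

    public static void main(String[] args) {
        InterceptingHandler handler = new InterceptingHandler();
        handler.connectPromise().whenComplete((ok, err) ->
                System.out.println(err == null ? "connected" : "connect failed: " + err));

        // The channel is deactivated before the handshake begins.
        handler.channelInactive();
    }
}
```

Without the `channelInactive` step, the promise in this model would stay pending forever, which matches the hang described earlier in the thread.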


              People

              Assignee:
              david.nault David Nault
              Reporter:
              llagatie Ludovic LAGATIE