Couchbase Java Client / JCBC-467

Java client not aware of failed over node under certain circumstances

Details

    • Task
    • Resolution: Won't Do
    • Critical
    • None
    • 1.4.2
    • Core
    • Security Level: Public
    • None

    Description

      This is similar to http://www.couchbase.com/communities/q-and-a/java-client-not-aware-about...

      This is using the 1.4.2 java client.

      If I fail over from the admin console while the node is still up, the problem does not reproduce. The best way I found to simulate a node suddenly becoming unresponsive is to use iptables to block all traffic except SSH on port 22:

      To block a node:
      iptables -A INPUT -p tcp --dport 22 -j ACCEPT; iptables -A INPUT -j DROP

      To unblock a node:
      iptables -F

      The problem also does not seem to reproduce if I remove the nested try/catch (i.e. if I don't try to read from the replica).
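For readers unfamiliar with the pattern, the nested try/catch mentioned above can be sketched roughly as follows. This is only an illustration, not the reproduction code from this ticket: `KeyReader` and `ReplicaFallbackReader` are hypothetical names, and the two interface methods are assumed to correspond to the client's `get(key)` and `getFromReplica(key)` calls, with a failed read surfacing as a `RuntimeException`.

```java
// Hypothetical stand-in for the two client calls used in the read path.
// Assumption: these map onto get(key) and getFromReplica(key) on the real client.
interface KeyReader {
    Object get(String key);
    Object getFromReplica(String key);
}

// Tries the master first and falls back to the replica if the master read fails.
final class ReplicaFallbackReader {
    private final KeyReader reader;

    ReplicaFallbackReader(KeyReader reader) {
        this.reader = reader;
    }

    // Returns a line describing which copy served the read, or why both failed.
    String read(String key) {
        try {
            Object value = reader.get(key);
            return "Got From Master: " + value;
        } catch (RuntimeException masterFailure) {
            // Nested try/catch: the master read failed, so try the replica.
            try {
                Object value = reader.getFromReplica(key);
                return "Got From Replica: " + value;
            } catch (RuntimeException replicaFailure) {
                return "Failed: " + replicaFailure.getMessage();
            }
        }
    }
}
```

With the outer catch removed, a failed master read would simply propagate, which is the variant that reportedly does not trigger the problem.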

      Failover does not seem to be instantaneous; it takes 1-2 minutes with my hardware and setup. The following steps can be followed with the code below (it might be reproducible with fewer steps, but this sequence seems consistent):

      1) Create a two node cluster with 1 level of replication
      2) Set the code below with the proper host names, bucket name and password
      3) Run the code and you will see "Got From Master" 5 times
      4) It will then pause and ask you to block traffic from the master node
      5) Look at the admin console to see which node is the master for the key
      6) Block the master node with iptables and then hit 'Enter'
      7) Go back to the output and it will then output "Got From Replica" 10 times
      8) It will then pause and ask you to go to the admin console (on the replica node)
      9) Wait until the master node is marked as "Down"
      10) Once it is marked as "Down", fail over the master node
      11) Go back to the console and hit 'Enter'

      At this point the console should continue printing "Got From Replica". If you look at the admin console, the replica node still shows 0 items active and 1 item replicated. After 1-3 minutes it should suddenly show 1 item active and 0 items replicated (failover seems delayed). At the same time, an exception starts showing up in the console.

      Expected: Once the node is fully failed over, the client should no longer need to read from the replica and should read from the promoted master.

      Observed: The client does not seem to be able to read from either the master or the replica. It appears that the client is not marking the promoted replica as the new master.

      Questions:

      1) What is going on during the failover? I would have expected failover to be very fast and not take 1-5 minutes, especially since I only have one item in the store.
      2) Does anyone know of a workaround? If I catch the exception and rebuild the client, it works. But this would be horrible, since the client is accessed by multiple threads.
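To make the rebuild workaround from question 2 concrete, here is a minimal sketch of how a shared client could be swapped out without every thread rebuilding it at once. It is not a confirmed fix: `StoreClient`, `RebuildingClient`, and the factory are hypothetical names standing in for the real client and its construction code, and the sketch assumes a stale client fails with a `RuntimeException`.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

// Hypothetical stand-in for the subset of the client used here.
interface StoreClient {
    Object get(String key);
    void shutdown();
}

// Wraps a shared client and replaces it once when a read fails.
final class RebuildingClient {
    private final Supplier<StoreClient> factory;
    private final AtomicReference<StoreClient> current;

    RebuildingClient(Supplier<StoreClient> factory) {
        this.factory = factory;
        this.current = new AtomicReference<>(factory.get());
    }

    Object get(String key) {
        StoreClient seen = current.get();
        try {
            return seen.get(key);
        } catch (RuntimeException failure) {
            // Replace the client this thread saw failing, then retry once.
            rebuild(seen);
            return current.get().get(key);
        }
    }

    // Only the first thread to report a given failed instance rebuilds it;
    // later threads find a newer client already installed and return.
    private synchronized void rebuild(StoreClient failed) {
        if (current.get() != failed) {
            return;
        }
        current.set(factory.get());
        failed.shutdown();
    }
}
```

Threads that are mid-call on the old instance when it is shut down may still see a failure and go through the same retry path, so this limits the cost of rebuilding rather than hiding it entirely.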

          People

            daschl Michael Nitschinger
            winstonwaite Robert Waite