Details
Task
Resolution: Won't Do
Critical
None
1.4.2
Security Level: Public
None
Description
This is similar to http://www.couchbase.com/communities/q-and-a/java-client-not-aware-about...
This is using the 1.4.2 Java client.
If I fail over from the admin console while the node is still up, the problem does not reproduce. I found the best way to simulate the node suddenly becoming unresponsive is to use iptables to block all traffic except SSH on port 22, like so:
To block a node:
iptables -A INPUT -p tcp --dport 22 -j ACCEPT; iptables -A INPUT -j DROP
To unblock a node:
iptables -F
The problem also does not seem to reproduce if I remove the nested try/catch (i.e. don't try to read from the replica).
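For readers without the reproduction code at hand, a nested try/catch of this kind has roughly the following shape. This is only a sketch: `ReplicaFallbackRead`, `Result`, and the two `Supplier` arguments are hypothetical stand-ins for the master read and the replica read, not the 1.4.2 client API.

```java
import java.util.function.Supplier;

// Hypothetical sketch: the two Suppliers stand in for the client's
// master read and replica read; they are NOT the real 1.4.2 API.
final class ReplicaFallbackRead {

    // Carries the value plus which path produced it.
    static final class Result {
        final String source;
        final Object value;

        Result(String source, Object value) {
            this.source = source;
            this.value = value;
        }
    }

    static Result read(Supplier<Object> fromMaster, Supplier<Object> fromReplica) {
        try {
            // Outer try: normal read against the active (master) copy.
            return new Result("Got From Master", fromMaster.get());
        } catch (RuntimeException masterFailure) {
            try {
                // Nested try: fall back to the replica copy.
                return new Result("Got From Replica", fromReplica.get());
            } catch (RuntimeException replicaFailure) {
                // Both reads failed: keep the master failure for diagnosis.
                replicaFailure.addSuppressed(masterFailure);
                throw replicaFailure;
            }
        }
    }
}
```

Removing the nested block corresponds to calling only `fromMaster` and letting its exception propagate.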
Failover does not seem to be instantaneous; it takes 1-2 minutes with my hardware and setup. The following steps can be seen with the code below (it might be reproducible with fewer steps, but this sequence seems consistent):
1) Create a two node cluster with 1 level of replication
2) Set the code below with the proper host names, bucket name and password
3) Run the code and you will see "Got From Master" 5 times
4) It will then pause and ask you to block traffic from the master node
5) Look at the admin console to see which node is the master for the key
6) Block the master node with iptables and then hit 'Enter'
7) Go back to the output and it will then output "Got From Replica" 10 times
8) It will then pause and ask you to go to the admin console (on the replica node)
9) Wait until the master node is marked as "Down"
10) Once it is marked as "Down", fail over the master node
11) Go back to the console and hit 'Enter'
At this point the console should continue printing "Got From Replica". If you look at the admin console, the replica node still has 0 items active and 1 item replicated. After 1-3 minutes it should suddenly say 1 item is active and 0 items are replicated (failover seems delayed). You will also notice that, at the same time, an exception starts showing up in the console.
Expected: Once the node is fully failed over, the client should no longer need to read from the replica and should read from the promoted master.
Observed: It doesn't seem to be able to read from the master or the replica. It appears that the client is not marking the promoted replica as the new master.
Questions:
1) What is going on during the failover? I would have thought that failover would be very fast and not take 1-5 minutes, especially since I only have one item in the store.
2) Does anyone know of a workaround? If I catch the exception and rebuild the client, it works. But this would be horrible since the client is accessed by multiple threads.
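One way the rebuild workaround from question 2 might be made tolerable with multiple threads is to put the client behind a small holder that swaps in a fresh instance at most once per failure. The sketch below is an assumption-laden illustration, not a confirmed fix: `RebuildableClient`, `factory`, and `call` are hypothetical names, `C` stands for whatever client type is in use, and it treats any `RuntimeException` as a reason to rebuild, which a real application would want to narrow.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Function;
import java.util.function.Supplier;

// Hypothetical sketch: C is whatever client type is in use and
// `factory` builds a fresh instance. Not part of any Couchbase API.
final class RebuildableClient<C> {
    private final Supplier<C> factory;
    private final AtomicReference<C> current;

    RebuildableClient(Supplier<C> factory) {
        this.factory = factory;
        this.current = new AtomicReference<>(factory.get());
    }

    // Runs the operation; on failure, rebuilds the client and retries once.
    <R> R call(Function<C, R> operation) {
        C seen = current.get();
        try {
            return operation.apply(seen);
        } catch (RuntimeException failure) {
            rebuildIfStill(seen);
            // Single retry against whichever instance is current now.
            return operation.apply(current.get());
        }
    }

    // Only the first thread to report a given instance triggers a rebuild;
    // later threads find it already replaced and reuse the new one.
    private synchronized void rebuildIfStill(C failed) {
        if (current.get() == failed) {
            current.set(factory.get());
        }
    }
}
```

All threads would share one `RebuildableClient` and go through `call`. The sketch deliberately does not shut down the replaced instance, because other threads may still be in the middle of an operation on it; releasing it safely is left open here.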