Details

Type: Task
Resolution: Won't Do
Priority: Critical
Fix Version/s: None
Affects Version/s: 1.4.2
Component/s: Core
Security Level: Public
Labels: None

Description

This is similar to http://www.couchbase.com/communities/q-and-a/java-client-not-aware-about...

This is using the 1.4.2 Java client.

If I fail over from the admin console while the node is still up, the problem does not reproduce. The best way I have found to simulate a node suddenly becoming unresponsive is to use iptables to block all traffic except SSH on port 22:

To block a node:
iptables -A INPUT -p tcp --dport 22 -j ACCEPT; iptables -A INPUT -j DROP

To unblock a node:
iptables -F

The problem also does not seem to reproduce if I remove the nested try/catch (i.e., if the code does not try to read from the replica).
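
For reference, the nested try/catch read pattern mentioned above might look roughly like the sketch below. This is not the attached test code. `KeyReader` and `ReplicaFallbackReader` are hypothetical names introduced here so the example runs without a cluster; the two interface methods are assumed to correspond to the client's `get` and `getFromReplica` calls, and the "Failed" message is likewise an assumption.

```java
/**
 * Hypothetical stand-in for the two reads used by the test.
 * It is NOT the real client API; the method names are assumed to
 * mirror the client's get(key) and getFromReplica(key) calls.
 */
interface KeyReader {
    Object get(String key);            // read from the active (master) copy
    Object getFromReplica(String key); // read from a replica copy
}

/** Tries the master first and falls back to the replica on failure. */
final class ReplicaFallbackReader {
    private final KeyReader reader;

    ReplicaFallbackReader(KeyReader reader) {
        this.reader = reader;
    }

    /** Returns a line describing where the value was read from. */
    String read(String key) {
        try {
            return "Got From Master: " + reader.get(key);
        } catch (RuntimeException masterFailure) {
            // Nested try/catch: only reached when the master read failed.
            try {
                return "Got From Replica: " + reader.getFromReplica(key);
            } catch (RuntimeException replicaFailure) {
                // Assumed message; the real test may report this differently.
                return "Failed: " + replicaFailure.getMessage();
            }
        }
    }
}
```

With a real client, a thin adapter implementing `KeyReader` would delegate both methods to the client instance.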

Failover does not seem to be instantaneous; it takes 1-2 minutes with my hardware and setup. The following steps reproduce the problem with the test code (see Attachments). It might be reproducible with fewer steps, but this sequence is consistent:

1) Create a two node cluster with 1 level of replication
2) Set the proper host names, bucket name and password in the test code
3) Run the code and you will see "Got From Master" 5 times
4) It will then pause and ask you to block traffic from the master node
5) Look at the admin console to see which node is the master for the key
6) Block the master node with iptables and then hit 'Enter'
7) Go back to the output and it will then output "Got From Replica" 10 times
8) It will then pause and ask you to go to the admin console (on the replica node)
9) Wait until the master node is marked as "Down"
10) Once it is marked as "Down", fail over the master node
11) Go back to the console and hit 'Enter'

At this point the console continues printing "Got From Replica". In the admin console, the replica node still shows 0 active items and 1 replica item. After 1-3 minutes it suddenly shows 1 active item and 0 replica items (failover seems delayed). At the same moment, an exception starts showing up in the console output.

Expected: Once the node is fully failed over, the client should no longer need to read from the replica and should read from the promoted master.

Observed: The client does not seem to be able to read from either the master or the replica. It appears that the client is not marking the promoted replica as the new master.

Questions:

1) What is going on during the failover? I would have expected failover to be very fast rather than taking 1-5 minutes, especially since I only have one item in the store.
2) Does anyone know of a workaround? If I catch the exception and rebuild the client, it works. But this would be horrible since the client is accessed by multiple threads.
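
One way the rebuild workaround from question 2 could be made less painful with multiple threads is to let only the first thread that observes the failure rebuild the shared client, while later threads reuse the replacement. The sketch below is a minimal illustration under stated assumptions, not a confirmed fix for this issue. `RebuildableClient` is a hypothetical name, the `Supplier` stands in for whatever code constructs the real client, and the example assumes Java 8 or later for `java.util.function.Supplier`.

```java
import java.util.function.Supplier;

/**
 * Hypothetical holder for a shared client instance.
 * The factory is an assumption: it represents whatever code
 * builds a fresh client (host list, bucket, password).
 */
final class RebuildableClient<C> {
    private final Supplier<C> factory;
    private final Object lock = new Object();
    private volatile C current;

    RebuildableClient(Supplier<C> factory) {
        this.factory = factory;
        this.current = factory.get();
    }

    /** Returns the client instance that callers should use right now. */
    C get() {
        return current;
    }

    /**
     * Rebuilds the client only if 'failed' is still the current instance.
     * A thread that arrives after another thread already rebuilt gets the
     * replacement instead of triggering a second rebuild.
     */
    C rebuildIfCurrent(C failed) {
        synchronized (lock) {
            if (current == failed) {
                current = factory.get();
            }
            return current;
        }
    }
}
```

A caller would take the instance from `get()`, use it, and on an exception pass that same instance to `rebuildIfCurrent` before retrying. The sketch does not shut down the replaced client; with a real client that cleanup would have to be added.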

Attachments

- anotherMultithreadTest.java (3 kB, 18/Jun/14 9:45 PM)
- CouchbaseClientTester.java (2 kB, 10/Jun/14 12:50 AM)
- CouchbaseClientTesterFastAndThreaded.java (3 kB, 11/Jun/14 1:03 PM)
- CouchbaseClientTesterNewer.java (2 kB, 11/Jun/14 10:32 AM)
- logAfterSeemingFreeze.txt (30 kB, 11/Jun/14 1:05 PM)
- threadDumpDuringFreeze.txt (7 kB, 13/Jun/14 2:07 PM)
- threadDumpDuringFreezeSunJVM.txt (7 kB, 13/Jun/14 2:19 PM)

People

Assignee: Michael Nitschinger

Reporter: Robert Waite

Votes: 0

Watchers: 3

Dates

Created: 10/Jun/14 12:47 AM

Updated: 21/Jun/22 7:34 AM

Resolved: 21/Jun/22 7:32 AM

Java client not aware of failed over node under certain circumstances
