Details
- Bug
- Resolution: Unresolved
- Minor
- 2.8.0-dp
- None
- Security Level: Public
- None
Description
The IO handling will wait forever if the downstream server does not reply. With Membase, this can happen when a node is pending or unhealthy. We normally use the continuous operation timeout counter to detect that there is a problem and then drop the connection, cancelling all outstanding operations, but this can fail.
If, for instance, we never look at the OperationFuture result, we never time out.
I think a simple test of this would look like the following:
- Start some sink listening on a socket (netcat -l would do fine for this)
- Start a client that does some sets, then regularly does synchronous gets forever
Expected behavior: after a while, the client would time out the stalled operations and start working again.
Observed behavior: the client hangs for many, many minutes (provable) and probably forever (unprovable).
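For anyone who prefers not to depend on netcat, the sink from the test above could be sketched in plain Java. This is only an illustration: the class name SilentSink and the helper staysSilent are hypothetical and not part of spymemcached, and the probe sends a raw text-protocol "get" rather than using the real client.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: accepts TCP connections, reads everything, never replies. */
final class SilentSink implements AutoCloseable {
    private final ServerSocket server;

    SilentSink(int port) {
        try {
            server = new ServerSocket(port); // port 0 lets the OS pick a free port
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Thread acceptor = new Thread(this::acceptLoop, "silent-sink");
        acceptor.setDaemon(true);
        acceptor.start();
    }

    int port() {
        return server.getLocalPort();
    }

    private void acceptLoop() {
        while (!server.isClosed()) {
            try {
                Socket s = server.accept();
                Thread t = new Thread(() -> drain(s), "silent-sink-conn");
                t.setDaemon(true);
                t.start();
            } catch (IOException e) {
                return; // server socket was closed
            }
        }
    }

    // Swallow whatever the client sends; nothing is ever written back.
    private static void drain(Socket s) {
        try (Socket sock = s; InputStream in = sock.getInputStream()) {
            byte[] buf = new byte[4096];
            while (in.read(buf) != -1) {
                // discard
            }
        } catch (IOException ignored) {
            // client went away
        }
    }

    /** Sends one request and reports whether no reply arrived within timeoutMillis. */
    static boolean staysSilent(int port, int timeoutMillis) {
        try (Socket c = new Socket("127.0.0.1", port)) {
            c.setSoTimeout(timeoutMillis);
            c.getOutputStream().write("get foo\r\n".getBytes(StandardCharsets.US_ASCII));
            c.getOutputStream().flush();
            try {
                c.getInputStream().read();
                return false; // got data or EOF, so the peer is not silent
            } catch (SocketTimeoutException e) {
                return true;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public void close() {
        try {
            server.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Pointing a real client at sink.port() and running the sets and gets from the steps above should then exercise the hang without any external tools.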
I believe the problem is around line 241 in MemcachedConnection.java:
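long delay=0;
if(!reconnectQueue.isEmpty()) {
// (body missing from this report; presumably it sets delay from the next reconnect time)
}
getLogger().debug("Selecting with delay of %sms", delay);
assert selectorsMakeSense() : "Selectors don't make sense.";
// A delay of 0 makes select() block until a key is ready or the selector is woken up.
int selected=selector.select(delay);
Set<SelectionKey> selectedKeys=selector.selectedKeys();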
The problem is that a delay of 0 makes select() block indefinitely, so if an operation that was written never gets a response, and no one checks for the timeout, the select call will just sit there.
I'm not certain a 0 timeout ever makes sense. If the connection is disconnected, doing the exponential backoff makes sense, but waiting forever seems like it could be a problem unless we change where we count timeouts.
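One possible direction, sketched under assumptions rather than taken from the actual code: cap the select delay so the loop always wakes up, and sweep outstanding operations for timeouts on the IO side instead of relying on the caller inspecting the future. The names BoundedSelectSketch, PendingOp, MAX_SELECT_MILLIS and selectAndSweep are all hypothetical, and the 100 ms cap is an arbitrary illustrative value.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.Selector;
import java.util.Deque;
import java.util.Iterator;

final class BoundedSelectSketch {
    /** Hypothetical stand-in for an operation that was written and awaits a reply. */
    static final class PendingOp {
        final long writtenAtMillis;
        boolean cancelled;

        PendingOp(long writtenAtMillis) {
            this.writtenAtMillis = writtenAtMillis;
        }
    }

    // Assumed upper bound on a single select() wait; never 0, so never "forever".
    static final long MAX_SELECT_MILLIS = 100;

    static Selector openSelector() {
        try {
            return Selector.open();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /**
     * One pass of an IO loop: wait at most MAX_SELECT_MILLIS, then cancel every
     * pending op older than opTimeoutMillis. Returns the number cancelled.
     * The deque is assumed to be in write order (oldest first).
     */
    static int selectAndSweep(Selector selector, Deque<PendingOp> pending,
                              long reconnectDelayMillis, long opTimeoutMillis) {
        long delay = reconnectDelayMillis <= 0
                ? MAX_SELECT_MILLIS
                : Math.min(reconnectDelayMillis, MAX_SELECT_MILLIS);
        try {
            selector.select(delay); // positive timeout: returns even if nothing is ready
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        long now = System.currentTimeMillis();
        int cancelled = 0;
        for (Iterator<PendingOp> it = pending.iterator(); it.hasNext();) {
            PendingOp op = it.next();
            if (now - op.writtenAtMillis < opTimeoutMillis) {
                break; // everything after this op is newer
            }
            op.cancelled = true;
            it.remove();
            cancelled++;
        }
        return cancelled;
    }
}
```

The trade-off is a periodic wakeup even when the connection is idle, in exchange for timeouts being counted regardless of whether anyone ever looks at the result.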
In practice, there are probably few situations where ops are requested and ignored, so the continuous operation timeout would probably handle it in most of those cases. It can happen (think bulk loading!) though, so it's probably good to address it.