Details
Type: Task
Resolution: Unresolved
Priority: Major
Fix Version/s: None
Affects Version/s: 2.12.0
Security Level: Public
Labels: None
Description
Setup:
-multiple memcached servers
-ConnectionFactory.setFailureMode(FailureMode.Cancel) (or Retry)
Conditions:
-one of the memcached servers goes down, or is restarted
Observations:
-single-key operations are immediately cancelled (they throw a CancellationException)
-multi-key operations (getBulk()/asyncGetBulk()) do not get cancelled; instead they time out on the inactive node
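For reference, a minimal repro sketch (the host names and keys are placeholders; the comments restate the observations above):
{code:java}
import java.util.Arrays;
import java.util.Map;
import net.spy.memcached.AddrUtil;
import net.spy.memcached.ConnectionFactoryBuilder;
import net.spy.memcached.FailureMode;
import net.spy.memcached.MemcachedClient;

public class BulkCancelRepro {
  public static void main(String[] args) throws Exception {
    // Two servers; shut one of them down to reproduce.
    MemcachedClient client = new MemcachedClient(
        new ConnectionFactoryBuilder()
            .setFailureMode(FailureMode.Cancel)
            .build(),
        AddrUtil.getAddresses("host1:11211 host2:11211"));

    try {
      // Single-key operation: fails fast when the key hashes
      // to the dead node (CancellationException).
      client.get("someKey");
    } catch (RuntimeException expected) {
      // cancelled immediately, as expected
    }

    // Multi-key operation: keys on the dead node are NOT
    // cancelled; the call blocks until the operation times out.
    Map<String, Object> values =
        client.getBulk(Arrays.asList("key1", "key2", "key3"));
    System.out.println(values);
    client.shutdown();
  }
}
{code}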
The cause seems to be the code in MemcachedClient.asyncGetBulk(): it checks only the node's active status, never the FailureMode value. If a node is inactive, it always emulates the Redistribute failure mode (the default).
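Roughly, the per-key node selection inside asyncGetBulk() behaves like the following simplified sketch (not the verbatim source; currentNodeForKey() is my name for the inlined logic):
{code:java}
import java.util.Iterator;
import net.spy.memcached.MemcachedNode;
import net.spy.memcached.NodeLocator;

class CurrentBehaviorSketch {
  // Only isActive() is consulted; the FailureMode is never
  // checked, so an inactive primary always triggers a
  // Redistribute-style fallback.
  static MemcachedNode currentNodeForKey(NodeLocator locator, String key) {
    MemcachedNode primary = locator.getPrimary(key);
    if (primary.isActive()) {
      return primary;
    }
    for (Iterator<MemcachedNode> i = locator.getSequence(key); i.hasNext();) {
      MemcachedNode n = i.next();
      if (n.isActive()) {
        return n; // redistributed, whatever the configured FailureMode
      }
    }
    return primary; // nothing active: queue on the dead primary
  }
}
{code}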
The attached patch checks the ConnectionFactory's failure mode and emulates the behavior of MemcachedConnection.addOperation (see the sketch after this list):
-if the node is active or FailureMode is Retry, use the primary node
-if the node is inactive and FailureMode is Cancel, don't create an operation (no value will be returned for that key)
-otherwise, redistribute (existing default behavior)
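The patched selection then looks roughly like this (nodeForKey() is a hypothetical helper; the actual patch works inline in asyncGetBulk() and reads the failure mode via ConnectionFactory.getFailureMode()):
{code:java}
import java.util.Iterator;
import net.spy.memcached.FailureMode;
import net.spy.memcached.MemcachedNode;
import net.spy.memcached.NodeLocator;

class PatchSketch {
  // Returns the node to use for a key, or null when the key
  // should be skipped (Cancel mode with an inactive node), in
  // which case no operation is created and the bulk result
  // simply misses that key.
  static MemcachedNode nodeForKey(NodeLocator locator,
                                  FailureMode failureMode,
                                  String key) {
    MemcachedNode primary = locator.getPrimary(key);
    if (primary.isActive() || failureMode == FailureMode.Retry) {
      return primary; // active node, or Retry: use the primary
    }
    if (failureMode == FailureMode.Cancel) {
      return null; // behaves like a cache miss for this key
    }
    // Redistribute (existing default): next active node, if any.
    for (Iterator<MemcachedNode> i = locator.getSequence(key); i.hasNext();) {
      MemcachedNode n = i.next();
      if (n.isActive()) {
        return n;
      }
    }
    return primary;
  }
}
{code}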
This patch is not perfect:
-it relies on the ConnectionFactory's failure mode, not the FailureMode value of the node's connection (which is not visible); I'm pretty sure the two values will be the same, though.
-it doesn't throw a CancellationException when the FailureMode is Cancel and a node is inactive: instead it behaves like a cache miss. This is a compromise. The code could throw a CancellationException when a node is down, but that seems very inefficient if a single key, out of many, is currently inaccessible.
This compromise is acceptable for us: we're looking for as little service impact as possible when one of our memcached servers goes down. The current behavior (timeout) causes a big pile-up and cascading timeouts.