  1. Spymemcached Java Client
  2. SPY-188

MemcachedClient.getBulk() doesn't use FailureMode

Details

    • Task
    • Resolution: Unresolved
    • Major
    • None
    • 2.12.0
    • library
    • Security Level: Public
    • None

    Description

      Setup:
      -multiple memcached servers
      -ConnectionFactoryBuilder.setFailureMode(FailureMode.Cancel) (or Retry)

      Conditions:
      -one of the memcached servers goes down, or is restarted

      Observations:
      -single-key operations are canceled immediately (a CancellationException is thrown)
      -multi-key operations (getBulk()/asyncGetBulk()) are not canceled. Instead, they time out on the inactive node.

      The cause seems to be the code in MemcachedClient.asyncGetBulk(): there is no check on the FailureMode value, only the node's active status. If a node is inactive, the code emulates the Redistribute failure mode (default failure mode).

      The attached patch checks the ConnectionFactory's failure mode, and emulates the behavior of MemcachedConnection.addOperation:
      -if the node is active or FailureMode is Retry, use the primary node
      -if the node is inactive and FailureMode is Cancel, don't create an operation (no value will be returned for that key)
      -otherwise, redistribute (existing default behavior)
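
      The three rules above can be sketched as a small decision function. This is only a simplified, self-contained model, not the attached patch and not spymemcached code: the class BulkRoutingSketch, the Decision enum and the decide() method are hypothetical names, and the FailureMode enum here is a local stand-in that merely mirrors the three modes named in this report.

```java
// Hypothetical, simplified model of the per-key routing decision.
// None of these types are spymemcached classes.
class BulkRoutingSketch {

    // Local stand-in mirroring the three failure modes named in the report.
    enum FailureMode { Redistribute, Retry, Cancel }

    // What to do with one key of a bulk get (hypothetical names).
    enum Decision { USE_PRIMARY, SKIP_KEY, REDISTRIBUTE }

    static Decision decide(boolean primaryActive, FailureMode mode) {
        // Active node, or Retry mode: queue the key on its primary node.
        if (primaryActive || mode == FailureMode.Retry) {
            return Decision.USE_PRIMARY;
        }
        // Inactive node and Cancel mode: create no operation for this key.
        if (mode == FailureMode.Cancel) {
            return Decision.SKIP_KEY;
        }
        // Otherwise fall back to another node (the existing default behavior).
        return Decision.REDISTRIBUTE;
    }

    public static void main(String[] args) {
        // Print the decision for every combination of mode and node state.
        for (FailureMode mode : FailureMode.values()) {
            for (boolean active : new boolean[] {true, false}) {
                System.out.println(mode + ", primaryActive=" + active
                        + " -> " + decide(active, mode));
            }
        }
    }
}
```

      Keeping the decision in one place like this would make it easy to compare against MemcachedConnection.addOperation, which the patch is meant to emulate.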

      This patch is not perfect:
      -it relies on the ConnectionFactory failure mode, not the node's connection's FailureMode value (not visible); I'm pretty sure the values will be the same though.
      -it doesn't throw a CancellationException if the FailureMode is Cancel and a node is inactive: instead, it behaves like a "cache miss". This is a compromise. The code could throw a CancellationException when a node is down, but that seems very inefficient if only a single key out of many is currently inaccessible.
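
      One consequence of the "cache miss" compromise is that a caller cannot tell a skipped key from a real miss: both are simply absent from the returned map. A minimal caller-side sketch, assuming the bulk result is a Map<String, Object> like the one getBulk() returns; the class BulkMissSketch and the method missingKeys() are hypothetical helpers, not part of the library or the patch:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical caller-side helper; not part of spymemcached.
class BulkMissSketch {

    // Returns the requested keys that have no entry in the bulk result.
    // With the compromise described above, this covers both real cache
    // misses and keys whose node was inactive.
    static List<String> missingKeys(List<String> requested,
                                    Map<String, Object> bulkResult) {
        List<String> missing = new ArrayList<>();
        for (String key : requested) {
            if (!bulkResult.containsKey(key)) {
                missing.add(key);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        List<String> requested = Arrays.asList("a", "b", "c");

        // Simulated bulk result: key "b" is absent.
        Map<String, Object> result = new HashMap<>();
        result.put("a", 1);
        result.put("c", 3);

        System.out.println(missingKeys(requested, result)); // prints [b]
    }
}
```

      The caller can then load the missing keys from the backing store, exactly as it would for ordinary misses.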

      This compromise is acceptable for us: we're looking for as little service impact as possible when one of our memcached servers goes down. The current behavior (timeout) causes a big pile-up and cascading timeouts.

      Attachments

          People

            daschl Michael Nitschinger
            dcfeedly David Chatenay