Uploaded image for project: 'Spymemcached Java Client'
  1. Spymemcached Java Client
  2. SPY-123

Race Condition on Selectors when adding/removing nodes

    Details

    • Type: Task
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 2.8.12
    • Fix Version/s: 2.9.0
    • Component/s: library
    • Security Level: Public
    • Labels:
      None
    No reviews matched the request. Check your Options in the drop-down menu of this sections header.

      Activity

      Hide
      ingenthr Matt Ingenthron added a comment -

      Comments from Michael:

      I found that the keys() method around the selectors and the key set is not thread safe. My thought is that we should only access it from the IO thread but obviously there is something wrong here. I'll trace this back and report findings.
      Michael Nitschinger added a comment - 22/May/13 2:39 AM - edited

      Okay, I think I now whats going on here:

      In this section, we check if lots of timeouts occured on one of the nodes and if they do, they get reconnected. The code iterates over the selectors, which contain the nodes as attachments. Now we reconnect in there and also notify observers, maybe you see where this goes: my impression is that we modify the selection key set (adding new nodes?) while the code still iterates and does some work on the selectors.

      The easiest fix is to wrap it and check for this exception and if it happens just continue. I want to determine the root cause though, maybe we have a flaw in there that also affects the other CBSE.

      If you look at the selector management inside netty, they also seem to deal with this commonly (and we dont, currently). Like https://github.com/netty/netty/blob/master/transport/src/main/java/io/netty/channel/nio/NioEventLoop.java#L254 and https://github.com/netty/netty/blob/master/transport/src/main/java/io/netty/channel/nio/NioEventLoop.java#L407

      It may has to do with their huge timeout increase so they queue up (and thats why we havent seen it somewhere else reported). I'll write up a test script with their params to see if I can repro it locally.
      Michael Nitschinger added a comment - 22/May/13 3:26 AM

      Ha - i got it tracked down once - but I have no idea how.. didnt get to get it again. Captured the strack trace though

      2013-05-22 11:50:27.336 INFO com.couchbase.client.CouchbaseConnection: Connection state changed for sun.nio.ch.SelectionKeyImpl@6766afb3
      2013-05-22 11:50:27.360 INFO com.couchbase.client.ViewConnection: Added 192.168.56.101 to connect queue
      2013-05-22 11:50:27.361 INFO com.couchbase.client.CouchbaseClient: viewmode property isn't defined. Setting viewmode to production mode
      2013-05-22 11:51:46.988 INFO com.couchbase.client.CouchbaseConnection: Added

      {QA sa=/192.168.56.102:11210, #Rops=0, #Wops=0, #iq=0, topRop=null, topWop=null, toWrite=0, interested=0}

      to connect queue
      Exception in thread "Memcached IO over

      {MemcachedConnection to 192.168.56.101/192.168.56.101:11210}

      " java.util.ConcurrentModificationException
      at java.util.HashMap$HashIterator.nextEntry(HashMap.java:793)
      at java.util.HashMap$KeyIterator.next(HashMap.java:828)
      at java.util.Collections$UnmodifiableCollection$1.next(Collections.java:1008)
      at net.spy.memcached.MemcachedConnection.handleIO(MemcachedConnection.java:257)
      at com.couchbase.client.CouchbaseConnection.run(CouchbaseConnection.java:265)
      2013-05-22 11:51:48.337 INFO com.couchbase.client.ViewConnection: Added 192.168.56.102 to connect queue
      Exception in thread "main" net.spy.memcached.OperationTimeoutException: Timeout waiting for value
      at net.spy.memcached.MemcachedClient.get(MemcachedClient.java:1036)
      at net.spy.memcached.MemcachedClient.get(MemcachedClient.java:1051)
      at MainApp.main(MainApp.java:14)
      at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
      at java.lang.reflect.Method.invoke(Method.java:597)
      at com.intellij.rt.execution.application.AppMain.main(AppMain.java:120)
      Caused by: net.spy.memcached.internal.CheckedOperationTimeoutException: Timed out waiting for operation - failing node: 192.168.56.101/192.168.56.101:11210
      at net.spy.memcached.internal.OperationFuture.get(OperationFuture.java:160)
      at net.spy.memcached.internal.GetFuture.get(GetFuture.java:62)
      at net.spy.memcached.MemcachedClient.get(MemcachedClient.java:1026)
      ... 7 more

      I'll propose some fixes on both spy and CBC that mitigate the race condition and leave the app in a running state.

      Michael Nitschinger added a comment - 22/May/13 3:45 AM

      SPY change: http://review.couchbase.org/#/c/26464
      corresponding JCBC change: http://review.couchbase.org/#/c/26465/

      If code review is good, I can provide a build to give it to the customers to verify. To me, the change looks pretty low impact.

      Show
      ingenthr Matt Ingenthron added a comment - Comments from Michael: I found that the keys() method around the selectors and the key set is not thread safe . My thought is that we should only access it from the IO thread but obviously there is something wrong here. I'll trace this back and report findings. Michael Nitschinger added a comment - 22/May/13 2:39 AM - edited Okay, I think I now whats going on here: In this section, we check if lots of timeouts occured on one of the nodes and if they do, they get reconnected. The code iterates over the selectors, which contain the nodes as attachments. Now we reconnect in there and also notify observers, maybe you see where this goes: my impression is that we modify the selection key set (adding new nodes?) while the code still iterates and does some work on the selectors. The easiest fix is to wrap it and check for this exception and if it happens just continue. I want to determine the root cause though, maybe we have a flaw in there that also affects the other CBSE. If you look at the selector management inside netty, they also seem to deal with this commonly (and we dont, currently). Like https://github.com/netty/netty/blob/master/transport/src/main/java/io/netty/channel/nio/NioEventLoop.java#L254 and https://github.com/netty/netty/blob/master/transport/src/main/java/io/netty/channel/nio/NioEventLoop.java#L407 It may has to do with their huge timeout increase so they queue up (and thats why we havent seen it somewhere else reported). I'll write up a test script with their params to see if I can repro it locally. Michael Nitschinger added a comment - 22/May/13 3:26 AM Ha - i got it tracked down once - but I have no idea how.. didnt get to get it again. Captured the strack trace though 2013-05-22 11:50:27.336 INFO com.couchbase.client.CouchbaseConnection: Connection state changed for sun.nio.ch.SelectionKeyImpl@6766afb3 2013-05-22 11:50:27.360 INFO com.couchbase.client.ViewConnection: Added 192.168.56.101 to connect queue 2013-05-22 11:50:27.361 INFO com.couchbase.client.CouchbaseClient: viewmode property isn't defined. Setting viewmode to production mode 2013-05-22 11:51:46.988 INFO com.couchbase.client.CouchbaseConnection: Added {QA sa=/192.168.56.102:11210, #Rops=0, #Wops=0, #iq=0, topRop=null, topWop=null, toWrite=0, interested=0} to connect queue Exception in thread "Memcached IO over {MemcachedConnection to 192.168.56.101/192.168.56.101:11210} " java.util.ConcurrentModificationException at java.util.HashMap$HashIterator.nextEntry(HashMap.java:793) at java.util.HashMap$KeyIterator.next(HashMap.java:828) at java.util.Collections$UnmodifiableCollection$1.next(Collections.java:1008) at net.spy.memcached.MemcachedConnection.handleIO(MemcachedConnection.java:257) at com.couchbase.client.CouchbaseConnection.run(CouchbaseConnection.java:265) 2013-05-22 11:51:48.337 INFO com.couchbase.client.ViewConnection: Added 192.168.56.102 to connect queue Exception in thread "main" net.spy.memcached.OperationTimeoutException: Timeout waiting for value at net.spy.memcached.MemcachedClient.get(MemcachedClient.java:1036) at net.spy.memcached.MemcachedClient.get(MemcachedClient.java:1051) at MainApp.main(MainApp.java:14) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at com.intellij.rt.execution.application.AppMain.main(AppMain.java:120) Caused by: net.spy.memcached.internal.CheckedOperationTimeoutException: Timed out waiting for operation - failing node: 192.168.56.101/192.168.56.101:11210 at net.spy.memcached.internal.OperationFuture.get(OperationFuture.java:160) at net.spy.memcached.internal.GetFuture.get(GetFuture.java:62) at net.spy.memcached.MemcachedClient.get(MemcachedClient.java:1026) ... 7 more I'll propose some fixes on both spy and CBC that mitigate the race condition and leave the app in a running state. Michael Nitschinger added a comment - 22/May/13 3:45 AM SPY change: http://review.couchbase.org/#/c/26464 corresponding JCBC change: http://review.couchbase.org/#/c/26465/ If code review is good, I can provide a build to give it to the customers to verify. To me, the change looks pretty low impact.
      Hide
      ingenthr Matt Ingenthron added a comment -

      Note that a ConcurrentModificationException can also happen if the same thread is trying to access a structure twice, say iterating and modifying.

      Given that this is a rare case, should we just be synchronized either on the node or the block? I looked at the code and I'm not 100% certain of the best way to handle it.

      Looking a bit further, during this iteration a bunch of different things can mark the node as disconnected. I don't see that in the stack above. But, it does make me wonder if going synchronized on the node may address the issue.

      Can you re-review with these comments in mind?

      Show
      ingenthr Matt Ingenthron added a comment - Note that a ConcurrentModificationException can also happen if the same thread is trying to access a structure twice, say iterating and modifying. Given that this is a rare case, should we just be synchronized either on the node or the block? I looked at the code and I'm not 100% certain of the best way to handle it. Looking a bit further, during this iteration a bunch of different things can mark the node as disconnected. I don't see that in the stack above. But, it does make me wonder if going synchronized on the node may address the issue. Can you re-review with these comments in mind?
      Hide
      daschl Michael Nitschinger added a comment -

      Merged, will be available in 2.9.0

      Show
      daschl Michael Nitschinger added a comment - Merged, will be available in 2.9.0

        People

        • Assignee:
          daschl Michael Nitschinger
          Reporter:
          daschl Michael Nitschinger
        • Votes:
          0 Vote for this issue
          Watchers:
          2 Start watching this issue

          Dates

          • Created:
            Updated:
            Resolved:

            Gerrit Reviews

            There are no open Gerrit changes