Java Couchbase JVM Core
JVMCBC-445

ArrayIndexOutOfBoundsException in PooledService#sendFlush (concurrent access on list)


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.5.0, 1.4.8
    • Component/s: Core
    • Labels: None

    Description

      Hey,

      PooledService#sendFlush iterates over "endpoints" by index. But it can happen, concurrently, that someone clears/changes "endpoints". While changes are guarded via epMutex, reading in sendFlush is not.

      IMHO, the fix is to make a copy of endpoints (under synchronized(epMutex)) and then iterate over the copy to send the signal.
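      The copy-under-mutex approach proposed above can be sketched in plain JDK code. This is a minimal, self-contained illustration, not the actual PooledService source: Endpoint, sendFlushSignal, and addEndpoint are stand-in names.

      ```java
      import java.util.ArrayList;
      import java.util.List;
      import java.util.concurrent.CopyOnWriteArrayList;

      public class FlushSketch {
          // Stand-in for the SDK's endpoint type.
          interface Endpoint { void sendFlushSignal(); }

          private final List<Endpoint> endpoints = new CopyOnWriteArrayList<>();
          private final Object epMutex = new Object();

          void addEndpoint(Endpoint ep) {
              synchronized (epMutex) { endpoints.add(ep); }
          }

          // Copy under epMutex, then iterate the copy: the snapshot cannot
          // change underneath us, so no index can go stale between size()
          // and get(i).
          void sendFlush() {
              final List<Endpoint> snapshot;
              synchronized (epMutex) {
                  snapshot = new ArrayList<>(endpoints);
              }
              for (Endpoint ep : snapshot) {
                  ep.sendFlushSignal();
              }
          }

          public static void main(String[] args) {
              FlushSketch s = new FlushSketch();
              int[] flushed = {0};
              s.addEndpoint(() -> flushed[0]++);
              s.addEndpoint(() -> flushed[0]++);
              s.sendFlush();
              System.out.println("flushed " + flushed[0] + " endpoints"); // flushed 2 endpoints
          }
      }
      ```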

      This is the stacktrace:

      2017-08-07 16:56:10.479 [cb-core-3-2] WARN com.couchbase.client.core.CouchbaseCore - Exception while Handling Request Events RequestEvent{request=null}
      java.lang.ArrayIndexOutOfBoundsException: 0
      at java.util.concurrent.CopyOnWriteArrayList.get(CopyOnWriteArrayList.java:387)
      at java.util.concurrent.CopyOnWriteArrayList.get(CopyOnWriteArrayList.java:396)
      at com.couchbase.client.core.service.PooledService.sendFlush(PooledService.java:409)
      at com.couchbase.client.core.service.PooledService.send(PooledService.java:315)
      at com.couchbase.client.core.node.CouchbaseNode.send(CouchbaseNode.java:183)
      at com.couchbase.client.core.RequestHandler.flush(RequestHandler.java:211)
      at com.couchbase.client.core.RequestHandler.onEvent(RequestHandler.java:201)
      at com.couchbase.client.core.RequestHandler.onEvent(RequestHandler.java:73)
      at com.couchbase.client.deps.com.lmax.disruptor.BatchEventProcessor.run(BatchEventProcessor.java:129)
      at com.couchbase.client.deps.io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
      at java.lang.Thread.run(Thread.java:745)
      

      Attachments


        Activity

          okr2014 okr2014 added a comment - edited

          The issue is basically that, at some point in time, the size is determined from the endpoints list. But when the endpoints list is later accessed again by index, it might have changed in the meantime. So the whole purpose of CopyOnWriteArrayList is defeated.

          Another solution, without a copy/mutex, is to simply use the Iterator from endpoints. It operates on a stable snapshot of the CopyOnWriteArrayList.
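          The difference between index-based access and the snapshot iterator can be demonstrated with plain JDK code (no Couchbase types involved): get(i) re-reads the current backing array, while iterator() pins the array that existed when it was called.

          ```java
          import java.util.Iterator;
          import java.util.concurrent.CopyOnWriteArrayList;

          public class SnapshotDemo {
              public static void main(String[] args) {
                  CopyOnWriteArrayList<String> endpoints = new CopyOnWriteArrayList<>();
                  endpoints.add("ep-1");
                  endpoints.add("ep-2");

                  // Index-based access: size() and get(i) each read the backing
                  // array independently, so a concurrent clear() in between
                  // produces exactly the reported ArrayIndexOutOfBoundsException.
                  int size = endpoints.size();   // 2
                  endpoints.clear();             // simulate a concurrent modification
                  try {
                      for (int i = 0; i < size; i++) {
                          endpoints.get(i);      // backing array is now empty
                      }
                  } catch (ArrayIndexOutOfBoundsException expected) {
                      System.out.println("index-based loop failed: " + expected);
                  }

                  // The iterator is pinned to the array snapshot taken when
                  // iterator() was called; later mutations cannot affect it.
                  endpoints.add("ep-1");
                  endpoints.add("ep-2");
                  Iterator<String> it = endpoints.iterator(); // snapshot of [ep-1, ep-2]
                  endpoints.clear();                          // does not affect 'it'
                  int seen = 0;
                  while (it.hasNext()) {
                      it.next();
                      seen++;
                  }
                  System.out.println("iterator saw " + seen + " endpoints"); // iterator saw 2 endpoints
              }
          }
          ```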


          daschl Michael Nitschinger added a comment -

          Thanks for that issue; I agree this is a bug.

          By the way, how did you come across this in practice? Curious how to reproduce...
          okr2014 okr2014 added a comment -

          Accidentally. I think it was either a shutdown or a startup, and I just got curious about the exception in the logfile.

          I have 3 servers, each running a Couchbase instance. I fill Couchbase with binary data and a replication factor of 2, so that each instance gets a copy. I assume (I never validated that assumption) that the client uses the shortest path to get the data (locally), and in case one instance goes down, it can still get the data remotely.

          When setting up the DefaultCouchbaseEnvironment via the builder, I set 5 queryEndpoints (connections per node) and a response buffer size of 30 MB. I also set a custom transcoder for the binary data when opening the bucket.

          What I store as binary data is a bigger object that is split up into chunks, because Couchbase values are limited in size.

          When I query for that bigger object, I actually fire async queries for the chunks in parallel, consuming the returned results immediately to build the bigger object and return it as a whole once all queries have completed. Each query also uses a RetryWhenFunction for TemporaryFailure and Backpressure.

          I query a lot of these bigger objects in parallel, so overall the system can become quite busy (a lot of backpressure). And then, I think, I shut down or started up the system; I am not sure.

          That is basically it. I hope it helps.


          daschl Michael Nitschinger added a comment -

          Yep, I think the best solution is to just use the stable iterator from the copy-on-write list. I'll get this up.
          daschl Michael Nitschinger added a comment - http://review.couchbase.org/#/c/82036

          People

            Assignee: daschl Michael Nitschinger
            Reporter: okr2014
            Votes: 0
            Watchers: 2

            Dates

              Created:
              Updated:
              Resolved:

              Gerrit Reviews

                There are no open Gerrit changes
