Details
-
Bug
-
Resolution: Fixed
-
Critical
-
5.5.0
-
5.5.0-2814
-
Untriaged
-
-
Unknown
-
CX Sprint 106
Description
1. We have 2 analytics nodes and 2 KV nodes in the cluster.
2. Loading 600 items per second in 4 CB buckets with expiry set as 3600s. With these settings there won't by more that 1.5M items in any bucket at any given point of time once doc count reach to that lever gradually.
3. 240 queries are shot on the analytics nodes and ensured that there will always be 240 queries running/queued on analytics.
4. After couple of days of this run all the queries started timing out after 300s which were passing earlier.
Lots of HYracks exception can be seen in the logs.
5. Flush operation also performed suspecting that queries might be scanning a lot of data and hence timing out but after flushing also none os the query were succeeding.
|
2018-06-03T09:16:49.008-07:00 WARN CBAS.buffercache.BufferCache [Executor-82328:20b52be6564d5ba1fdc5a30027a20449] Failure while trying to read a page from disk
|
org.apache.hyracks.api.exceptions.HyracksDataException: java.lang.InterruptedException
|
at org.apache.hyracks.api.exceptions.HyracksDataException.create(HyracksDataException.java:51) ~[hyracks-api.jar:5.5.0-2814]
|
at org.apache.hyracks.control.nc.io.IoRequest.queue(IoRequest.java:105) ~[hyracks-control-nc.jar:5.5.0-2814]
|
at org.apache.hyracks.control.nc.io.IoRequest.read(IoRequest.java:76) ~[hyracks-control-nc.jar:5.5.0-2814]
|
at org.apache.hyracks.control.nc.io.IOManager.asyncRead(IOManager.java:317) ~[hyracks-control-nc.jar:5.5.0-2814]
|
at org.apache.hyracks.control.nc.io.IOManager.syncRead(IOManager.java:248) ~[hyracks-control-nc.jar:5.5.0-2814]
|
at org.apache.hyracks.storage.common.buffercache.BufferCache.read(BufferCache.java:548) ~[hyracks-storage-common.jar:5.5.0-2814]
|
at org.apache.hyracks.storage.common.buffercache.BufferCache.tryRead(BufferCache.java:519) ~[hyracks-storage-common.jar:5.5.0-2814]
|
at org.apache.hyracks.storage.common.buffercache.BufferCache.pin(BufferCache.java:190) ~[hyracks-storage-common.jar:5.5.0-2814]
|
at org.apache.hyracks.storage.am.btree.impls.DiskBTreeRangeSearchCursor.acquirePage(DiskBTreeRangeSearchCursor.java:98) ~[hyracks-storage-am-btree.jar:5.5.0-2814]
|
at org.apache.hyracks.storage.am.btree.impls.BTreeRangeSearchCursor.fetchNextLeafPage(BTreeRangeSearchCursor.java:104) ~[hyracks-storage-am-btree.jar:5.5.0-2814]
|
at org.apache.hyracks.storage.am.btree.impls.DiskBTreeRangeSearchCursor.doHasNext(DiskBTreeRangeSearchCursor.java:45) ~[hyracks-storage-am-btree.jar:5.5.0-2814]
|
at org.apache.hyracks.storage.common.EnforcedIndexCursor.hasNext(EnforcedIndexCursor.java:69) ~[hyracks-storage-common.jar:5.5.0-2814]
|
at org.apache.hyracks.storage.am.lsm.common.impls.LSMIndexSearchCursor.pushIntoQueueFromCursorAndReplaceThisElement(LSMIndexSearchCursor.java:189) ~[hyracks-storage-am-lsm-common.jar:5.5.0-2814]
|
at org.apache.hyracks.storage.am.lsm.btree.impls.LSMBTreeRangeSearchCursor.pushOutputElementIntoQueueIfNeeded(LSMBTreeRangeSearchCursor.java:198) ~[hyracks-storage-am-lsm-btree.jar:5.5.0-2814]
|
at org.apache.hyracks.storage.am.lsm.btree.impls.LSMBTreeRangeSearchCursor.checkPriorityQueue(LSMBTreeRangeSearchCursor.java:180) ~[hyracks-storage-am-lsm-btree.jar:5.5.0-2814]
|
at org.apache.hyracks.storage.am.lsm.common.impls.LSMIndexSearchCursor.doHasNext(LSMIndexSearchCursor.java:139) ~[hyracks-storage-am-lsm-common.jar:5.5.0-2814]
|
at org.apache.hyracks.storage.common.EnforcedIndexCursor.hasNext(EnforcedIndexCursor.java:69) ~[hyracks-storage-common.jar:5.5.0-2814]
|
at org.apache.hyracks.storage.am.lsm.btree.impls.LSMBTreeSearchCursor.doHasNext(LSMBTreeSearchCursor.java:60) ~[hyracks-storage-am-lsm-btree.jar:5.5.0-2814]
|
at org.apache.hyracks.storage.common.EnforcedIndexCursor.hasNext(EnforcedIndexCursor.java:69) ~[hyracks-storage-common.jar:5.5.0-2814]
|
at org.apache.hyracks.storage.am.common.dataflow.IndexSearchOperatorNodePushable.writeSearchResults(IndexSearchOperatorNodePushable.java:203) ~[hyracks-storage-am-common.jar:5.5.0-2814]
|
at org.apache.hyracks.storage.am.common.dataflow.IndexSearchOperatorNodePushable.nextFrame(IndexSearchOperatorNodePushable.java:244) ~[hyracks-storage-am-common.jar:5.5.0-2814]
|
at org.apache.hyracks.storage.am.common.dataflow.IndexSearchOperatorNodePushable.nextFrame(IndexSearchOperatorNodePushable.java:244) ~[hyracks-storage-am-common.jar:5.5.0-2814]
|
at org.apache.hyracks.dataflow.common.comm.io.AbstractFrameAppender.write(AbstractFrameAppender.java:93) ~[hyracks-dataflow-common.jar:5.5.0-2814]
|
at org.apache.hyracks.algebricks.runtime.operators.std.EmptyTupleSourceRuntimeFactory$1.open(EmptyTupleSourceRuntimeFactory.java:55) ~[algebricks-runtime.jar:5.5.0-2814]
|
at org.apache.hyracks.algebricks.runtime.operators.meta.AlgebricksMetaOperatorDescriptor$SourcePushRuntime.initialize(AlgebricksMetaOperatorDescriptor.java:105) ~[algebricks-runtime.jar:5.5.0-2814]
|
at org.apache.hyracks.api.rewriter.runtime.SuperActivityOperatorNodePushable.lambda$runInParallel$0(SuperActivityOperatorNodePushable.java:208) ~[hyracks-api.jar:5.5.0-2814]
|
at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_162]
|
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_162]
|
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_162]
|
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_162]
|
Caused by: java.lang.InterruptedException
|
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1220) ~[?:1.8.0_162]
|
at java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:335) ~[?:1.8.0_162]
|
at java.util.concurrent.ArrayBlockingQueue.put(ArrayBlockingQueue.java:350) ~[?:1.8.0_162]
|
at org.apache.hyracks.control.nc.io.IoRequest.queue(IoRequest.java:103) ~[hyracks-control-nc.jar:5.5.0-2814]
|
... 27 more
|
2018-06-03T09:16:49.639-07:00 INFO CBAS.dataset.DatasetPartitionWriter [Executor-82328:20b52be6564d5ba1fdc5a30027a20449] close(0)
|
2018-06-03T09:16:49.639-07:00 WARN CBAS.nc.Task [org.apache.hyracks.api.rewriter.runtime.SuperActivity:TAID:TID:ANID:ODID:3:0:0:0:0] Task TAID:TID:ANID:ODID:3:0:0:0 failed with exception
|
6. Again the doc count reach to 1.5M in all buckets and during that phase queries again started failing.
7. All buckets flushed again and queries are now passing.