Details
-
Bug
-
Resolution: Fixed
-
Test Blocker
-
6.0.0
-
centos1 cluster
-
Untriaged
-
-
Unknown
-
CX Sprint 120
Description
Build : 6.0.0-1643
Test: -test tests/integration/test_allFeatures_alice_timers.yml -scope tests/integration/scope_Xattrs_Alice.yml
Scale: 2
Iteration: 5th
One Analytics node - 172.23.96.56 - is not responding on port 8095. The analytics logs is filled up with errors like the following.
2018-09-19T21:28:34.361-07:00 ERRO CBAS.lifecycle.LifeCycleComponentManager [ShutdownHook-d99d0d419ebdc3e457294855c852a996] Lifecycle management not configured org.apache.hyracks.api.lifecycle.LifeCycleComponentManager@23a873ff
|
2018-09-19T21:28:47.553-07:00 INFO CBAS.bootstrap.AnalyticsNCApplication [main] Starting Couchbase Analytics driver, version: {"build.date":"Fri Sep 14 13:56:39 PDT 2018","build.number":"1643","build.version":"6.0.0-1643"}
|
2018-09-19T21:28:48.581-07:00 ERRO CBAS.message.RegistrationTasksResponseMessage [Executor-10:d99d0d419ebdc3e457294855c852a996] Failed during startup task
|
org.apache.hyracks.api.exceptions.HyracksDataException: java.lang.IllegalStateException: Illegal state of component Id. Max disk component Id [1,8] should be less than redo flush component Id [1,1]
|
at org.apache.hyracks.api.exceptions.HyracksDataException.create(HyracksDataException.java:51) ~[hyracks-api.jar:6.0.0-1643]
|
at org.apache.asterix.app.nc.task.MetadataBootstrapTask.perform(MetadataBootstrapTask.java:45) ~[asterix-app.jar:6.0.0-1643]
|
at org.apache.asterix.app.replication.message.RegistrationTasksResponseMessage.handle(RegistrationTasksResponseMessage.java:63) ~[asterix-app.jar:6.0.0-1643]
|
at org.apache.asterix.messaging.NCMessageBroker.lambda$receivedMessage$0(NCMessageBroker.java:100) ~[asterix-app.jar:6.0.0-1643]
|
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_181]
|
at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_181]
|
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_181]
|
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_181]
|
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_181]
|
Caused by: java.lang.IllegalStateException: Illegal state of component Id. Max disk component Id [1,8] should be less than redo flush component Id [1,1]
|
at org.apache.asterix.app.nc.RecoveryManager.redoFlush(RecoveryManager.java:839) ~[asterix-app.jar:6.0.0-1643]
|
at org.apache.asterix.app.nc.RecoveryManager.startRecoveryRedoPhase(RecoveryManager.java:412) ~[asterix-app.jar:6.0.0-1643]
|
at org.apache.asterix.app.nc.RecoveryManager.replayPartitionsLogs(RecoveryManager.java:185) ~[asterix-app.jar:6.0.0-1643]
|
at org.apache.asterix.app.nc.RecoveryManager.replayReplicaPartitionLogs(RecoveryManager.java:529) ~[asterix-app.jar:6.0.0-1643]
|
at org.apache.asterix.app.nc.ReplicaManager.promote(ReplicaManager.java:120) ~[asterix-app.jar:6.0.0-1643]
|
at org.apache.asterix.app.nc.task.MetadataBootstrapTask.perform(MetadataBootstrapTask.java:41) ~[asterix-app.jar:6.0.0-1643]
|
... 7 more
|
2018-09-19T21:28:48.591-07:00 ERRO CBAS.replication.NcLifecycleCoordinator [Executor-2:ClusterController] d99d0d419ebdc3e457294855c852a996 failed to complete startup.
|
org.apache.hyracks.api.exceptions.HyracksDataException: java.lang.IllegalStateException: Illegal state of component Id. Max disk component Id [1,8] should be less than redo flush component Id [1,1]
|
at org.apache.hyracks.api.exceptions.HyracksDataException.create(HyracksDataException.java:51) ~[hyracks-api.jar:6.0.0-1643]
|
at org.apache.asterix.app.nc.task.MetadataBootstrapTask.perform(MetadataBootstrapTask.java:45) ~[asterix-app.jar:6.0.0-1643]
|
at org.apache.asterix.app.replication.message.RegistrationTasksResponseMessage.handle(RegistrationTasksResponseMessage.java:63) ~[asterix-app.jar:6.0.0-1643]
|
at org.apache.asterix.messaging.NCMessageBroker.lambda$receivedMessage$0(NCMessageBroker.java:100) ~[asterix-app.jar:6.0.0-1643]
|
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[?:1.8.0_181]
|
at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_181]
|
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_181]
|
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_181]
|
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_181]
|
Caused by: java.lang.IllegalStateException: Illegal state of component Id. Max disk component Id [1,8] should be less than redo flush component Id [1,1]
|
at org.apache.asterix.app.nc.RecoveryManager.redoFlush(RecoveryManager.java:839) ~[asterix-app.jar:6.0.0-1643]
|
at org.apache.asterix.app.nc.RecoveryManager.startRecoveryRedoPhase(RecoveryManager.java:412) ~[asterix-app.jar:6.0.0-1643]
|
at org.apache.asterix.app.nc.RecoveryManager.replayPartitionsLogs(RecoveryManager.java:185) ~[asterix-app.jar:6.0.0-1643]
|
at org.apache.asterix.app.nc.RecoveryManager.replayReplicaPartitionLogs(RecoveryManager.java:529) ~[asterix-app.jar:6.0.0-1643]
|
at org.apache.asterix.app.nc.ReplicaManager.promote(ReplicaManager.java:120) ~[asterix-app.jar:6.0.0-1643]
|
at org.apache.asterix.app.nc.task.MetadataBootstrapTask.perform(MetadataBootstrapTask.java:41) ~[asterix-app.jar:6.0.0-1643]
|
... 7 more
|
2018-09-19T21:28:48.653-07:00 INFO CBAS.bootstrap.AnalyticsNCApplication [ShutdownHook-d99d0d419ebdc3e457294855c852a996] Stopping Couchbase Analytics driver
|
2018-09-19T21:28:50.828-07:00 ERRO CBAS.lifecycle.LifeCycleComponentManager [ShutdownHook-d99d0d419ebdc3e457294855c852a996] Lifecycle management not configured org.apache.hyracks.api.lifecycle.LifeCycleComponentManager@68d04fa4
|
As a result, rebalances are failing.
[ns_server:error,2018-09-19T18:30:24.224-07:00,ns_1@172.23.104.61:service_rebalancer-cbas<0.22731.2967>:service_agent:process_bad_results:810]Service call unset_rebalancer (service cbas) failed on some nodes:
|
[{'ns_1@172.23.108.104',nack}]
|
[user:error,2018-09-19T18:30:24.225-07:00,ns_1@172.23.104.61:<0.1230.623>:ns_orchestrator:do_log_rebalance_completion:1117]Rebalance exited with reason {service_rebalance_failed,cbas,
|
{rebalance_failed,
|
{service_error,
|
<<"Rebalance 7aedd5386e71f7a233ddde48352405ca failed: see analytics.log for details">>}}}
|
[ns_server:error,2018-09-19T19:01:19.041-07:00,ns_1@172.23.104.61:service_rebalancer-cbas<0.30457.2983>:service_agent:process_bad_results:810]Service call unset_rebalancer (service cbas) failed on some nodes:
|
[{'ns_1@172.23.108.104',nack}]
|
[user:error,2018-09-19T19:01:19.041-07:00,ns_1@172.23.104.61:<0.1230.623>:ns_orchestrator:do_log_rebalance_completion:1117]Rebalance exited with reason {service_rebalance_failed,cbas,
|
{rebalance_failed,
|
{service_error,
|
<<"Rebalance be8cd89617ddf71e96f79f24674735d7 failed: see analytics.log for details">>}}}
|
And also analytics queries and DDL statements are erroring out. Marking this as a blocker since we are blocked on the longevity test.
Is this a regression : Not sure. This test never ran for 4+ days, hence we never saw this issue.
Looks like this all started on 172.23.96.56 after another analytics node 172.23.96.148 was rebalanced out at 2018-09-19T14:47:14.
2018-09-19T14:47:14.702-07:00 ERRO CBAS.tcp.TCPEndpoint [TCPEndpoint IO Thread [/0.0.0.0:9117]] Unexpected tcp io error in connection TCPConnection[Remote Address: /172.23.96.148:48310 Local Address: /0.0.0.0:9117]
|
org.apache.hyracks.api.exceptions.NetException: Socket Closed
|
at org.apache.hyracks.net.protocols.muxdemux.MultiplexedConnection.driveReaderStateMachine(MultiplexedConnection.java:342) ~[hyracks-net.jar:6.0.0-1643]
|
at org.apache.hyracks.net.protocols.muxdemux.MultiplexedConnection.notifyIOReady(MultiplexedConnection.java:113) ~[hyracks-net.jar:6.0.0-1643]
|
at org.apache.hyracks.net.protocols.tcp.TCPEndpoint$IOThread.run(TCPEndpoint.java:186) [hyracks-net.jar:6.0.0-1643]
|
2018-09-19T14:47:14.702-07:00 ERRO CBAS.tcp.TCPEndpoint [TCPEndpoint IO Thread [null]] Unexpected tcp io error in connection TCPConnection[Remote Address: /172.23.96.148:9117 Local Address: null]
|
org.apache.hyracks.api.exceptions.NetException: Socket Closed
|
at org.apache.hyracks.net.protocols.muxdemux.MultiplexedConnection.driveReaderStateMachine(MultiplexedConnection.java:342) ~[hyracks-net.jar:6.0.0-1643]
|
at org.apache.hyracks.net.protocols.muxdemux.MultiplexedConnection.notifyIOReady(MultiplexedConnection.java:113) ~[hyracks-net.jar:6.0.0-1643]
|
at org.apache.hyracks.net.protocols.tcp.TCPEndpoint$IOThread.run(TCPEndpoint.java:186) [hyracks-net.jar:6.0.0-1643]
|
2018-09-19T14:47:14.702-07:00 ERRO CBAS.tcp.TCPEndpoint [TCPEndpoint IO Thread [/0.0.0.0:9116]] Unexpected tcp io error in connection TCPConnection[Remote Address: /172.23.96.148:33884 Local Address: /0.0.0.0:9116]
|
org.apache.hyracks.api.exceptions.NetException: Socket Closed
|
at org.apache.hyracks.net.protocols.muxdemux.MultiplexedConnection.driveReaderStateMachine(MultiplexedConnection.java:342) ~[hyracks-net.jar:6.0.0-1643]
|
at org.apache.hyracks.net.protocols.muxdemux.MultiplexedConnection.notifyIOReady(MultiplexedConnection.java:113) ~[hyracks-net.jar:6.0.0-1643]
|
at org.apache.hyracks.net.protocols.tcp.TCPEndpoint$IOThread.run(TCPEndpoint.java:186) [hyracks-net.jar:6.0.0-1643]
|
2018-09-19T14:47:14.703-07:00 ERRO CBAS.tcp.TCPEndpoint [TCPEndpoint IO Thread [/0.0.0.0:9116]] Unexpected tcp io error in connection TCPConnection[Remote Address: /172.23.96.148:9116 Local Address: /0.0.0.0:9116]
|
org.apache.hyracks.api.exceptions.NetException: Socket Closed
|
at org.apache.hyracks.net.protocols.muxdemux.MultiplexedConnection.driveReaderStateMachine(MultiplexedConnection.java:342) ~[hyracks-net.jar:6.0.0-1643]
|
at org.apache.hyracks.net.protocols.muxdemux.MultiplexedConnection.notifyIOReady(MultiplexedConnection.java:113) ~[hyracks-net.jar:6.0.0-1643]
|
at org.apache.hyracks.net.protocols.tcp.TCPEndpoint$IOThread.run(TCPEndpoint.java:186) [hyracks-net.jar:6.0.0-1643]
|
Attachments
Issue Links
- links to