Couchbase Server / MB-31355

[System Test] One Analytics node not responding. Has "Failed during startup task" errors


Details

    • Type: Bug
    • Status: Closed
    • Priority: Test Blocker
    • Resolution: Fixed
    • Affects Version: 6.0.0
    • Fix Version: 6.0.0
    • Component: analytics
    • Environment: centos1 cluster

    Description

      Build : 6.0.0-1643
      Test: -test tests/integration/test_allFeatures_alice_timers.yml -scope tests/integration/scope_Xattrs_Alice.yml
      Scale: 2
      Iteration: 5th

      One Analytics node - 172.23.96.56 - is not responding on port 8095. The analytics log is filled with errors like the following.

      2018-09-19T21:28:34.361-07:00 ERRO CBAS.lifecycle.LifeCycleComponentManager [ShutdownHook-d99d0d419ebdc3e457294855c852a996] Lifecycle management not configured org.apache.hyracks.api.lifecycle.LifeCycleComponentManager@23a873ff
      2018-09-19T21:28:47.553-07:00 INFO CBAS.bootstrap.AnalyticsNCApplication [main] Starting Couchbase Analytics driver, version: {"build.date":"Fri Sep 14 13:56:39 PDT 2018","build.number":"1643","build.version":"6.0.0-1643"}
      2018-09-19T21:28:48.581-07:00 ERRO CBAS.message.RegistrationTasksResponseMessage [Executor-10:d99d0d419ebdc3e457294855c852a996] Failed during startup task
      org.apache.hyracks.api.exceptions.HyracksDataException: java.lang.IllegalStateException: Illegal state of component Id. Max disk component Id [1,8] should be less than redo flush component Id [1,1]
      	at org.apache.hyracks.api.exceptions.HyracksDataException.create(HyracksDataException.java:51) ~[hyracks-api.jar:6.0.0-1643]
      	at org.apache.asterix.app.nc.task.MetadataBootstrapTask.perform(MetadataBootstrapTask.java:45) ~[asterix-app.jar:6.0.0-1643]
      	at org.apache.asterix.app.replication.message.RegistrationTasksResponseMessage.handle(RegistrationTasksResponseMessage.java:63) ~[asterix-app.jar:6.0.0-1643]
      	at org.apache.asterix.messaging.NCMessageBroker.lambda$receivedMessage$0(NCMessageBroker.java:100) ~[asterix-app.jar:6.0.0-1643]
      	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_181]
      	at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_181]
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_181]
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_181]
      	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_181]
      Caused by: java.lang.IllegalStateException: Illegal state of component Id. Max disk component Id [1,8] should be less than redo flush component Id [1,1]
      	at org.apache.asterix.app.nc.RecoveryManager.redoFlush(RecoveryManager.java:839) ~[asterix-app.jar:6.0.0-1643]
      	at org.apache.asterix.app.nc.RecoveryManager.startRecoveryRedoPhase(RecoveryManager.java:412) ~[asterix-app.jar:6.0.0-1643]
      	at org.apache.asterix.app.nc.RecoveryManager.replayPartitionsLogs(RecoveryManager.java:185) ~[asterix-app.jar:6.0.0-1643]
      	at org.apache.asterix.app.nc.RecoveryManager.replayReplicaPartitionLogs(RecoveryManager.java:529) ~[asterix-app.jar:6.0.0-1643]
      	at org.apache.asterix.app.nc.ReplicaManager.promote(ReplicaManager.java:120) ~[asterix-app.jar:6.0.0-1643]
      	at org.apache.asterix.app.nc.task.MetadataBootstrapTask.perform(MetadataBootstrapTask.java:41) ~[asterix-app.jar:6.0.0-1643]
      	... 7 more
      2018-09-19T21:28:48.591-07:00 ERRO CBAS.replication.NcLifecycleCoordinator [Executor-2:ClusterController] d99d0d419ebdc3e457294855c852a996 failed to complete startup. 
      org.apache.hyracks.api.exceptions.HyracksDataException: java.lang.IllegalStateException: Illegal state of component Id. Max disk component Id [1,8] should be less than redo flush component Id [1,1]
      	at org.apache.hyracks.api.exceptions.HyracksDataException.create(HyracksDataException.java:51) ~[hyracks-api.jar:6.0.0-1643]
      	at org.apache.asterix.app.nc.task.MetadataBootstrapTask.perform(MetadataBootstrapTask.java:45) ~[asterix-app.jar:6.0.0-1643]
      	at org.apache.asterix.app.replication.message.RegistrationTasksResponseMessage.handle(RegistrationTasksResponseMessage.java:63) ~[asterix-app.jar:6.0.0-1643]
      	at org.apache.asterix.messaging.NCMessageBroker.lambda$receivedMessage$0(NCMessageBroker.java:100) ~[asterix-app.jar:6.0.0-1643]
      	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[?:1.8.0_181]
      	at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_181]
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_181]
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_181]
      	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_181]
      Caused by: java.lang.IllegalStateException: Illegal state of component Id. Max disk component Id [1,8] should be less than redo flush component Id [1,1]
      	at org.apache.asterix.app.nc.RecoveryManager.redoFlush(RecoveryManager.java:839) ~[asterix-app.jar:6.0.0-1643]
      	at org.apache.asterix.app.nc.RecoveryManager.startRecoveryRedoPhase(RecoveryManager.java:412) ~[asterix-app.jar:6.0.0-1643]
      	at org.apache.asterix.app.nc.RecoveryManager.replayPartitionsLogs(RecoveryManager.java:185) ~[asterix-app.jar:6.0.0-1643]
      	at org.apache.asterix.app.nc.RecoveryManager.replayReplicaPartitionLogs(RecoveryManager.java:529) ~[asterix-app.jar:6.0.0-1643]
      	at org.apache.asterix.app.nc.ReplicaManager.promote(ReplicaManager.java:120) ~[asterix-app.jar:6.0.0-1643]
      	at org.apache.asterix.app.nc.task.MetadataBootstrapTask.perform(MetadataBootstrapTask.java:41) ~[asterix-app.jar:6.0.0-1643]
      	... 7 more
      2018-09-19T21:28:48.653-07:00 INFO CBAS.bootstrap.AnalyticsNCApplication [ShutdownHook-d99d0d419ebdc3e457294855c852a996] Stopping Couchbase Analytics driver
      2018-09-19T21:28:50.828-07:00 ERRO CBAS.lifecycle.LifeCycleComponentManager [ShutdownHook-d99d0d419ebdc3e457294855c852a996] Lifecycle management not configured org.apache.hyracks.api.lifecycle.LifeCycleComponentManager@68d04fa4
      
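      To clarify the failed check above: during the recovery redo phase, a FLUSH log record may only be redone if every LSM component already on disk has a smaller component id than the one the flush would produce; the log shows the on-disk state ([1,8]) ahead of the flush record ([1,1]), so the invariant trips. The following is a hypothetical, simplified sketch of that invariant (the real check lives in org.apache.asterix.app.nc.RecoveryManager.redoFlush; the ComponentId record and method shape here are illustrative, not the actual AsterixDB types):

      ```java
      // Simplified sketch of the redo-flush invariant seen in the log above.
      // NOT the real AsterixDB code: ComponentId and redoFlush are illustrative.
      public class RedoFlushInvariant {

          /** Stand-in for an LSM disk component id such as [1,8]. */
          record ComponentId(long minId, long maxId) implements Comparable<ComponentId> {
              @Override
              public int compareTo(ComponentId o) {
                  int c = Long.compare(minId, o.minId);
                  return c != 0 ? c : Long.compare(maxId, o.maxId);
              }
              @Override
              public String toString() {
                  return "[" + minId + "," + maxId + "]";
              }
          }

          /**
           * A FLUSH log record may only be redone if the largest component id
           * already on disk is smaller than the id the flush would create;
           * otherwise the on-disk state is ahead of the transaction log.
           */
          static void redoFlush(ComponentId maxDiskComponentId, ComponentId redoFlushComponentId) {
              if (maxDiskComponentId.compareTo(redoFlushComponentId) >= 0) {
                  throw new IllegalStateException("Illegal state of component Id. Max disk component Id "
                          + maxDiskComponentId + " should be less than redo flush component Id "
                          + redoFlushComponentId);
              }
              // ... otherwise, redo the flush ...
          }

          public static void main(String[] args) {
              // Reproduces the condition from the log: disk already holds [1,8],
              // but the log asks to redo a flush producing [1,1].
              try {
                  redoFlush(new ComponentId(1, 8), new ComponentId(1, 1));
              } catch (IllegalStateException e) {
                  System.out.println(e.getMessage());
              }
          }
      }
      ```

      Under this reading, the node hit the invariant while promoting a replica partition (ReplicaManager.promote in the trace) and replaying its logs, which would explain why the failure surfaced only after the rebalance described below.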

      As a result, rebalances are failing.

      [ns_server:error,2018-09-19T18:30:24.224-07:00,ns_1@172.23.104.61:service_rebalancer-cbas<0.22731.2967>:service_agent:process_bad_results:810]Service call unset_rebalancer (service cbas) failed on some nodes:
      [{'ns_1@172.23.108.104',nack}]
      [user:error,2018-09-19T18:30:24.225-07:00,ns_1@172.23.104.61:<0.1230.623>:ns_orchestrator:do_log_rebalance_completion:1117]Rebalance exited with reason {service_rebalance_failed,cbas,
                                    {rebalance_failed,
                                     {service_error,
                                      <<"Rebalance 7aedd5386e71f7a233ddde48352405ca failed: see analytics.log for details">>}}}
      [ns_server:error,2018-09-19T19:01:19.041-07:00,ns_1@172.23.104.61:service_rebalancer-cbas<0.30457.2983>:service_agent:process_bad_results:810]Service call unset_rebalancer (service cbas) failed on some nodes:
      [{'ns_1@172.23.108.104',nack}]
      [user:error,2018-09-19T19:01:19.041-07:00,ns_1@172.23.104.61:<0.1230.623>:ns_orchestrator:do_log_rebalance_completion:1117]Rebalance exited with reason {service_rebalance_failed,cbas,
                                    {rebalance_failed,
                                     {service_error,
                                      <<"Rebalance be8cd89617ddf71e96f79f24674735d7 failed: see analytics.log for details">>}}}
      

      Analytics queries and DDL statements are also erroring out. Marking this as a blocker since the longevity test is blocked.

      Is this a regression? Not sure. This test has never run for 4+ days before, so we never saw this issue.

      It looks like this all started on 172.23.96.56 after another Analytics node, 172.23.96.148, was rebalanced out at 2018-09-19T14:47:14.

      2018-09-19T14:47:14.702-07:00 ERRO CBAS.tcp.TCPEndpoint [TCPEndpoint IO Thread [/0.0.0.0:9117]] Unexpected tcp io error in connection TCPConnection[Remote Address: /172.23.96.148:48310 Local Address: /0.0.0.0:9117]
      org.apache.hyracks.api.exceptions.NetException: Socket Closed
              at org.apache.hyracks.net.protocols.muxdemux.MultiplexedConnection.driveReaderStateMachine(MultiplexedConnection.java:342) ~[hyracks-net.jar:6.0.0-1643]
              at org.apache.hyracks.net.protocols.muxdemux.MultiplexedConnection.notifyIOReady(MultiplexedConnection.java:113) ~[hyracks-net.jar:6.0.0-1643]
              at org.apache.hyracks.net.protocols.tcp.TCPEndpoint$IOThread.run(TCPEndpoint.java:186) [hyracks-net.jar:6.0.0-1643]
      2018-09-19T14:47:14.702-07:00 ERRO CBAS.tcp.TCPEndpoint [TCPEndpoint IO Thread [null]] Unexpected tcp io error in connection TCPConnection[Remote Address: /172.23.96.148:9117 Local Address: null]
      org.apache.hyracks.api.exceptions.NetException: Socket Closed
              at org.apache.hyracks.net.protocols.muxdemux.MultiplexedConnection.driveReaderStateMachine(MultiplexedConnection.java:342) ~[hyracks-net.jar:6.0.0-1643]
              at org.apache.hyracks.net.protocols.muxdemux.MultiplexedConnection.notifyIOReady(MultiplexedConnection.java:113) ~[hyracks-net.jar:6.0.0-1643]
              at org.apache.hyracks.net.protocols.tcp.TCPEndpoint$IOThread.run(TCPEndpoint.java:186) [hyracks-net.jar:6.0.0-1643]
      2018-09-19T14:47:14.702-07:00 ERRO CBAS.tcp.TCPEndpoint [TCPEndpoint IO Thread [/0.0.0.0:9116]] Unexpected tcp io error in connection TCPConnection[Remote Address: /172.23.96.148:33884 Local Address: /0.0.0.0:9116]
      org.apache.hyracks.api.exceptions.NetException: Socket Closed
              at org.apache.hyracks.net.protocols.muxdemux.MultiplexedConnection.driveReaderStateMachine(MultiplexedConnection.java:342) ~[hyracks-net.jar:6.0.0-1643]
              at org.apache.hyracks.net.protocols.muxdemux.MultiplexedConnection.notifyIOReady(MultiplexedConnection.java:113) ~[hyracks-net.jar:6.0.0-1643]
              at org.apache.hyracks.net.protocols.tcp.TCPEndpoint$IOThread.run(TCPEndpoint.java:186) [hyracks-net.jar:6.0.0-1643]
      2018-09-19T14:47:14.703-07:00 ERRO CBAS.tcp.TCPEndpoint [TCPEndpoint IO Thread [/0.0.0.0:9116]] Unexpected tcp io error in connection TCPConnection[Remote Address: /172.23.96.148:9116 Local Address: /0.0.0.0:9116]
      org.apache.hyracks.api.exceptions.NetException: Socket Closed
              at org.apache.hyracks.net.protocols.muxdemux.MultiplexedConnection.driveReaderStateMachine(MultiplexedConnection.java:342) ~[hyracks-net.jar:6.0.0-1643]
              at org.apache.hyracks.net.protocols.muxdemux.MultiplexedConnection.notifyIOReady(MultiplexedConnection.java:113) ~[hyracks-net.jar:6.0.0-1643]
              at org.apache.hyracks.net.protocols.tcp.TCPEndpoint$IOThread.run(TCPEndpoint.java:186) [hyracks-net.jar:6.0.0-1643]
      


          Activity

            Murtadha Hubail added a comment - We have already identified the problem and have a regression test. The fix should be ready today.

            Vikas Chaudhary added a comment - Murtadha Hubail What is the build for the fix?

            Murtadha Hubail added a comment - It’s 1654.

            Mihir Kamdar (Inactive) added a comment - Verified on 6.0.0-1661. The longevity system test completed 6 iterations successfully; this issue didn't surface.

            People

              Assignee: Mihir Kamdar (Inactive)
              Reporter: Mihir Kamdar (Inactive)
              Votes: 0
              Watchers: 3

