Couchbase Server / MB-31355

[System Test] One Analytics node not responding. Has "Failed during startup task" errors


Details

    • Type: Bug
    • Status: Closed
    • Priority: Test Blocker
    • Resolution: Fixed
    • Affects Version: 6.0.0
    • Fix Version: 6.0.0
    • Component: analytics
    • Environment: centos1 cluster

    Description

      Build : 6.0.0-1643
      Test: -test tests/integration/test_allFeatures_alice_timers.yml -scope tests/integration/scope_Xattrs_Alice.yml
      Scale: 2
      Iteration: 5th

      One Analytics node - 172.23.96.56 - is not responding on port 8095. The analytics log is filled with errors like the following.

      2018-09-19T21:28:34.361-07:00 ERRO CBAS.lifecycle.LifeCycleComponentManager [ShutdownHook-d99d0d419ebdc3e457294855c852a996] Lifecycle management not configured org.apache.hyracks.api.lifecycle.LifeCycleComponentManager@23a873ff
      2018-09-19T21:28:47.553-07:00 INFO CBAS.bootstrap.AnalyticsNCApplication [main] Starting Couchbase Analytics driver, version: {"build.date":"Fri Sep 14 13:56:39 PDT 2018","build.number":"1643","build.version":"6.0.0-1643"}
      2018-09-19T21:28:48.581-07:00 ERRO CBAS.message.RegistrationTasksResponseMessage [Executor-10:d99d0d419ebdc3e457294855c852a996] Failed during startup task
      org.apache.hyracks.api.exceptions.HyracksDataException: java.lang.IllegalStateException: Illegal state of component Id. Max disk component Id [1,8] should be less than redo flush component Id [1,1]
      	at org.apache.hyracks.api.exceptions.HyracksDataException.create(HyracksDataException.java:51) ~[hyracks-api.jar:6.0.0-1643]
      	at org.apache.asterix.app.nc.task.MetadataBootstrapTask.perform(MetadataBootstrapTask.java:45) ~[asterix-app.jar:6.0.0-1643]
      	at org.apache.asterix.app.replication.message.RegistrationTasksResponseMessage.handle(RegistrationTasksResponseMessage.java:63) ~[asterix-app.jar:6.0.0-1643]
      	at org.apache.asterix.messaging.NCMessageBroker.lambda$receivedMessage$0(NCMessageBroker.java:100) ~[asterix-app.jar:6.0.0-1643]
      	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_181]
      	at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_181]
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_181]
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_181]
      	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_181]
      Caused by: java.lang.IllegalStateException: Illegal state of component Id. Max disk component Id [1,8] should be less than redo flush component Id [1,1]
      	at org.apache.asterix.app.nc.RecoveryManager.redoFlush(RecoveryManager.java:839) ~[asterix-app.jar:6.0.0-1643]
      	at org.apache.asterix.app.nc.RecoveryManager.startRecoveryRedoPhase(RecoveryManager.java:412) ~[asterix-app.jar:6.0.0-1643]
      	at org.apache.asterix.app.nc.RecoveryManager.replayPartitionsLogs(RecoveryManager.java:185) ~[asterix-app.jar:6.0.0-1643]
      	at org.apache.asterix.app.nc.RecoveryManager.replayReplicaPartitionLogs(RecoveryManager.java:529) ~[asterix-app.jar:6.0.0-1643]
      	at org.apache.asterix.app.nc.ReplicaManager.promote(ReplicaManager.java:120) ~[asterix-app.jar:6.0.0-1643]
      	at org.apache.asterix.app.nc.task.MetadataBootstrapTask.perform(MetadataBootstrapTask.java:41) ~[asterix-app.jar:6.0.0-1643]
      	... 7 more
      2018-09-19T21:28:48.591-07:00 ERRO CBAS.replication.NcLifecycleCoordinator [Executor-2:ClusterController] d99d0d419ebdc3e457294855c852a996 failed to complete startup. 
      org.apache.hyracks.api.exceptions.HyracksDataException: java.lang.IllegalStateException: Illegal state of component Id. Max disk component Id [1,8] should be less than redo flush component Id [1,1]
      	at org.apache.hyracks.api.exceptions.HyracksDataException.create(HyracksDataException.java:51) ~[hyracks-api.jar:6.0.0-1643]
      	at org.apache.asterix.app.nc.task.MetadataBootstrapTask.perform(MetadataBootstrapTask.java:45) ~[asterix-app.jar:6.0.0-1643]
      	at org.apache.asterix.app.replication.message.RegistrationTasksResponseMessage.handle(RegistrationTasksResponseMessage.java:63) ~[asterix-app.jar:6.0.0-1643]
      	at org.apache.asterix.messaging.NCMessageBroker.lambda$receivedMessage$0(NCMessageBroker.java:100) ~[asterix-app.jar:6.0.0-1643]
      	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[?:1.8.0_181]
      	at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_181]
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_181]
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_181]
      	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_181]
      Caused by: java.lang.IllegalStateException: Illegal state of component Id. Max disk component Id [1,8] should be less than redo flush component Id [1,1]
      	at org.apache.asterix.app.nc.RecoveryManager.redoFlush(RecoveryManager.java:839) ~[asterix-app.jar:6.0.0-1643]
      	at org.apache.asterix.app.nc.RecoveryManager.startRecoveryRedoPhase(RecoveryManager.java:412) ~[asterix-app.jar:6.0.0-1643]
      	at org.apache.asterix.app.nc.RecoveryManager.replayPartitionsLogs(RecoveryManager.java:185) ~[asterix-app.jar:6.0.0-1643]
      	at org.apache.asterix.app.nc.RecoveryManager.replayReplicaPartitionLogs(RecoveryManager.java:529) ~[asterix-app.jar:6.0.0-1643]
      	at org.apache.asterix.app.nc.ReplicaManager.promote(ReplicaManager.java:120) ~[asterix-app.jar:6.0.0-1643]
      	at org.apache.asterix.app.nc.task.MetadataBootstrapTask.perform(MetadataBootstrapTask.java:41) ~[asterix-app.jar:6.0.0-1643]
      	... 7 more
      2018-09-19T21:28:48.653-07:00 INFO CBAS.bootstrap.AnalyticsNCApplication [ShutdownHook-d99d0d419ebdc3e457294855c852a996] Stopping Couchbase Analytics driver
      2018-09-19T21:28:50.828-07:00 ERRO CBAS.lifecycle.LifeCycleComponentManager [ShutdownHook-d99d0d419ebdc3e457294855c852a996] Lifecycle management not configured org.apache.hyracks.api.lifecycle.LifeCycleComponentManager@68d04fa4
      
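      To clarify the failed check above: during the recovery redo phase, a FLUSH log record may only be redone if every LSM component already on disk has a smaller component id than the one the flush would produce; the log shows the on-disk state ([1,8]) ahead of the flush record ([1,1]), so the invariant trips. The following is a hypothetical, simplified sketch of that invariant (the real check lives in org.apache.asterix.app.nc.RecoveryManager.redoFlush; the ComponentId record and method shape here are illustrative, not the actual AsterixDB types):

      ```java
      // Simplified sketch of the redo-flush invariant seen in the log above.
      // NOT the real AsterixDB code: ComponentId and redoFlush are illustrative.
      public class RedoFlushInvariant {

          /** Stand-in for an LSM disk component id such as [1,8]. */
          record ComponentId(long minId, long maxId) implements Comparable<ComponentId> {
              @Override
              public int compareTo(ComponentId o) {
                  int c = Long.compare(minId, o.minId);
                  return c != 0 ? c : Long.compare(maxId, o.maxId);
              }
              @Override
              public String toString() {
                  return "[" + minId + "," + maxId + "]";
              }
          }

          /**
           * A FLUSH log record may only be redone if the largest component id
           * already on disk is smaller than the id the flush would create;
           * otherwise the on-disk state is ahead of the transaction log.
           */
          static void redoFlush(ComponentId maxDiskComponentId, ComponentId redoFlushComponentId) {
              if (maxDiskComponentId.compareTo(redoFlushComponentId) >= 0) {
                  throw new IllegalStateException("Illegal state of component Id. Max disk component Id "
                          + maxDiskComponentId + " should be less than redo flush component Id "
                          + redoFlushComponentId);
              }
              // ... otherwise, redo the flush ...
          }

          public static void main(String[] args) {
              // Reproduces the condition from the log: disk already holds [1,8],
              // but the log asks to redo a flush producing [1,1].
              try {
                  redoFlush(new ComponentId(1, 8), new ComponentId(1, 1));
              } catch (IllegalStateException e) {
                  System.out.println(e.getMessage());
              }
          }
      }
      ```

      Under this reading, the node hit the invariant while promoting a replica partition (ReplicaManager.promote in the trace) and replaying its logs, which would explain why the failure surfaced only after the rebalance described below.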

      As a result, rebalances are failing.

      [ns_server:error,2018-09-19T18:30:24.224-07:00,ns_1@172.23.104.61:service_rebalancer-cbas<0.22731.2967>:service_agent:process_bad_results:810]Service call unset_rebalancer (service cbas) failed on some nodes:
      [{'ns_1@172.23.108.104',nack}]
      [user:error,2018-09-19T18:30:24.225-07:00,ns_1@172.23.104.61:<0.1230.623>:ns_orchestrator:do_log_rebalance_completion:1117]Rebalance exited with reason {service_rebalance_failed,cbas,
                                    {rebalance_failed,
                                     {service_error,
                                      <<"Rebalance 7aedd5386e71f7a233ddde48352405ca failed: see analytics.log for details">>}}}
      [ns_server:error,2018-09-19T19:01:19.041-07:00,ns_1@172.23.104.61:service_rebalancer-cbas<0.30457.2983>:service_agent:process_bad_results:810]Service call unset_rebalancer (service cbas) failed on some nodes:
      [{'ns_1@172.23.108.104',nack}]
      [user:error,2018-09-19T19:01:19.041-07:00,ns_1@172.23.104.61:<0.1230.623>:ns_orchestrator:do_log_rebalance_completion:1117]Rebalance exited with reason {service_rebalance_failed,cbas,
                                    {rebalance_failed,
                                     {service_error,
                                      <<"Rebalance be8cd89617ddf71e96f79f24674735d7 failed: see analytics.log for details">>}}}
      

      Analytics queries and DDL statements are also erroring out. Marking this as a blocker since the longevity test is blocked.

      Is this a regression? Not sure. This test has never run for 4+ days before, so we never saw this issue.

      It looks like this all started on 172.23.96.56 after another Analytics node, 172.23.96.148, was rebalanced out at 2018-09-19T14:47:14.

      2018-09-19T14:47:14.702-07:00 ERRO CBAS.tcp.TCPEndpoint [TCPEndpoint IO Thread [/0.0.0.0:9117]] Unexpected tcp io error in connection TCPConnection[Remote Address: /172.23.96.148:48310 Local Address: /0.0.0.0:9117]
      org.apache.hyracks.api.exceptions.NetException: Socket Closed
              at org.apache.hyracks.net.protocols.muxdemux.MultiplexedConnection.driveReaderStateMachine(MultiplexedConnection.java:342) ~[hyracks-net.jar:6.0.0-1643]
              at org.apache.hyracks.net.protocols.muxdemux.MultiplexedConnection.notifyIOReady(MultiplexedConnection.java:113) ~[hyracks-net.jar:6.0.0-1643]
              at org.apache.hyracks.net.protocols.tcp.TCPEndpoint$IOThread.run(TCPEndpoint.java:186) [hyracks-net.jar:6.0.0-1643]
      2018-09-19T14:47:14.702-07:00 ERRO CBAS.tcp.TCPEndpoint [TCPEndpoint IO Thread [null]] Unexpected tcp io error in connection TCPConnection[Remote Address: /172.23.96.148:9117 Local Address: null]
      org.apache.hyracks.api.exceptions.NetException: Socket Closed
              at org.apache.hyracks.net.protocols.muxdemux.MultiplexedConnection.driveReaderStateMachine(MultiplexedConnection.java:342) ~[hyracks-net.jar:6.0.0-1643]
              at org.apache.hyracks.net.protocols.muxdemux.MultiplexedConnection.notifyIOReady(MultiplexedConnection.java:113) ~[hyracks-net.jar:6.0.0-1643]
              at org.apache.hyracks.net.protocols.tcp.TCPEndpoint$IOThread.run(TCPEndpoint.java:186) [hyracks-net.jar:6.0.0-1643]
      2018-09-19T14:47:14.702-07:00 ERRO CBAS.tcp.TCPEndpoint [TCPEndpoint IO Thread [/0.0.0.0:9116]] Unexpected tcp io error in connection TCPConnection[Remote Address: /172.23.96.148:33884 Local Address: /0.0.0.0:9116]
      org.apache.hyracks.api.exceptions.NetException: Socket Closed
              at org.apache.hyracks.net.protocols.muxdemux.MultiplexedConnection.driveReaderStateMachine(MultiplexedConnection.java:342) ~[hyracks-net.jar:6.0.0-1643]
              at org.apache.hyracks.net.protocols.muxdemux.MultiplexedConnection.notifyIOReady(MultiplexedConnection.java:113) ~[hyracks-net.jar:6.0.0-1643]
              at org.apache.hyracks.net.protocols.tcp.TCPEndpoint$IOThread.run(TCPEndpoint.java:186) [hyracks-net.jar:6.0.0-1643]
      2018-09-19T14:47:14.703-07:00 ERRO CBAS.tcp.TCPEndpoint [TCPEndpoint IO Thread [/0.0.0.0:9116]] Unexpected tcp io error in connection TCPConnection[Remote Address: /172.23.96.148:9116 Local Address: /0.0.0.0:9116]
      org.apache.hyracks.api.exceptions.NetException: Socket Closed
              at org.apache.hyracks.net.protocols.muxdemux.MultiplexedConnection.driveReaderStateMachine(MultiplexedConnection.java:342) ~[hyracks-net.jar:6.0.0-1643]
              at org.apache.hyracks.net.protocols.muxdemux.MultiplexedConnection.notifyIOReady(MultiplexedConnection.java:113) ~[hyracks-net.jar:6.0.0-1643]
              at org.apache.hyracks.net.protocols.tcp.TCPEndpoint$IOThread.run(TCPEndpoint.java:186) [hyracks-net.jar:6.0.0-1643]
      


          Activity

            Murtadha Hubail added a comment - We have already identified the problem and have a regression test. The fix should be ready today.

            Vikas Chaudhary added a comment - Murtadha Hubail What is the build for the fix?

            Murtadha Hubail added a comment - It’s 1654.

            Mihir Kamdar (Inactive) added a comment - Verified on 6.0.0-1661. The longevity system test completed 6 iterations successfully; this issue didn't surface.

            People

              Assignee: Mihir Kamdar (Inactive)
              Reporter: Mihir Kamdar (Inactive)
              Votes: 0
              Watchers: 3

