  1. Couchbase Server
  2. MB-62931

[System Test] Queries failing with internal errors - "[cluster monitor] interrupted during pool stream"


Details

    • Type: Bug
    • Resolution: Not a Bug
    • Priority: Critical
    • Columnar 1.0.0
    • Columnar 1.0.0
    • Component: analytics
    • Build: 1.0.0-2239
    • Triage: Untriaged
    • 0
    • Unknown
    • Sprint: Analytics Sprint 47

    Description

      During the system test run, the remote cluster was torn down automatically while the Columnar cluster was still running. Since then, the errors below have kept recurring in the logs, and many queries have failed with internal errors.

      2024-07-28T18:46:57.308+00:00 WARN CBAS.bootstrap.ClusterMonitor [RecoveryTask (linkuTqJNHAv/default1)] Remote bucket map failed to bootstrap in 60s; restarting monitor
      2024-07-28T18:46:57.308+00:00 INFO CBAS.bootstrap.ClusterMonitor [linkuTqJNHAv cluster monitor] interrupted during pool stream; will retry
      java.lang.InterruptedException: sleep interrupted
      	at java.base/java.lang.Thread.sleep(Native Method) ~[?:?]
      	at java.base/java.lang.Thread.sleep(Thread.java:344) ~[?:?]
      	at java.base/java.util.concurrent.TimeUnit.sleep(TimeUnit.java:446) ~[?:?]
      	at com.couchbase.analytics.bootstrap.ClusterMonitor.startMonitor(ClusterMonitor.java:93) ~[columnar-server.jar:1.0.0-2239]
      	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
      	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
      	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
      	at java.base/java.lang.Thread.run(Thread.java:840) [?:?]
      2024-07-28T18:46:57.308+00:00 INFO CBAS.active.RecoveryTask [RecoveryTask (linkuTqJNHAv/default1)] Attempt to revive linkuTqJNHAv/default1 failed
      com.couchbase.analytics.common.exceptions.AnalyticsHyracksException: CBAS0079: Failed to connect link 'linkuTqJNHAv': HYR0091: Operation timed out
      	at com.couchbase.analytics.util.BucketValidationUtils.ensureKVBucket(BucketValidationUtils.java:44) ~[columnar-connector.jar:1.0.0-2239]
      	at com.couchbase.analytics.lang.ConnectLinkStatement.doConnect(ConnectLinkStatement.java:1122) ~[columnar-connector.jar:1.0.0-2239]
      	at com.couchbase.analytics.metadata.BucketEventsListener.doConnect(BucketEventsListener.java:484) ~[columnar-connector.jar:1.0.0-2239]
      	at com.couchbase.analytics.metadata.BucketEventsListener.compileAndStartJob(BucketEventsListener.java:470) ~[columnar-connector.jar:1.0.0-2239]
      	at org.apache.asterix.app.active.ActiveEntityEventsListener.doStart(ActiveEntityEventsListener.java:403) ~[asterix-app.jar:1.0.0-2239]
      	at org.apache.asterix.app.active.ActiveEntityEventsListener.doRecover(ActiveEntityEventsListener.java:430) ~[asterix-app.jar:1.0.0-2239]
      	at org.apache.asterix.app.active.RecoveryTask.doRecover(RecoveryTask.java:142) ~[asterix-app.jar:1.0.0-2239]
      	at org.apache.asterix.app.active.RecoveryTask.lambda$recover$1(RecoveryTask.java:70) ~[asterix-app.jar:1.0.0-2239]
      	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
      	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
      	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
      	at java.base/java.lang.Thread.run(Thread.java:840) [?:?]
      2024-07-28T18:46:57.308+00:00 INFO CBAS.metadata.RecoveryRetryPolicy [RecoveryTask (linkuTqJNHAv/default1)] will retry recovery (attempt 352) in 60s
      2024-07-28T18:47:07.279+00:00 INFO CBAS.bootstrap.ClusterMonitor [linkuTqJNHAv cluster monitor] interrupted during pool stream; will retry
      java.lang.InterruptedException: sleep interrupted
      	at java.base/java.lang.Thread.sleep(Native Method) ~[?:?]
      	at java.base/java.lang.Thread.sleep(Thread.java:344) ~[?:?]
      	at java.base/java.util.concurrent.TimeUnit.sleep(TimeUnit.java:446) ~[?:?]
      	at com.couchbase.analytics.bootstrap.ClusterMonitor.startMonitor(ClusterMonitor.java:93) ~[columnar-server.jar:1.0.0-2239]
      	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
      	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
      	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
      	at java.base/java.lang.Thread.run(Thread.java:840) [?:?]
      2024-07-28T18:47:22.279+00:00 INFO CBAS.bootstrap.ClusterMonitor [linkuTqJNHAv cluster monitor] interrupted during pool stream; will retry
      java.lang.InterruptedException: sleep interrupted
      	at java.base/java.lang.Thread.sleep(Native Method) ~[?:?]
      	at java.base/java.lang.Thread.sleep(Thread.java:344) ~[?:?]
      	at java.base/java.util.concurrent.TimeUnit.sleep(TimeUnit.java:446) ~[?:?]
      	at com.couchbase.analytics.bootstrap.ClusterMonitor.startMonitor(ClusterMonitor.java:93) ~[columnar-server.jar:1.0.0-2239]
      	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
      	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
      	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
      	at java.base/java.lang.Thread.run(Thread.java:840) [?:?]
      2024-07-28T18:47:37.279+00:00 INFO CBAS.bootstrap.ClusterMonitor [linkuTqJNHAv cluster monitor] interrupted during pool stream; will retry
      java.lang.InterruptedException: sleep interrupted
      	at java.base/java.lang.Thread.sleep(Native Method) ~[?:?]
      	at java.base/java.lang.Thread.sleep(Thread.java:344) ~[?:?]
      	at java.base/java.util.concurrent.TimeUnit.sleep(TimeUnit.java:446) ~[?:?]
      	at com.couchbase.analytics.bootstrap.ClusterMonitor.startMonitor(ClusterMonitor.java:93) ~[columnar-server.jar:1.0.0-2239]
      	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
      	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
      	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
      	at java.base/java.lang.Thread.run(Thread.java:840) [?:?]
      2024-07-28T18:47:52.279+00:00 INFO CBAS.bootstrap.ClusterMonitor [linkuTqJNHAv cluster monitor] interrupted during pool stream; will retry
      java.lang.InterruptedException: sleep interrupted
      	at java.base/java.lang.Thread.sleep(Native Method) ~[?:?]
      	at java.base/java.lang.Thread.sleep(Thread.java:344) ~[?:?]
      	at java.base/java.util.concurrent.TimeUnit.sleep(TimeUnit.java:446) ~[?:?]
      	at com.couchbase.analytics.bootstrap.ClusterMonitor.startMonitor(ClusterMonitor.java:93) ~[columnar-server.jar:1.0.0-2239]
      	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
      	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
      	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
      	at java.base/java.lang.Thread.run(Thread.java:840) [?:?]
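
      Note on the repeated trace above: `java.lang.InterruptedException: sleep interrupted` is what the JVM reports when a thread is interrupted while it is inside `Thread.sleep` or `TimeUnit.sleep`. One plausible reading, which is an assumption and not confirmed from the ClusterMonitor source, is that the monitor thread is sleeping before a retry when the "restarting monitor" path interrupts it. The minimal Java sketch below reproduces only that generic JVM behaviour. The class, method, and thread names are hypothetical, and this is not the ClusterMonitor implementation.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical demo class; it only illustrates how an interrupt surfaces from a sleeping thread.
public class SleepInterruptDemo {

    /**
     * Starts a thread that sleeps as if backing off before a retry, interrupts it
     * from the caller, and returns what the sleeping thread observed.
     */
    static String interruptSleepingMonitor() throws InterruptedException {
        AtomicReference<String> observed = new AtomicReference<>("not interrupted");
        CountDownLatch started = new CountDownLatch(1);

        Thread monitor = new Thread(() -> {
            started.countDown();
            try {
                // Stand-in for a retry back-off; the 60s value is illustrative only.
                TimeUnit.SECONDS.sleep(60);
            } catch (InterruptedException e) {
                // The interrupt ends the sleep early and is reported as an exception.
                observed.set(e.getClass().getName() + ": " + e.getMessage());
            }
        }, "hypothetical cluster monitor");

        monitor.start();
        started.await();     // wait until the monitor thread is running
        monitor.interrupt(); // stand-in for a recovery task restarting the monitor
        monitor.join();      // join() makes the worker's write visible to this thread
        return observed.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(interruptSleepingMonitor());
    }
}
```

      If that reading is right, each "interrupted during pool stream; will retry" entry would reflect a monitor restart rather than a separate failure, with the failed link connection (CBAS0079 / HYR0091) as the underlying problem. The logs alone do not prove this.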
      

      I disconnected the remote link to see whether the test could be resumed. The service then appears to have crashed, with the JVM exiting with status 2:

      2024-07-29T02:51:38.931+00:00 INFO CBAS.messaging.NCMessageBroker [Executor-12:7a5b827b88d5f6077bc1a32e4c548fc1] Received message: TxnIdBlockResponse
      2024-07-29T02:51:38.940+00:00 ERRO CBAS.message.RegistrationTasksResponseMessage [Executor-8:7a5b827b88d5f6077bc1a32e4c548fc1] Failed during startup task
      org.apache.hyracks.api.exceptions.HyracksDataException: org.apache.asterix.common.exceptions.MetadataException: java.lang.IllegalStateException: attempt to create metadata index Database. Index should already exist
      	at org.apache.hyracks.api.exceptions.HyracksDataException.create(HyracksDataException.java:49) ~[hyracks-api.jar:1.0.0-2239]
      	at org.apache.asterix.app.nc.task.MetadataBootstrapTask.perform(MetadataBootstrapTask.java:55) ~[asterix-app.jar:1.0.0-2239]
      	at org.apache.asterix.app.replication.message.RegistrationTasksResponseMessage.handle(RegistrationTasksResponseMessage.java:63) ~[asterix-app.jar:1.0.0-2239]
      	at org.apache.asterix.messaging.NCMessageBroker.lambda$receivedMessage$0(NCMessageBroker.java:108) ~[asterix-app.jar:1.0.0-2239]
      	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539) [?:?]
      	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
      	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
      	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
      	at java.base/java.lang.Thread.run(Thread.java:840) [?:?]
      Caused by: org.apache.asterix.common.exceptions.MetadataException: java.lang.IllegalStateException: attempt to create metadata index Database. Index should already exist
      	at org.apache.asterix.metadata.bootstrap.MetadataBootstrap.startUniverse(MetadataBootstrap.java:193) ~[asterix-metadata.jar:1.0.0-2239]
      	at org.apache.asterix.app.nc.NCAppRuntimeContext.initializeMetadata(NCAppRuntimeContext.java:539) ~[asterix-app.jar:1.0.0-2239]
      	at org.apache.asterix.app.nc.task.MetadataBootstrapTask.perform(MetadataBootstrapTask.java:50) ~[asterix-app.jar:1.0.0-2239]
      	... 7 more
      Caused by: java.lang.IllegalStateException: attempt to create metadata index Database. Index should already exist
      	at org.apache.asterix.metadata.bootstrap.MetadataBootstrap.ensureCatalogUpgradability(MetadataBootstrap.java:610) ~[asterix-metadata.jar:1.0.0-2239]
      	at org.apache.asterix.metadata.bootstrap.MetadataBootstrap.enlistMetadataDataset(MetadataBootstrap.java:443) ~[asterix-metadata.jar:1.0.0-2239]
      	at org.apache.asterix.metadata.bootstrap.MetadataBootstrap.startUniverse(MetadataBootstrap.java:155) ~[asterix-metadata.jar:1.0.0-2239]
      	at org.apache.asterix.app.nc.NCAppRuntimeContext.initializeMetadata(NCAppRuntimeContext.java:539) ~[asterix-app.jar:1.0.0-2239]
      	at org.apache.asterix.app.nc.task.MetadataBootstrapTask.perform(MetadataBootstrapTask.java:50) ~[asterix-app.jar:1.0.0-2239]
      	... 7 more
      2024-07-29T02:51:38.948+00:00 INFO CBAS.util.ExitUtil [ShutdownWatchdog] starting shutdown watchdog- system will halt if shutdown is not completed within 600 seconds
      2024-07-29T02:51:38.948+00:00 WARN CBAS.util.ExitUtil [JVM exit thread] JVM exiting with status 2; bye!
      java.lang.Throwable: exit callstack
      	at org.apache.hyracks.util.ExitUtil.exit(ExitUtil.java:92) ~[hyracks-util.jar:1.0.0-2239]
      	at org.apache.asterix.app.replication.message.RegistrationTasksResponseMessage.handle(RegistrationTasksResponseMessage.java:90) ~[asterix-app.jar:1.0.0-2239]
      	at org.apache.asterix.messaging.NCMessageBroker.lambda$receivedMessage$0(NCMessageBroker.java:108) ~[asterix-app.jar:1.0.0-2239]
      	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539) ~[?:?]
      	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
      	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[?:?]
      	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
      	at java.base/java.lang.Thread.run(Thread.java:840) ~[?:?]
      2024-07-29T02:51:38.949+00:00 INFO CBAS.messaging.CCMessageBroker [Executor-10:ClusterController] Received message: 
      

      cbcollect logs:

      https://cb-engineering.s3.amazonaws.com/SysTestColumnarRC3Half/collectinfo-2024-07-29T033732-ns_1%40svc-da-node-014.oinjtxrxhmvsl0g5.sandbox.nonprod-project-avengers.com.zip
      https://cb-engineering.s3.amazonaws.com/SysTestColumnarRC3Half/collectinfo-2024-07-29T033732-ns_1%40svc-da-node-020.oinjtxrxhmvsl0g5.sandbox.nonprod-project-avengers.com.zip
      https://cb-engineering.s3.amazonaws.com/SysTestColumnarRC3Half/collectinfo-2024-07-29T033732-ns_1%40svc-da-node-022.oinjtxrxhmvsl0g5.sandbox.nonprod-project-avengers.com.zip
      https://cb-engineering.s3.amazonaws.com/SysTestColumnarRC3Half/collectinfo-2024-07-29T033732-ns_1%40svc-da-node-028.oinjtxrxhmvsl0g5.sandbox.nonprod-project-avengers.com.zip


            People

              pavan.pb Pavan PB
              pavan.pb Pavan PB