Couchbase Server / MB-51402

Deadlock in storage, causing stuck nodes followed by crash


Details

    • Untriaged
    • 1
    • Unknown
    • CX Sprint 283

    Description

      2022-03-10T09:25:59.751-08:00 WARN TestMethodTracer [main] ### test[ClusterConcurrentRebalanceIT 98: feeds: connect-meta-filtered-buckets] START
      2022-03-10T09:25:59.751-08:00 INFO TestExecutor [main] Starting [TEST]: feeds/connect-meta-filtered-buckets ... 
      2022-03-10T09:25:59.751-08:00 INFO TestExecutor [main] Executing test file: connect-meta-filtered-buckets.000.ddl.sqlpp
      2022-03-10T09:25:59.759-08:00 INFO ClusterIntegrationUtil [Rebalancer-22] Rebalancing from [0, 3] to [0, 1]
      2022-03-10T09:25:59.768-08:00 INFO ClusterIntegrationUtil [Rebalancer-22] Updated state to reflect cluster (active was: [0, 3] now: [0, 2, 3, 4]; failed was: [] now: [])
      2022-03-10T09:25:59.767-08:00 ERRO TestExecutor [main] testFile src/test/resources/runtimets/queries/feeds/connect-meta-filtered-buckets/connect-meta-filtered-buckets.000.ddl.sqlpp raised an unexpected exception
      java.lang.Exception: Analytics Service is temporarily unavailable
      

      on n_4:

      2022-03-10T09:21:40.488-08:00 WARN CBAS.management.ReplicationChannel [Replication Worker] Unexpected error during replication.
      org.apache.asterix.common.exceptions.ReplicationException: java.util.concurrent.TimeoutException: Couldn't receive flush lsn from master
      	at org.apache.asterix.replication.messaging.MarkComponentValidTask.ensureComponentLsnFlushed(MarkComponentValidTask.java:93) ~[asterix-replication-7.1.0-0000.jar:7.1.0-0000]
      	at org.apache.asterix.replication.messaging.MarkComponentValidTask.perform(MarkComponentValidTask.java:64) ~[asterix-replication-7.1.0-0000.jar:7.1.0-0000]
      	at org.apache.asterix.replication.management.ReplicationChannel$ReplicationWorker.handle(ReplicationChannel.java:157) ~[asterix-replication-7.1.0-0000.jar:7.1.0-0000]
      	at org.apache.asterix.replication.management.ReplicationChannel$ReplicationWorker.run(ReplicationChannel.java:134) [asterix-replication-7.1.0-0000.jar:7.1.0-0000]
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
      	at java.lang.Thread.run(Thread.java:834) [?:?]
      Caused by: java.util.concurrent.TimeoutException: Couldn't receive flush lsn from master
      	... 7 more
      2022-03-10T09:22:10.494-08:00 WARN CBAS.management.ReplicationChannel [Replication Worker] Unexpected error during replication.
      org.apache.asterix.common.exceptions.ReplicationException: java.util.concurrent.TimeoutException: Couldn't receive flush lsn from master
      	at org.apache.asterix.replication.messaging.MarkComponentValidTask.ensureComponentLsnFlushed(MarkComponentValidTask.java:93) ~[asterix-replication-7.1.0-0000.jar:7.1.0-0000]
      	at org.apache.asterix.replication.messaging.MarkComponentValidTask.perform(MarkComponentValidTask.java:64) ~[asterix-replication-7.1.0-0000.jar:7.1.0-0000]
      	at org.apache.asterix.replication.management.ReplicationChannel$ReplicationWorker.handle(ReplicationChannel.java:157) ~[asterix-replication-7.1.0-0000.jar:7.1.0-0000]
      	at org.apache.asterix.replication.management.ReplicationChannel$ReplicationWorker.run(ReplicationChannel.java:134) [asterix-replication-7.1.0-0000.jar:7.1.0-0000]
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
      	at java.lang.Thread.run(Thread.java:834) [?:?]
      Caused by: java.util.concurrent.TimeoutException: Couldn't receive flush lsn from master
      	... 7 more
      2022-03-10T09:23:56.322-08:00 FATA CBAS.bootstrap.DeadlockWatchdog [Timer-0] Detected 3 threads ([41, 201, 196]); thread dump: 
       
      "Executor-2:d649476aff580f0be92c36769da94f66" [tid=41 state=BLOCKED lock=org.apache.hyracks.storage.am.lsm.common.impls.IoOperationExecutor@354462df lockOwner="Executor-99:d649476aff580f0be92c36769da94f66" (tid=201)]
      	at app//org.apache.hyracks.storage.am.lsm.common.impls.AbstractAsynchronousScheduler.scheduleFlush(AbstractAsynchronousScheduler.java:94)
      	at app//org.apache.hyracks.storage.am.lsm.common.impls.AbstractAsynchronousScheduler.scheduleOperation(AbstractAsynchronousScheduler.java:56)
      	at app//org.apache.hyracks.storage.am.lsm.common.impls.LSMHarness.scheduleFlush(LSMHarness.java:512)
      	at app//org.apache.hyracks.storage.am.lsm.common.impls.LSMTreeIndexAccessor.scheduleFlush(LSMTreeIndexAccessor.java:139)
      	at app//org.apache.asterix.common.context.PrimaryIndexOperationTracker.triggerScheduleFlush(PrimaryIndexOperationTracker.java:226)
      	- <locked java.util.HashMap@612648c1>
      	- <locked org.apache.asterix.common.context.PrimaryIndexOperationTracker@650b9a9>
      	at app//org.apache.asterix.transaction.management.service.logging.LogBuffer.notifyFlushTermination(LogBuffer.java:293)
      	at app//org.apache.asterix.transaction.management.service.logging.LogBuffer.batchUnlock(LogBuffer.java:240)
      	at app//org.apache.asterix.transaction.management.service.logging.LogBuffer.internalFlush(LogBuffer.java:208)
      	at app//org.apache.asterix.transaction.management.service.logging.LogBuffer.flush(LogBuffer.java:181)
      	at app//org.apache.asterix.transaction.management.service.logging.LogFlusher.call(LogManager.java:665)
      	at app//org.apache.asterix.transaction.management.service.logging.LogFlusher.call(LogManager.java:620)
      	at java.base@11.0.9.1/java.util.concurrent.FutureTask.run(FutureTask.java:264)
      	at java.base@11.0.9.1/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
      	at java.base@11.0.9.1/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
      	at java.base@11.0.9.1/java.lang.Thread.run(Thread.java:834)
       
      	Locked synchronizers:
      	- java.util.concurrent.ThreadPoolExecutor$Worker@35c4697c
       
      "Executor-94:d649476aff580f0be92c36769da94f66" [tid=196 state=BLOCKED lock=java.util.HashMap@612648c1 lockOwner="Executor-2:d649476aff580f0be92c36769da94f66" (tid=41)]
      	at app//org.apache.asterix.common.context.PrimaryIndexOperationTracker.completed(PrimaryIndexOperationTracker.java:240)
      	at app//org.apache.hyracks.storage.am.lsm.common.impls.AbstractIoOperation.complete(AbstractIoOperation.java:146)
      	- <locked org.apache.hyracks.storage.am.lsm.btree.impls.LSMBTreeFlushOperation@df7ee40>
      	at app//org.apache.hyracks.storage.am.lsm.common.impls.IoOperationExecutor.doAfterExecute(IoOperationExecutor.java:77)
      	at app//org.apache.hyracks.storage.am.lsm.common.impls.IoOperationExecutor.afterExecute(IoOperationExecutor.java:64)
      	at java.base@11.0.9.1/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1129)
      	at java.base@11.0.9.1/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
      	at java.base@11.0.9.1/java.lang.Thread.run(Thread.java:834)
       
      	Locked synchronizers:
      	- java.util.concurrent.ThreadPoolExecutor$Worker@8edc4b4
       
      "Executor-99:d649476aff580f0be92c36769da94f66" [tid=201 state=BLOCKED lock=org.apache.hyracks.storage.am.lsm.btree.impls.LSMBTreeFlushOperation@df7ee40 lockOwner="Executor-94:d649476aff580f0be92c36769da94f66" (tid=196)]
      	at app//org.apache.hyracks.storage.am.lsm.common.impls.AbstractIoOperation.isCompleted(AbstractIoOperation.java:207)
      	at app//org.apache.hyracks.storage.am.lsm.common.impls.AbstractAsynchronousScheduler.completeFlush(AbstractAsynchronousScheduler.java:142)
      	- <locked org.apache.hyracks.storage.am.lsm.common.impls.IoOperationExecutor@354462df>
      	at app//org.apache.hyracks.storage.am.lsm.common.impls.AbstractAsynchronousScheduler.completeOperation(AbstractAsynchronousScheduler.java:74)
      	at app//org.apache.hyracks.storage.am.lsm.common.impls.IoOperationExecutor.doAfterExecute(IoOperationExecutor.java:79)
      	at app//org.apache.hyracks.storage.am.lsm.common.impls.IoOperationExecutor.afterExecute(IoOperationExecutor.java:64)
      	at java.base@11.0.9.1/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1129)
      	at java.base@11.0.9.1/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
      	at java.base@11.0.9.1/java.lang.Thread.run(Thread.java:834)
       
      	Locked synchronizers:
      	- java.util.concurrent.ThreadPoolExecutor$Worker@471fe36b
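      The dump above shows a three-way lock cycle: tid 41 (in scheduleFlush) is blocked on the IoOperationExecutor monitor held by tid 201; tid 201 (in completeFlush) is blocked on the LSMBTreeFlushOperation monitor held by tid 196; and tid 196 (in PrimaryIndexOperationTracker.completed) is blocked on the HashMap monitor held by tid 41, closing the cycle. A minimal sketch (not Couchbase/Hyracks code; the class and method names here are illustrative only) that builds the waits-for graph from the dump and confirms the cycle:

      ```java
      import java.util.*;

      // Sketch: model each BLOCKED thread from the dump as an edge
      // "tid -> tid of the lock owner it is waiting on", then walk the
      // graph from a starting tid to see whether it returns to itself.
      public class WaitForGraph {

          // Returns the cycle as a list of tids starting at `start`,
          // or an empty list if the walk dead-ends or misses `start`.
          static List<Integer> findCycle(Map<Integer, Integer> waitsFor, int start) {
              List<Integer> path = new ArrayList<>();
              Integer cur = start;
              while (cur != null && !path.contains(cur)) {
                  path.add(cur);
                  cur = waitsFor.get(cur); // follow the waits-for edge
              }
              // a deadlock cycle through `start` exists iff we came back to it
              return (cur != null && cur == start) ? path : Collections.emptyList();
          }

          public static void main(String[] args) {
              Map<Integer, Integer> waitsFor = new HashMap<>();
              waitsFor.put(41, 201);  // tid 41 blocked on IoOperationExecutor held by tid 201
              waitsFor.put(201, 196); // tid 201 blocked on LSMBTreeFlushOperation held by tid 196
              waitsFor.put(196, 41);  // tid 196 blocked on HashMap held by tid 41
              System.out.println(findCycle(waitsFor, 41)); // [41, 201, 196]
          }
      }
      ```

      The same cycle is what the DeadlockWatchdog reports as "Detected 3 threads ([41, 201, 196])"; because every thread in the cycle holds a monitor the next one needs, none can make progress, matching the stuck-node symptom in the title.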
      


            People

              umang.agrawal Umang
              michael.blow Michael Blow
              Votes: 0
              Watchers: 4

