Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-50959

[System Test] Rebalance stuck after recovering and adding back a failed over analytics node

    XMLWordPrintable

Details

    • Bug
    • Status: Closed
    • Critical
    • Resolution: Fixed
    • 7.1.0
    • 7.1.0
    • analytics
    • Enterprise Edition 7.1.0 build 2298

    Description

      Test Steps -

      1. Create a 6 node cluster with 3 KV and 3 cbas nodes.
      2. Create buckets, scopes and collections.
      3. Create dataverses, datasets and indexes and set analytics replica to 3.
      4. Start data load into the KV buckets.
      5. Failover one of the CBAS nodes.
      6. Start analytics query and wait for them to finish (takes almost 3 hours, thousands of queries are run)
      7. Bring back the failed over node.
      8. Rebalance.

      The rebalance was stuck for almost 6 hours during the 2nd iteration of the test.

      Attachments

        No reviews matched the request. Check your Options in the drop-down menu of this sections header.

        Activity

          MB-50959:
          The rebalance was stuck waiting on an ingestion job JID:2.8736 to finish. The job acutally had failed due to:

          2022-02-13T16:01:37.486-08:00 WARN CBAS.operators.FeedMetaStoreNodePushable [SA:JID:2.8736:TAID:TID:ANID:ODID:1:0:18:0:0] Failed to open feed store operator
          org.apache.hyracks.api.exceptions.HyracksDataException: java.lang.NullPointerException
          	at org.apache.hyracks.api.exceptions.HyracksDataException.create(HyracksDataException.java:49) ~[hyracks-api-7.1.0-2298.jar:7.1.0-2298]
          	at org.apache.asterix.runtime.operators.LSMPrimaryUpsertOperatorNodePushable.open(LSMPrimaryUpsertOperatorNodePushable.java:306) ~[asterix-runtime-7.1.0-2298.jar:7.1.0-2298]
          	at org.apache.asterix.external.operators.FeedMetaStoreNodePushable.open(FeedMetaStoreNodePushable.java:138) [asterix-external-data-7.1.0-2298.jar:7.1.0-2298]
          	at org.apache.hyracks.algebricks.runtime.operators.base.AbstractOneInputPushRuntime.open(AbstractOneInputPushRuntime.java:41) [algebricks-runtime-7.1.0-2298.jar:7.1.0-2298]
          	at org.apache.hyracks.algebricks.runtime.operators.std.StreamSelectRuntimeFactory$StreamSelectRuntime.open(StreamSelectRuntimeFactory.java:124) [algebricks-runtime-7.1.0-2298.jar:7.1.0-2298]
          	at org.apache.hyracks.algebricks.runtime.operators.meta.AlgebricksMetaOperatorDescriptor$1.open(AlgebricksMetaOperatorDescriptor.java:150) [algebricks-runtime-7.1.0-2298.jar:7.1.0-2298]
          	at com.couchbase.analytics.runtime.DcpRouteOperatorDescriptor$DcpRouteOperatorNodePushable.open(DcpRouteOperatorDescriptor.java:171) [cbas-connector-7.1.0-2298.jar:7.1.0-2298]
          	at org.apache.hyracks.control.nc.Task.pushFrames(Task.java:414) [hyracks-control-nc-7.1.0-2298.jar:7.1.0-2298]
          	at org.apache.hyracks.control.nc.Task.run(Task.java:354) [hyracks-control-nc-7.1.0-2298.jar:7.1.0-2298]
          	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
          	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
          	at java.lang.Thread.run(Thread.java:829) [?:?]
          Caused by: java.lang.NullPointerException
          	at org.apache.asterix.common.context.BaseOperationTracker.beforeTransaction(BaseOperationTracker.java:70) ~[asterix-common-7.1.0-2298.jar:7.1.0-2298]
          	at org.apache.asterix.transaction.management.service.transaction.AbstractTransactionContext.register(AbstractTransactionContext.java:128) ~[asterix-transactions-7.1.0-2298.jar:7.1.0-2298]
          	at org.apache.asterix.transaction.management.service.transaction.EntityLevelTransactionContext.register(EntityLevelTransactionContext.java:53) ~[asterix-transactions-7.1.0-2298.jar:7.1.0-2298]
          	at com.couchbase.analytics.runtime.ShadowModificationOperationCallbackFactory.createModificationOperationCallback(ShadowModificationOperationCallbackFactory.java:55) ~[cbas-connector-7.1.0-2298.jar:7.1.0-2298]
          	at org.apache.asterix.runtime.operators.LSMPrimaryUpsertOperatorNodePushable.open(LSMPrimaryUpsertOperatorNodePushable.java:266) ~[asterix-runtime-7.1.0-2298.jar:7.1.0-2298]
          	... 10 more
          

          But we failed to report the failure to the CC due to:

          2022-02-13T16:01:37.590-08:00 WARN CBAS.work.WorkQueue [Worker:06b455985ebffbb0feca29f767bbbc9b] Exception while executing NotifyTaskFailure: [06b455985ebffbb0feca29f767bbbc9b[JID:2.8736:TAID:TID:ANID:ODID:1:0:18:0]
          java.lang.NullPointerException: null
          	at org.apache.asterix.common.context.BaseOperationTracker.afterTransaction(BaseOperationTracker.java:76) ~[asterix-common-7.1.0-2298.jar:7.1.0-2298]
          	at org.apache.asterix.transaction.management.service.transaction.AbstractTransactionContext.lambda$complete$0(AbstractTransactionContext.java:115) ~[asterix-transactions-7.1.0-2298.jar:7.1.0-2298]
          	at java.util.HashMap.forEach(HashMap.java:1337) ~[?:?]
          	at org.apache.asterix.transaction.management.service.transaction.AbstractTransactionContext.complete(AbstractTransactionContext.java:115) ~[asterix-transactions-7.1.0-2298.jar:7.1.0-2298]
          	at org.apache.asterix.transaction.management.service.transaction.TransactionManager.abortTransaction(TransactionManager.java:122) ~[asterix-transactions-7.1.0-2298.jar:7.1.0-2298]
          	at org.apache.asterix.runtime.job.listener.MultiTransactionJobletEventListenerFactory$1.jobletFinish(MultiTransactionJobletEventListenerFactory.java:82) ~[asterix-runtime-7.1.0-2298.jar:7.1.0-2298]
          	at org.apache.hyracks.control.nc.Joblet.performCleanup(Joblet.java:350) ~[hyracks-control-nc-7.1.0-2298.jar:7.1.0-2298]
          	at org.apache.hyracks.control.nc.Joblet.removeTask(Joblet.java:180) ~[hyracks-control-nc-7.1.0-2298.jar:7.1.0-2298]
          	at org.apache.hyracks.control.nc.work.NotifyTaskFailureWork.run(NotifyTaskFailureWork.java:66) ~[hyracks-control-nc-7.1.0-2298.jar:7.1.0-2298]
          	at org.apache.hyracks.control.common.work.WorkQueue$WorkerThread.run(WorkQueue.java:127) [hyracks-control-common-7.1.0-2298.jar:7.1.0-2298]
          

          We fixed the reporting to the CC issue and now we should try to reconnect the ingestion job again. However, we still don't know why the ingestion job itself failed and unfortunately the debug logs that might've helped rolled over. I added new logs in case this issue isn't fixed on the next reconnect attempt to help us trace it.

          murtadha.hubail Murtadha Hubail added a comment - MB-50959 : The rebalance was stuck waiting on an ingestion job JID:2.8736 to finish. The job acutally had failed due to: 2022-02-13T16:01:37.486-08:00 WARN CBAS.operators.FeedMetaStoreNodePushable [SA:JID:2.8736:TAID:TID:ANID:ODID:1:0:18:0:0] Failed to open feed store operator org.apache.hyracks.api.exceptions.HyracksDataException: java.lang.NullPointerException at org.apache.hyracks.api.exceptions.HyracksDataException.create(HyracksDataException.java:49) ~[hyracks-api-7.1.0-2298.jar:7.1.0-2298] at org.apache.asterix.runtime.operators.LSMPrimaryUpsertOperatorNodePushable.open(LSMPrimaryUpsertOperatorNodePushable.java:306) ~[asterix-runtime-7.1.0-2298.jar:7.1.0-2298] at org.apache.asterix.external.operators.FeedMetaStoreNodePushable.open(FeedMetaStoreNodePushable.java:138) [asterix-external-data-7.1.0-2298.jar:7.1.0-2298] at org.apache.hyracks.algebricks.runtime.operators.base.AbstractOneInputPushRuntime.open(AbstractOneInputPushRuntime.java:41) [algebricks-runtime-7.1.0-2298.jar:7.1.0-2298] at org.apache.hyracks.algebricks.runtime.operators.std.StreamSelectRuntimeFactory$StreamSelectRuntime.open(StreamSelectRuntimeFactory.java:124) [algebricks-runtime-7.1.0-2298.jar:7.1.0-2298] at org.apache.hyracks.algebricks.runtime.operators.meta.AlgebricksMetaOperatorDescriptor$1.open(AlgebricksMetaOperatorDescriptor.java:150) [algebricks-runtime-7.1.0-2298.jar:7.1.0-2298] at com.couchbase.analytics.runtime.DcpRouteOperatorDescriptor$DcpRouteOperatorNodePushable.open(DcpRouteOperatorDescriptor.java:171) [cbas-connector-7.1.0-2298.jar:7.1.0-2298] at org.apache.hyracks.control.nc.Task.pushFrames(Task.java:414) [hyracks-control-nc-7.1.0-2298.jar:7.1.0-2298] at org.apache.hyracks.control.nc.Task.run(Task.java:354) [hyracks-control-nc-7.1.0-2298.jar:7.1.0-2298] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?] at java.lang.Thread.run(Thread.java:829) [?:?] Caused by: java.lang.NullPointerException at org.apache.asterix.common.context.BaseOperationTracker.beforeTransaction(BaseOperationTracker.java:70) ~[asterix-common-7.1.0-2298.jar:7.1.0-2298] at org.apache.asterix.transaction.management.service.transaction.AbstractTransactionContext.register(AbstractTransactionContext.java:128) ~[asterix-transactions-7.1.0-2298.jar:7.1.0-2298] at org.apache.asterix.transaction.management.service.transaction.EntityLevelTransactionContext.register(EntityLevelTransactionContext.java:53) ~[asterix-transactions-7.1.0-2298.jar:7.1.0-2298] at com.couchbase.analytics.runtime.ShadowModificationOperationCallbackFactory.createModificationOperationCallback(ShadowModificationOperationCallbackFactory.java:55) ~[cbas-connector-7.1.0-2298.jar:7.1.0-2298] at org.apache.asterix.runtime.operators.LSMPrimaryUpsertOperatorNodePushable.open(LSMPrimaryUpsertOperatorNodePushable.java:266) ~[asterix-runtime-7.1.0-2298.jar:7.1.0-2298] ... 10 more But we failed to report the failure to the CC due to: 2022-02-13T16:01:37.590-08:00 WARN CBAS.work.WorkQueue [Worker:06b455985ebffbb0feca29f767bbbc9b] Exception while executing NotifyTaskFailure: [06b455985ebffbb0feca29f767bbbc9b[JID:2.8736:TAID:TID:ANID:ODID:1:0:18:0] java.lang.NullPointerException: null at org.apache.asterix.common.context.BaseOperationTracker.afterTransaction(BaseOperationTracker.java:76) ~[asterix-common-7.1.0-2298.jar:7.1.0-2298] at org.apache.asterix.transaction.management.service.transaction.AbstractTransactionContext.lambda$complete$0(AbstractTransactionContext.java:115) ~[asterix-transactions-7.1.0-2298.jar:7.1.0-2298] at java.util.HashMap.forEach(HashMap.java:1337) ~[?:?] at org.apache.asterix.transaction.management.service.transaction.AbstractTransactionContext.complete(AbstractTransactionContext.java:115) ~[asterix-transactions-7.1.0-2298.jar:7.1.0-2298] at org.apache.asterix.transaction.management.service.transaction.TransactionManager.abortTransaction(TransactionManager.java:122) ~[asterix-transactions-7.1.0-2298.jar:7.1.0-2298] at org.apache.asterix.runtime.job.listener.MultiTransactionJobletEventListenerFactory$1.jobletFinish(MultiTransactionJobletEventListenerFactory.java:82) ~[asterix-runtime-7.1.0-2298.jar:7.1.0-2298] at org.apache.hyracks.control.nc.Joblet.performCleanup(Joblet.java:350) ~[hyracks-control-nc-7.1.0-2298.jar:7.1.0-2298] at org.apache.hyracks.control.nc.Joblet.removeTask(Joblet.java:180) ~[hyracks-control-nc-7.1.0-2298.jar:7.1.0-2298] at org.apache.hyracks.control.nc.work.NotifyTaskFailureWork.run(NotifyTaskFailureWork.java:66) ~[hyracks-control-nc-7.1.0-2298.jar:7.1.0-2298] at org.apache.hyracks.control.common.work.WorkQueue$WorkerThread.run(WorkQueue.java:127) [hyracks-control-common-7.1.0-2298.jar:7.1.0-2298] We fixed the reporting to the CC issue and now we should try to reconnect the ingestion job again. However, we still don't know why the ingestion job itself failed and unfortunately the debug logs that might've helped rolled over. I added new logs in case this issue isn't fixed on the next reconnect attempt to help us trace it.
          umang.agrawal Umang added a comment -

          Verified with 7.1.0-2506

          umang.agrawal Umang added a comment - Verified with 7.1.0-2506

          People

            umang.agrawal Umang
            umang.agrawal Umang
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Gerrit Reviews

                There are no open Gerrit changes

                PagerDuty