  1. Couchbase Server
  2. MB-62934

[60TB, RC3]: Rebalance has been hung for 15+ hours while scaling the cluster from 16 to 8 nodes. Cluster is unusable and the service is unavailable


Details

    Description

      1. Create a 32 node columnar cluster. Ingest 1B items per remote collection in 20 collections.
      2. 20TB in columnar. Disconnect previous link and create new link and 20 more collections.
      3. Start scaling operations from 32 -> 16 -> 8 -> 4 -> 2 -> 4 -> 8 -> 16 -> 32
      4. 40TB in columnar. Disconnect previous link and create new link and 20 more collections.
      5. Start scaling operations from 32 -> 16 -> 8 -> 4 -> 2 -> 4 -> 8 -> 16 -> 32
      6. 60TB in columnar. Disconnect previous link and create new link and 20 more collections.
      7. Start scaling operations from 32 -> 16 -> 8 -> 4 -> 2 -> 4 -> 8 -> 16 -> 32
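
The scaling sequence in the steps above can be enumerated with a short sketch. This is only an illustration of the test plan, not the harness that was actually used; the names `SCALING_CYCLE`, `PHASE_TB`, `scaling_steps`, and `build_plan` are hypothetical, and tagging each scaling run with the data volume from the step just before it is an assumption about how the steps are meant to be read.

```python
from typing import Iterator, List, Tuple

# Node counts for one scaling cycle, as listed in steps 3, 5 and 7.
SCALING_CYCLE: List[int] = [32, 16, 8, 4, 2, 4, 8, 16, 32]

# Data volume (TB) stated in the step preceding each scaling run (assumption:
# each scaling run is labelled with that volume).
PHASE_TB: List[int] = [20, 40, 60]


def scaling_steps(cycle: List[int]) -> Iterator[Tuple[int, int]]:
    """Yield (from_nodes, to_nodes) for each consecutive pair in the cycle."""
    for src, dst in zip(cycle, cycle[1:]):
        yield src, dst


def build_plan() -> List[Tuple[int, int, int]]:
    """Return (data_tb, from_nodes, to_nodes) for every rebalance in the test."""
    return [
        (tb, src, dst)
        for tb in PHASE_TB
        for src, dst in scaling_steps(SCALING_CYCLE)
    ]


if __name__ == "__main__":
    for tb, src, dst in build_plan():
        # Per the issue title, the hang was seen at 60TB going from 16 to 8 nodes.
        marker = "  <- hang reported here" if (tb, src, dst) == (60, 16, 8) else ""
        print(f"{tb}TB: {src} -> {dst} nodes{marker}")
```

Each cycle contains eight rebalances, so the full plan covers 24 of them; the 16 to 8 scale-in named in the title is the second rebalance of a cycle.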

      2024-07-28T20:57:43.554+00:00 INFO CBAS.messaging.CCMessageBroker [Executor-3044:ClusterController] Received message: NCLifecycleTaskReportMessage{nodeId='4e1749497407faef3f4fcf70b46eed90', success=false, exception=java.lang.IllegalStateException: Couldn't find any checkpoints for resource: /var/cb-cache/@analytics/v_iodevice_1/storage/partition_73/Default/Default/remote_3gJ1l_volCollection_10_eenxt/0/remote_3gJ1l_volCollection_10_eenxt, localCounters=null, activePartitions=[64, 66,68, 70, 72, 73, 74, 75, 46, 47, 50, 52, 56, 58, 62, 63]}
      2024-07-28T20:57:43.554+00:00 ERRO CBAS.replication.NcLifecycleCoordinator [Executor-3044:ClusterController] Node 4e1749497407faef3f4fcf70b46eed90 failed to complete startup
      java.lang.IllegalStateException: Couldn't find any checkpoints for resource: /var/cb-cache/@analytics/v_iodevice_1/storage/partition_73/Default/Default/remote_3gJ1l_volCollection_10_eenxt/0/remote_3gJ1l_volCollection_10_eenxt
              at org.apache.asterix.app.nc.IndexCheckpointManager.getLatest(IndexCheckpointManager.java:164) ~[asterix-app.jar:1.0.0-2239]
              at com.couchbase.analytics.bootstrap.AnalyticsLocalRecoveryManager.getLatestCheckpoint(AnalyticsLocalRecoveryManager.java:222) ~[columnar-server.jar:1.0.0-2239]
              at com.couchbase.analytics.bootstrap.AnalyticsLocalRecoveryManager.recover(AnalyticsLocalRecoveryManager.java:132) ~[columnar-server.jar:1.0.0-2239]
              at com.couchbase.analytics.bootstrap.AnalyticsLocalRecoveryManager.cleanUp(AnalyticsLocalRecoveryManager.java:102) ~[columnar-server.jar:1.0.0-2239]
              at com.couchbase.analytics.bootstrap.AnalyticsLocalRecoveryManager.startLocalRecovery(AnalyticsLocalRecoveryManager.java:58) ~[columnar-server.jar:1.0.0-2239]
              at org.apache.asterix.app.nc.task.LocalRecoveryTask.perform(LocalRecoveryTask.java:45) ~[asterix-app.jar:1.0.0-2239]
              at org.apache.asterix.app.replication.message.RegistrationTasksResponseMessage.handle(RegistrationTasksResponseMessage.java:63) ~[asterix-app.jar:1.0.0-2239]
              at org.apache.asterix.messaging.NCMessageBroker.lambda$receivedMessage$0(NCMessageBroker.java:108) ~[asterix-app.jar:1.0.0-2239]
              at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539) ~[?:?]
              at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
              at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
              at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
              at java.base/java.lang.Thread.run(Thread.java:840) [?:?]
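
When triaging logs like the one above across many nodes, it can help to pull out the node ID, partition, and resource path of each missing-checkpoint failure. The following is a minimal sketch that assumes the message layout seen in this report; `parse_failure` and the two regex names are hypothetical helpers, not part of any Couchbase or AsterixDB tooling, and the sample line is trimmed after `localCounters=null`.

```python
import re
from typing import Dict, Optional

# Matches the resource path in the IllegalStateException message and captures
# the partition number embedded in it (layout assumed from the log above).
MISSING_CHECKPOINT = re.compile(
    r"Couldn't find any checkpoints for resource: "
    r"(?P<resource>\S+?/partition_(?P<partition>\d+)/\S+)"
)
# Matches the node ID field of NCLifecycleTaskReportMessage.
NODE_ID = re.compile(r"nodeId='(?P<node>[0-9a-f]+)'")


def parse_failure(line: str) -> Optional[Dict[str, object]]:
    """Return node, partition and resource for a missing-checkpoint line, else None."""
    match = MISSING_CHECKPOINT.search(line)
    if match is None:
        return None
    node = NODE_ID.search(line)
    return {
        "node": node.group("node") if node else None,
        "partition": int(match.group("partition")),
        # The path is followed by a comma inside the message; drop it.
        "resource": match.group("resource").rstrip(","),
    }


# Sample taken from the INFO line in this report, trimmed after localCounters=null.
SAMPLE = (
    "NCLifecycleTaskReportMessage{nodeId='4e1749497407faef3f4fcf70b46eed90', "
    "success=false, exception=java.lang.IllegalStateException: "
    "Couldn't find any checkpoints for resource: "
    "/var/cb-cache/@analytics/v_iodevice_1/storage/partition_73/Default/Default/"
    "remote_3gJ1l_volCollection_10_eenxt/0/remote_3gJ1l_volCollection_10_eenxt, "
    "localCounters=null"
)

if __name__ == "__main__":
    print(parse_failure(SAMPLE))
```

For the sample line this yields partition 73 on node `4e1749497407faef3f4fcf70b46eed90`, which matches the `activePartitions` list reported by that node.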
      


          People

            ritesh.agarwal Ritesh Agarwal