Details
-
Bug
-
Resolution: Fixed
-
Critical
-
Columnar 1.0.0
-
Columnar Edition 1.0.0 build 2160 (8-node 8vcpus+ 64 GB)
-
Untriaged
-
0
-
Unknown
-
Analytics Sprint 44
Description
There were a few rebalance failures while changing from 8 nodes to 4 nodes-
Rebalance failure 1 -
seen on 001-
[ns_server:info,2024-06-19T13:49:57.962Z,ns_1@svc-da-node-001.45yxxjbrevilttq.sandbox.nonprod-project-avengers.com:rebalance_agent<0.2650.0>:rebalance_agent:handle_down:290]Rebalancer process <0.11459.144> died (reason {service_rebalance_failed,cbas, |
{worker_died,
|
{'EXIT',<0.11594.144>, |
{task_failed,rebalance,
|
{service_error,
|
<<"Rebalance e5c3974ff50c75bed75df4d0d3e56c46 failed: see analytics_info.log for details">>}}}}}). |
[user:error,2024-06-19T13:49:57.963Z,ns_1@svc-da-node-001.45yxxjbrevilttq.sandbox.nonprod-project-avengers.com:<0.27980.141>:ns_orchestrator:log_rebalance_completion:1661]Rebalance exited with reason {service_rebalance_failed,cbas, |
{worker_died,
|
{'EXIT',<0.11594.144>, |
{task_failed,rebalance,
|
{service_error,
|
<<"Rebalance e5c3974ff50c75bed75df4d0d3e56c46 failed: see analytics_info.log for details">>}}}}}. |
On cursory glance, it looks like the rate limiting problems with S3 for which there is already an open bug
info.log -
2024-06-19T13:49:50.513+00:00 WARN CBAS.s3.S3BufferedWriter [Executor-640:2e82d992bbd54b127606f12bf67f4beb] Multipart upload for storage/partition_40/Database8IGIzJVVR/scope0CGHHDyxO/LinkedDatasetqDxMoDrxeV/0/LinkedDatasetqDxMoDrxeV/40_40_f was aborted |
2024-06-19T13:49:50.514+00:00 ERRO CBAS.impls.LSMHarness [Executor-640:2e82d992bbd54b127606f12bf67f4beb] FLUSH operation {"fileName": "40_40_b", "ioOpID": 530224416} failed on {"dir" : "/var/cb-cache/@analytics/v_iodevice_0/storage/partition_40/Database8IGIzJVVR/scope0CGHHDyxO/LinkedDatasetqDxMoDrxeV/0/LinkedDatasetqDxMoDrxeV", "memory" : [{"state":"READABLE_UNWRITABLE_FLUSHING", "writers":0, "readers":1, "pendingFlushes":0, "id":"[41,41]", "index":{"class": "BTree", "file": "storage/partition_40/Database8IGIzJVVR/scope0CGHHDyxO/LinkedDatasetqDxMoDrxeV/0/LinkedDatasetqDxMoDrxeV_virtual_0"}}, {"state":"INACTIVE", "writers":0, "readers":0, "pendingFlushes":0, "id":"null", "index":{"class": "BTree", "file": "storage/partition_40/Database8IGIzJVVR/scope0CGHHDyxO/LinkedDatasetqDxMoDrxeV/0/LinkedDatasetqDxMoDrxeV_virtual_1"}}], "disk" : 7, "num-scheduled-flushes":1, "current-memory-component":1} |
org.apache.hyracks.api.exceptions.HyracksDataException: software.amazon.awssdk.services.s3.model.S3Exception: Please reduce your request rate. (Service: S3, Status Code: 503, Request ID: FBD823W01X2Z9D1E, Extended Request ID: 9zreHpNDITTaEYFNMp7tCRg+ob9uRswEHGU+dkpD/7QsFu7hu3Fc0OtDqbVC9wCANuS643igF94=) |
at org.apache.hyracks.api.exceptions.HyracksDataException.create(HyracksDataException.java:49) ~[hyracks-api.jar:1.0.0-2160] |
at org.apache.asterix.cloud.clients.aws.s3.S3BufferedWriter.completeMultipartUpload(S3BufferedWriter.java:136) ~[asterix-cloud.jar:1.0.0-2160] |
Suppressed: software.amazon.awssdk.core.exception.SdkClientException: Request attempt 2 failure: Please reduce your request rate. (Service: S3, Status Code: 503, Request ID: FJY2KVN3WNDQX5XP, Extended Request ID: 4QXgFrb/MtYzTQfN2rro6AfdfUocGO5dmSMaWiq78NOE0NmKFbss0nOwfkX9iEl4GpqcXbwQjB4=) |
Suppressed: software.amazon.awssdk.core.exception.SdkClientException: Request attempt 3 failure: Please reduce your request rate. (Service: S3, Status Code: 503, Request ID: STK3BEMAVVHV4VC4, Extended Request ID: X1dmHtDXeIkt6BufSN5sovWuqHzur/QMeDFwsYxt/0IQXQ239CzAcHWWrFFLFr3Y0O9Rp4unZF4=) |
2024-06-19T13:49:50.666+00:00 ERRO CBAS.nc.HaltCallback [Executor-640:2e82d992bbd54b127606f12bf67f4beb] Operation {"fileName": "40_40_b", "ioOpID": 530224416} has failed |
org.apache.hyracks.api.exceptions.HyracksDataException: software.amazon.awssdk.services.s3.model.S3Exception: Please reduce your request rate. (Service: S3, Status Code: 503, Request ID: FBD823W01X2Z9D1E, Extended Request ID: 9zreHpNDITTaEYFNMp7tCRg+ob9uRswEHGU+dkpD/7QsFu7hu3Fc0OtDqbVC9wCANuS643igF94=) |
rebalance failure 2 -
[user:error,2024-06-19T13:45:57.285Z,ns_1@svc-da-node-001.45yxxjbrevilttq.sandbox.nonprod-project-avengers.com:<0.27980.141>:ns_orchestrator:log_rebalance_completion:1661]Rebalance exited with reason {service_rebalance_failed,cbas, |
{worker_died,
|
{'EXIT',<0.6489.144>, |
{task_failed,rebalance,
|
{service_error,
|
<<"Rebalance 8d6ddcabbd6a62cc073a572c08a97010 failed: see analytics_info.log for details">>}}}}}. |
info.log ->
2024-06-19T13:44:32.359+00:00 INFO CBAS.messaging.NCMessageBroker [Worker:2e82d992bbd54b127606f12bf67f4beb] Received message: ActiveManagerMessage{kind=STOP_ACTIVITY, runtimeId=(Default.KafkaLinkbXxEQJMNVS.b-3-public.qekafkatestcluster.7b9vtv.c13.kafka.us-east-1.amazonaws.com:9196,b-2-public.qekafkatestcluster.7b9vtv.c13.kafka.us-east-1.amazonaws.com:9196,b-1-public.qekafkatestcluster.7b9vtv.c13.kafka.us-east-1.amazonaws.com:9196(CB))[7]:TO} |
2024-06-19T13:44:32.359+00:00 INFO CBAS.messaging.NCMessageBroker [Worker:2e82d992bbd54b127606f12bf67f4beb] Received message: ActiveManagerMessage{kind=STOP_ACTIVITY, runtimeId=(Default.KafkaLinkbXxEQJMNVS.b-3-public.qekafkatestcluster.7b9vtv.c13.kafka.us-east-1.amazonaws.com:9196,b-2-public.qekafkatestcluster.7b9vtv.c13.kafka.us-east-1.amazonaws.com:9196,b-1-public.qekafkatestcluster.7b9vtv.c13.kafka.us-east-1.amazonaws.com:9196(CB))[16]:TO} |
2024-06-19T13:44:32.359+00:00 INFO CBAS.runtime.TopicOperatorNodePushable [Executor-1751:2e82d992bbd54b127606f12bf67f4beb] (Default.KafkaLinkbXxEQJMNVS.b-3-public.qekafkatestcluster.7b9vtv.c13.kafka.us-east-1.amazonaws.com:9196,b-2-public.qekafkatestcluster.7b9vtv.c13.kafka.us-east-1.amazonaws.com:9196,b-1-public.qekafkatestcluster.7b9vtv.c13.kafka.us-east-1.amazonaws.com:9196(CB))[7]:TO aborting |
2024-06-19T13:44:32.359+00:00 INFO CBAS.messaging.NCMessageBroker [Worker:2e82d992bbd54b127606f12bf67f4beb] Received message: ActiveManagerMessage{kind=STOP_ACTIVITY, runtimeId=(Default.KafkaLinkbXxEQJMNVS.b-3-public.qekafkatestcluster.7b9vtv.c13.kafka.us-east-1.amazonaws.com:9196,b-2-public.qekafkatestcluster.7b9vtv.c13.kafka.us-east-1.amazonaws.com:9196,b-1-public.qekafkatestcluster.7b9vtv.c13.kafka.us-east-1.amazonaws.com:9196(CB))[21]:TO} |
2024-06-19T13:44:32.359+00:00 INFO CBAS.runtime.TopicOperatorNodePushable [Executor-1758:2e82d992bbd54b127606f12bf67f4beb] (Default.KafkaLinkbXxEQJMNVS.b-3-public.qekafkatestcluster.7b9vtv.c13.kafka.us-east-1.amazonaws.com:9196,b-2-public.qekafkatestcluster.7b9vtv.c13.kafka.us-east-1.amazonaws.com:9196,b-1-public.qekafkatestcluster.7b9vtv.c13.kafka.us-east-1.amazonaws.com:9196(CB))[16]:TO aborting |
2024-06-19T13:44:32.360+00:00 INFO CBAS.messaging.NCMessageBroker [Worker:2e82d992bbd54b127606f12bf67f4beb] Received message: ActiveManagerMessage{kind=STOP_ACTIVITY, runtimeId=(Default.KafkaLinkbXxEQJMNVS.b-3-public.qekafkatestcluster.7b9vtv.c13.kafka.us-east-1.amazonaws.com:9196,b-2-public.qekafkatestcluster.7b9vtv.c13.kafka.us-east-1.amazonaws.com:9196,b-1-public.qekafkatestcluster.7b9vtv.c13.kafka.us-east-1.amazonaws.com:9196(CB))[23]:TO} |
2024-06-19T13:44:32.360+00:00 INFO CBAS.messaging.NCMessageBroker [Worker:2e82d992bbd54b127606f12bf67f4beb] Received message: ActiveManagerMessage{kind=STOP_ACTIVITY, runtimeId=(Default.KafkaLinkbXxEQJMNVS.b-3-public.qekafkatestcluster.7b9vtv.c13.kafka.us-east-1.amazonaws.com:9196,b-2-public.qekafkatestcluster.7b9vtv.c13.kafka.us-east-1.amazonaws.com:9196,b-1-public.qekafkatestcluster.7b9vtv.c13.kafka.us-east-1.amazonaws.com:9196(CB))[24]:TO} |
2024-06-19T13:44:32.360+00:00 INFO CBAS.messaging.NCMessageBroker [Worker:2e82d992bbd54b127606f12bf67f4beb] Received message: ActiveManagerMessage{kind=STOP_ACTIVITY, runtimeId=(Default.KafkaLinkbXxEQJMNVS.b-3-public.qekafkatestcluster.7b9vtv.c13.kafka.us-east-1.amazonaws.com:9196,b-2-public.qekafkatestcluster.7b9vtv.c13.kafka.us-east-1.amazonaws.com:9196,b-1-public.qekafkatestcluster.7b9vtv.c13.kafka.us-east-1.amazonaws.com:9196(CB))[29]:TO} |
2024-06-19T13:44:32.360+00:00 INFO CBAS.adapter.TopicRecordReader [Executor-1681:2e82d992bbd54b127606f12bf67f4beb] Error while polling data from Kafka |
org.apache.kafka.common.errors.InterruptException: java.lang.InterruptedException
|
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.maybeThrowInterruptException(ConsumerNetworkClient.java:535) ~[kafka-clients-3.6.0.jar:?] |
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:296) ~[kafka-clients-3.6.0.jar:?] |
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:251) ~[kafka-clients-3.6.0.jar:?] |
at org.apache.kafka.clients.consumer.KafkaConsumer.pollForFetches(KafkaConsumer.java:1255) ~[kafka-clients-3.6.0.jar:?] |
at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1186) ~[kafka-clients-3.6.0.jar:?] |
at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1159) ~[kafka-clients-3.6.0.jar:?] |
at com.couchbase.analytics.adapter.TopicRecordReader.dataPoll(TopicRecordReader.java:365) [columnar-connector.jar:1.0.0-2160] |
at com.couchbase.analytics.adapter.TopicRecordReader$KafkaTopicDataPoller.run(TopicRecordReader.java:393) [columnar-connector.jar:1.0.0-2160] |
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539) [?:?] |
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?] |
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?] |
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?] |
at java.base/java.lang.Thread.run(Thread.java:840) [?:?] |
Caused by: java.lang.InterruptedException
|
... 13 more |
2024-06-19T13:44:32.360+00:00 INFO CBAS.adapter.TopicRecordReader [Executor-1641:2e82d992bbd54b127606f12bf67f4beb] Error while polling data from Kafka |
org.apache.kafka.common.errors.InterruptException: java.lang.InterruptedException
|
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.maybeThrowInterruptException(ConsumerNetworkClient.java:535) ~[kafka-clients-3.6.0.jar:?] |
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:296) ~[kafka-clients-3.6.0.jar:?] |
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:251) ~[kafka-clients-3.6.0.jar:?] |
at org.apache.kafka.clients.consumer.KafkaConsumer.pollForFetches(KafkaConsumer.java:1255) ~[kafka-clients-3.6.0.jar:?] |
at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1186) ~[kafka-clients-3.6.0.jar:?] |
at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1159) ~[kafka-clients-3.6.0.jar:?] |
at com.couchbase.analytics.adapter.TopicRecordReader.dataPoll(TopicRecordReader.java:365) [columnar-connector.jar:1.0.0-2160] |
at com.couchbase.analytics.adapter.TopicRecordReader$KafkaTopicDataPoller.run(TopicRecordReader.java:393) [columnar-connector.jar:1.0.0-2160] |
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539) [?:?] |
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?] |
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?] |
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?] |
at java.base/java.lang.Thread.run(Thread.java:840) [?:?] |
Caused by: java.lang.InterruptedException
|
... 13 more |
cbcollect ->
https://cb-engineering.s3.amazonaws.com/SysTestColumnar18Jun/collectinfo-2024-06-19T135257-ns_1%40svc-da-node-001.45yxxjbrevilttq.sandbox.nonprod-project-avengers.com.zip
https://cb-engineering.s3.amazonaws.com/SysTestColumnar18Jun/collectinfo-2024-06-19T135257-ns_1%40svc-da-node-003.45yxxjbrevilttq.sandbox.nonprod-project-avengers.com.zip
https://cb-engineering.s3.amazonaws.com/SysTestColumnar18Jun/collectinfo-2024-06-19T135257-ns_1%40svc-da-node-005.45yxxjbrevilttq.sandbox.nonprod-project-avengers.com.zip
https://cb-engineering.s3.amazonaws.com/SysTestColumnar18Jun/collectinfo-2024-06-19T135257-ns_1%40svc-da-node-007.45yxxjbrevilttq.sandbox.nonprod-project-avengers.com.zip
https://cb-engineering.s3.amazonaws.com/SysTestColumnar18Jun/collectinfo-2024-06-19T135257-ns_1%40svc-da-node-008.45yxxjbrevilttq.sandbox.nonprod-project-avengers.com.zip