Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-62415

[System Test] Rebalance failures during scale operations

    XMLWordPrintable

Details

    • Bug
    • Resolution: Fixed
    • Critical
    • Columnar 1.0.0
    • Columnar 1.0.0
    • analytics
    • Columnar Edition 1.0.0 build 2160 (8-node 8vcpus+ 64 GB)
    • Untriaged
    • 0
    • Unknown
    • Analytics Sprint 44

    Description

      There were a few rebalance failures while changing from 8 nodes to 4 nodes-

      Rebalance failure 1 -

      seen on 001-

      [ns_server:info,2024-06-19T13:49:57.962Z,ns_1@svc-da-node-001.45yxxjbrevilttq.sandbox.nonprod-project-avengers.com:rebalance_agent<0.2650.0>:rebalance_agent:handle_down:290]Rebalancer process <0.11459.144> died (reason {service_rebalance_failed,cbas,
                                                     {worker_died,
                                                      {'EXIT',<0.11594.144>,
                                                       {task_failed,rebalance,
                                                        {service_error,
                                                         <<"Rebalance e5c3974ff50c75bed75df4d0d3e56c46 failed: see analytics_info.log for details">>}}}}}).
      [user:error,2024-06-19T13:49:57.963Z,ns_1@svc-da-node-001.45yxxjbrevilttq.sandbox.nonprod-project-avengers.com:<0.27980.141>:ns_orchestrator:log_rebalance_completion:1661]Rebalance exited with reason {service_rebalance_failed,cbas,
                                    {worker_died,
                                     {'EXIT',<0.11594.144>,
                                      {task_failed,rebalance,
                                       {service_error,
                                        <<"Rebalance e5c3974ff50c75bed75df4d0d3e56c46 failed: see analytics_info.log for details">>}}}}}.
      

      On cursory glance, it looks like the rate limiting problems with S3 for which there is already an open bug
      info.log -

      2024-06-19T13:49:50.513+00:00 WARN CBAS.s3.S3BufferedWriter [Executor-640:2e82d992bbd54b127606f12bf67f4beb] Multipart upload for storage/partition_40/Database8IGIzJVVR/scope0CGHHDyxO/LinkedDatasetqDxMoDrxeV/0/LinkedDatasetqDxMoDrxeV/40_40_f was aborted
      2024-06-19T13:49:50.514+00:00 ERRO CBAS.impls.LSMHarness [Executor-640:2e82d992bbd54b127606f12bf67f4beb] FLUSH operation {"fileName": "40_40_b", "ioOpID": 530224416} failed on {"dir" : "/var/cb-cache/@analytics/v_iodevice_0/storage/partition_40/Database8IGIzJVVR/scope0CGHHDyxO/LinkedDatasetqDxMoDrxeV/0/LinkedDatasetqDxMoDrxeV", "memory" : [{"state":"READABLE_UNWRITABLE_FLUSHING", "writers":0, "readers":1, "pendingFlushes":0, "id":"[41,41]", "index":{"class": "BTree", "file": "storage/partition_40/Database8IGIzJVVR/scope0CGHHDyxO/LinkedDatasetqDxMoDrxeV/0/LinkedDatasetqDxMoDrxeV_virtual_0"}}, {"state":"INACTIVE", "writers":0, "readers":0, "pendingFlushes":0, "id":"null", "index":{"class": "BTree", "file": "storage/partition_40/Database8IGIzJVVR/scope0CGHHDyxO/LinkedDatasetqDxMoDrxeV/0/LinkedDatasetqDxMoDrxeV_virtual_1"}}], "disk" : 7, "num-scheduled-flushes":1, "current-memory-component":1}
      org.apache.hyracks.api.exceptions.HyracksDataException: software.amazon.awssdk.services.s3.model.S3Exception: Please reduce your request rate. (Service: S3, Status Code: 503, Request ID: FBD823W01X2Z9D1E, Extended Request ID: 9zreHpNDITTaEYFNMp7tCRg+ob9uRswEHGU+dkpD/7QsFu7hu3Fc0OtDqbVC9wCANuS643igF94=)
      	at org.apache.hyracks.api.exceptions.HyracksDataException.create(HyracksDataException.java:49) ~[hyracks-api.jar:1.0.0-2160]
      	at org.apache.asterix.cloud.clients.aws.s3.S3BufferedWriter.completeMultipartUpload(S3BufferedWriter.java:136) ~[asterix-cloud.jar:1.0.0-2160]
      	Suppressed: software.amazon.awssdk.core.exception.SdkClientException: Request attempt 2 failure: Please reduce your request rate. (Service: S3, Status Code: 503, Request ID: FJY2KVN3WNDQX5XP, Extended Request ID: 4QXgFrb/MtYzTQfN2rro6AfdfUocGO5dmSMaWiq78NOE0NmKFbss0nOwfkX9iEl4GpqcXbwQjB4=)
      	Suppressed: software.amazon.awssdk.core.exception.SdkClientException: Request attempt 3 failure: Please reduce your request rate. (Service: S3, Status Code: 503, Request ID: STK3BEMAVVHV4VC4, Extended Request ID: X1dmHtDXeIkt6BufSN5sovWuqHzur/QMeDFwsYxt/0IQXQ239CzAcHWWrFFLFr3Y0O9Rp4unZF4=)
      2024-06-19T13:49:50.666+00:00 ERRO CBAS.nc.HaltCallback [Executor-640:2e82d992bbd54b127606f12bf67f4beb] Operation {"fileName": "40_40_b", "ioOpID": 530224416} has failed
      org.apache.hyracks.api.exceptions.HyracksDataException: software.amazon.awssdk.services.s3.model.S3Exception: Please reduce your request rate. (Service: S3, Status Code: 503, Request ID: FBD823W01X2Z9D1E, Extended Request ID: 9zreHpNDITTaEYFNMp7tCRg+ob9uRswEHGU+dkpD/7QsFu7hu3Fc0OtDqbVC9wCANuS643igF94=)
      

      rebalance failure 2 -

      [user:error,2024-06-19T13:45:57.285Z,ns_1@svc-da-node-001.45yxxjbrevilttq.sandbox.nonprod-project-avengers.com:<0.27980.141>:ns_orchestrator:log_rebalance_completion:1661]Rebalance exited with reason {service_rebalance_failed,cbas,
                                    {worker_died,
                                     {'EXIT',<0.6489.144>,
                                      {task_failed,rebalance,
                                       {service_error,
                                        <<"Rebalance 8d6ddcabbd6a62cc073a572c08a97010 failed: see analytics_info.log for details">>}}}}}.
      

      info.log ->

      2024-06-19T13:44:32.359+00:00 INFO CBAS.messaging.NCMessageBroker [Worker:2e82d992bbd54b127606f12bf67f4beb] Received message: ActiveManagerMessage{kind=STOP_ACTIVITY, runtimeId=(Default.KafkaLinkbXxEQJMNVS.b-3-public.qekafkatestcluster.7b9vtv.c13.kafka.us-east-1.amazonaws.com:9196,b-2-public.qekafkatestcluster.7b9vtv.c13.kafka.us-east-1.amazonaws.com:9196,b-1-public.qekafkatestcluster.7b9vtv.c13.kafka.us-east-1.amazonaws.com:9196(CB))[7]:TO}
      2024-06-19T13:44:32.359+00:00 INFO CBAS.messaging.NCMessageBroker [Worker:2e82d992bbd54b127606f12bf67f4beb] Received message: ActiveManagerMessage{kind=STOP_ACTIVITY, runtimeId=(Default.KafkaLinkbXxEQJMNVS.b-3-public.qekafkatestcluster.7b9vtv.c13.kafka.us-east-1.amazonaws.com:9196,b-2-public.qekafkatestcluster.7b9vtv.c13.kafka.us-east-1.amazonaws.com:9196,b-1-public.qekafkatestcluster.7b9vtv.c13.kafka.us-east-1.amazonaws.com:9196(CB))[16]:TO}
      2024-06-19T13:44:32.359+00:00 INFO CBAS.runtime.TopicOperatorNodePushable [Executor-1751:2e82d992bbd54b127606f12bf67f4beb] (Default.KafkaLinkbXxEQJMNVS.b-3-public.qekafkatestcluster.7b9vtv.c13.kafka.us-east-1.amazonaws.com:9196,b-2-public.qekafkatestcluster.7b9vtv.c13.kafka.us-east-1.amazonaws.com:9196,b-1-public.qekafkatestcluster.7b9vtv.c13.kafka.us-east-1.amazonaws.com:9196(CB))[7]:TO aborting
      2024-06-19T13:44:32.359+00:00 INFO CBAS.messaging.NCMessageBroker [Worker:2e82d992bbd54b127606f12bf67f4beb] Received message: ActiveManagerMessage{kind=STOP_ACTIVITY, runtimeId=(Default.KafkaLinkbXxEQJMNVS.b-3-public.qekafkatestcluster.7b9vtv.c13.kafka.us-east-1.amazonaws.com:9196,b-2-public.qekafkatestcluster.7b9vtv.c13.kafka.us-east-1.amazonaws.com:9196,b-1-public.qekafkatestcluster.7b9vtv.c13.kafka.us-east-1.amazonaws.com:9196(CB))[21]:TO}
      2024-06-19T13:44:32.359+00:00 INFO CBAS.runtime.TopicOperatorNodePushable [Executor-1758:2e82d992bbd54b127606f12bf67f4beb] (Default.KafkaLinkbXxEQJMNVS.b-3-public.qekafkatestcluster.7b9vtv.c13.kafka.us-east-1.amazonaws.com:9196,b-2-public.qekafkatestcluster.7b9vtv.c13.kafka.us-east-1.amazonaws.com:9196,b-1-public.qekafkatestcluster.7b9vtv.c13.kafka.us-east-1.amazonaws.com:9196(CB))[16]:TO aborting
      2024-06-19T13:44:32.360+00:00 INFO CBAS.messaging.NCMessageBroker [Worker:2e82d992bbd54b127606f12bf67f4beb] Received message: ActiveManagerMessage{kind=STOP_ACTIVITY, runtimeId=(Default.KafkaLinkbXxEQJMNVS.b-3-public.qekafkatestcluster.7b9vtv.c13.kafka.us-east-1.amazonaws.com:9196,b-2-public.qekafkatestcluster.7b9vtv.c13.kafka.us-east-1.amazonaws.com:9196,b-1-public.qekafkatestcluster.7b9vtv.c13.kafka.us-east-1.amazonaws.com:9196(CB))[23]:TO}
      2024-06-19T13:44:32.360+00:00 INFO CBAS.messaging.NCMessageBroker [Worker:2e82d992bbd54b127606f12bf67f4beb] Received message: ActiveManagerMessage{kind=STOP_ACTIVITY, runtimeId=(Default.KafkaLinkbXxEQJMNVS.b-3-public.qekafkatestcluster.7b9vtv.c13.kafka.us-east-1.amazonaws.com:9196,b-2-public.qekafkatestcluster.7b9vtv.c13.kafka.us-east-1.amazonaws.com:9196,b-1-public.qekafkatestcluster.7b9vtv.c13.kafka.us-east-1.amazonaws.com:9196(CB))[24]:TO}
      2024-06-19T13:44:32.360+00:00 INFO CBAS.messaging.NCMessageBroker [Worker:2e82d992bbd54b127606f12bf67f4beb] Received message: ActiveManagerMessage{kind=STOP_ACTIVITY, runtimeId=(Default.KafkaLinkbXxEQJMNVS.b-3-public.qekafkatestcluster.7b9vtv.c13.kafka.us-east-1.amazonaws.com:9196,b-2-public.qekafkatestcluster.7b9vtv.c13.kafka.us-east-1.amazonaws.com:9196,b-1-public.qekafkatestcluster.7b9vtv.c13.kafka.us-east-1.amazonaws.com:9196(CB))[29]:TO}
      2024-06-19T13:44:32.360+00:00 INFO CBAS.adapter.TopicRecordReader [Executor-1681:2e82d992bbd54b127606f12bf67f4beb] Error while polling data from Kafka
      org.apache.kafka.common.errors.InterruptException: java.lang.InterruptedException
      	at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.maybeThrowInterruptException(ConsumerNetworkClient.java:535) ~[kafka-clients-3.6.0.jar:?]
      	at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:296) ~[kafka-clients-3.6.0.jar:?]
      	at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:251) ~[kafka-clients-3.6.0.jar:?]
      	at org.apache.kafka.clients.consumer.KafkaConsumer.pollForFetches(KafkaConsumer.java:1255) ~[kafka-clients-3.6.0.jar:?]
      	at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1186) ~[kafka-clients-3.6.0.jar:?]
      	at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1159) ~[kafka-clients-3.6.0.jar:?]
      	at com.couchbase.analytics.adapter.TopicRecordReader.dataPoll(TopicRecordReader.java:365) [columnar-connector.jar:1.0.0-2160]
      	at com.couchbase.analytics.adapter.TopicRecordReader$KafkaTopicDataPoller.run(TopicRecordReader.java:393) [columnar-connector.jar:1.0.0-2160]
      	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539) [?:?]
      	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
      	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
      	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
      	at java.base/java.lang.Thread.run(Thread.java:840) [?:?]
      Caused by: java.lang.InterruptedException
      	... 13 more
      2024-06-19T13:44:32.360+00:00 INFO CBAS.adapter.TopicRecordReader [Executor-1641:2e82d992bbd54b127606f12bf67f4beb] Error while polling data from Kafka
      org.apache.kafka.common.errors.InterruptException: java.lang.InterruptedException
      	at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.maybeThrowInterruptException(ConsumerNetworkClient.java:535) ~[kafka-clients-3.6.0.jar:?]
      	at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:296) ~[kafka-clients-3.6.0.jar:?]
      	at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:251) ~[kafka-clients-3.6.0.jar:?]
      	at org.apache.kafka.clients.consumer.KafkaConsumer.pollForFetches(KafkaConsumer.java:1255) ~[kafka-clients-3.6.0.jar:?]
      	at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1186) ~[kafka-clients-3.6.0.jar:?]
      	at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1159) ~[kafka-clients-3.6.0.jar:?]
      	at com.couchbase.analytics.adapter.TopicRecordReader.dataPoll(TopicRecordReader.java:365) [columnar-connector.jar:1.0.0-2160]
      	at com.couchbase.analytics.adapter.TopicRecordReader$KafkaTopicDataPoller.run(TopicRecordReader.java:393) [columnar-connector.jar:1.0.0-2160]
      	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539) [?:?]
      	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
      	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
      	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
      	at java.base/java.lang.Thread.run(Thread.java:840) [?:?]
      Caused by: java.lang.InterruptedException
      	... 13 more
      

      cbcollect ->

      https://cb-engineering.s3.amazonaws.com/SysTestColumnar18Jun/collectinfo-2024-06-19T135257-ns_1%40svc-da-node-001.45yxxjbrevilttq.sandbox.nonprod-project-avengers.com.zip
      https://cb-engineering.s3.amazonaws.com/SysTestColumnar18Jun/collectinfo-2024-06-19T135257-ns_1%40svc-da-node-003.45yxxjbrevilttq.sandbox.nonprod-project-avengers.com.zip
      https://cb-engineering.s3.amazonaws.com/SysTestColumnar18Jun/collectinfo-2024-06-19T135257-ns_1%40svc-da-node-005.45yxxjbrevilttq.sandbox.nonprod-project-avengers.com.zip
      https://cb-engineering.s3.amazonaws.com/SysTestColumnar18Jun/collectinfo-2024-06-19T135257-ns_1%40svc-da-node-007.45yxxjbrevilttq.sandbox.nonprod-project-avengers.com.zip
      https://cb-engineering.s3.amazonaws.com/SysTestColumnar18Jun/collectinfo-2024-06-19T135257-ns_1%40svc-da-node-008.45yxxjbrevilttq.sandbox.nonprod-project-avengers.com.zip

      Attachments

        No reviews matched the request. Check your Options in the drop-down menu of this sections header.

        Activity

          People

            pavan.pb Pavan PB
            pavan.pb Pavan PB
            Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Gerrit Reviews

                There are no open Gerrit changes

                PagerDuty