Couchbase Server / MB-52379

System Test - org.apache.asterix.common.exceptions.RuntimeDataException is observed


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Not a Bug
    • Affects Version: 7.1.1
    • Fix Version: None
    • Component: analytics
    • Environment: 7.1.1-3094

    Description

      The following errors were observed on one of the CBAS nodes during a system test run:

      2022-05-27 17:42:32,054 - systestmon - WARNING - *** 2 occurences of ASX keyword found on 172.23.106.188 ***
      2022-05-27 17:42:32,054 - systestmon - DEBUG - Caused by: org.apache.asterix.common.exceptions.RuntimeDataException: ASX0023: 600s passed before getting back the responses from NCs
      2022-05-27 17:42:32,054 - systestmon - DEBUG - Caused by: org.apache.asterix.common.exceptions.RuntimeDataException: ASX0023: 600s passed before getting back the responses from NCs
      

      Nodes with cbas service : [u'172.23.104.70', u'172.23.105.168', u'172.23.106.188']

      Attachments


        Activity

          murtadha.hubail Murtadha Hubail added a comment - - edited

          Those error messages correspond to a 600-second timeout while waiting for all Analytics nodes to send their latest DCP state for each collection to the CC, which the CC uses to determine the DCP stream starting point.
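
          A minimal sketch of what that wait looks like conceptually (hypothetical Java; class and method names here are assumptions, not the actual CouchbaseConnectorFactory code referenced in the stack trace below):

          import java.util.List;
          import java.util.concurrent.CompletableFuture;
          import java.util.concurrent.TimeUnit;
          import java.util.concurrent.TimeoutException;

          public class ShadowStateWaitSketch {
              // Hypothetical illustration only: block until every Analytics node (NC) has
              // reported its DCP (shadow) state, or give up after 600 seconds.
              private static final long TIMEOUT_SECONDS = 600;

              static void waitForAllNodeResponses(List<CompletableFuture<Void>> nodeResponses) throws Exception {
                  CompletableFuture<Void> all =
                          CompletableFuture.allOf(nodeResponses.toArray(new CompletableFuture[0]));
                  try {
                      all.get(TIMEOUT_SECONDS, TimeUnit.SECONDS);
                  } catch (TimeoutException e) {
                      // This is the condition reported to the user as
                      // "ASX0023: 600s passed before getting back the responses from NCs".
                      throw new Exception("ASX0023: " + TIMEOUT_SECONDS
                              + "s passed before getting back the responses from NCs", e);
                  }
              }
          }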

          When the timeout happened, these were the Analytics nodes in the cluster:
          172.23.105.168
          172.23.105.188
          172.23.99.11

          The link was suspended to drop a dataset and then resumed; that is when we asked the nodes for their DCP state:

          2022-05-27T16:12:53.601-07:00 INFO CBAS.adapter.CouchbaseConnectorFactory [default (local) aggregated shadow state calculator] beginning starting state calculation for [dv_13.ds_47, dv_15.ds_57, dv_24.dv_24.ds_34, dv_22.ds_41, dv_17.ds_33, dv_23.ds_51, dv_20.dv_20.ds_56, dv_22.ds_39, dv_18.ds_53, dv_19.dv_19.ds_58]
          

          I noticed from the logs that node 172.23.99.11 was taking an unusually long time to respond to these DCP state requests; in one instance it took over 2 minutes:

          2022-05-27T16:14:50.594-07:00 INFO CBAS.messaging.CCMessageBroker [Executor-1290:ClusterController] Received message: {"class":"ShadowStatesResponse","max":["0:191223181134580,492327,492327","1:272905070342304,576575,576575","2:210006169690493,409683,40968...,"mins":{"dv_23":{"ds_51":["0:191223181134580,492327,492327","1:272905070342304,576575,576575","2:210006169690493,409683,40968...}}}
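
          (As a rough check: diffing the start of the state calculation at 16:12:53 against this response at 16:14:50 gives roughly two minutes. A trivial, hypothetical helper for doing that from the cbas log timestamps:)

          import java.time.Duration;
          import java.time.OffsetDateTime;

          public class ResponseLag {
              // Hypothetical helper: measure how long a node took to answer a DCP state
              // request by diffing the request/response timestamps from the cbas logs above.
              public static void main(String[] args) {
                  OffsetDateTime requested = OffsetDateTime.parse("2022-05-27T16:12:53.601-07:00");
                  OffsetDateTime answered  = OffsetDateTime.parse("2022-05-27T16:14:50.594-07:00");
                  Duration lag = Duration.between(requested, answered);
                  System.out.println("Response lag: " + lag.toSeconds() + "s"); // ~117s for these two log lines
              }
          }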
          

          600s passed and some DCP state requests were still not answered; as a result, the connect attempt failed:

          2022-05-27T16:22:54.102-07:00 INFO CBAS.adapter.CouchbaseConnectorFactory [default (local) aggregated shadow state calculator] starting state calculation failed for [dv_13.ds_47, dv_15.ds_57, dv_24.dv_24.ds_34, dv_22.ds_41, dv_17.ds_33, dv_23.ds_51, dv_20.dv_20.ds_56, dv_22.ds_39, dv_18.ds_53, dv_19.dv_19.ds_58] (listener {"class":"BucketEventsListener","entityId":"dv_17/Local/default","state":"RUNNING"})
          org.apache.asterix.common.exceptions.RuntimeDataException: ASX0023: 600s passed before getting back the responses from NCs
          	at com.couchbase.analytics.adapter.CouchbaseConnectorFactory.waitForShadowStateFutures(CouchbaseConnectorFactory.java:545) ~[cbas-connector-7.1.1-3094.jar:7.1.1-3094]
          	at com.couchbase.analytics.adapter.CouchbaseConnectorFactory.calculateStartingPoint(CouchbaseConnectorFactory.java:483) ~[cbas-connector-7.1.1-3094.jar:7.1.1-3094]
          	at com.couchbase.analytics.adapter.CouchbaseConnectorFactory.lambda$startCalculateStartingPointTask$3(CouchbaseConnectorFactory.java:468) ~[cbas-connector-7.1.1-3094.jar:7.1.1-3094]
          	at com.couchbase.analytics.util.BlockingUnwindFuture.lambda$submit$1(BlockingUnwindFuture.java:50) ~[cbas-common-7.1.1-3094.jar:7.1.1-3094]
          	at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
          	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
          	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
          	at java.lang.Thread.run(Thread.java:829) [?:?]
          

          I checked the logs for the nodes 172.23.105.188 and 172.23.105.168, and it seems that they processed the requests in a timely fashion. Unfortunately, I don't have the logs for 172.23.99.11 to check why it was taking longer than usual to respond. In any case, the system will attempt to connect the DCP stream again automatically.
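
          That automatic retry could be as simple as the following sketch (hypothetical; the actual Analytics reconnect logic and its backoff policy are assumptions here, not shown in the logs):

          import java.util.concurrent.TimeUnit;

          public class ReconnectSketch {
              // Hypothetical illustration of "the system will attempt to connect the DCP
              // stream again automatically": retry the starting-point calculation / connect
              // whenever it fails with a transient error such as the ASX0023 timeout.
              static void connectWithRetry(Runnable calculateStartingPointAndConnect) throws InterruptedException {
                  while (true) {
                      try {
                          calculateStartingPointAndConnect.run();
                          return; // connected successfully
                      } catch (RuntimeException transientFailure) {
                          // e.g. ASX0023: 600s passed before getting back the responses from NCs
                          TimeUnit.SECONDS.sleep(5); // assumed back-off before the next attempt
                      }
                  }
              }
          }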

          Umang, if you have the logs for node 172.23.99.11, please share them. Even though all of this works as designed, I would like to check why it is taking longer than 10 minutes to respond to a DCP state request and ensure that we don't have an issue.

          umang.agrawal Umang added a comment -

          Attaching logs for node 172.23.99.11


          murtadha.hubail Murtadha Hubail added a comment -

          Umang, the attached logs for node 172.23.99.11 end on 2022-05-26T16:39:23, while the issue happened on 2022-05-27T16:22:54. Also, the node id in the attached logs is 6f9fc7a5e0abac157b29cc9dc067f25d, and the node id when the issue happened was 645c17ebe599451444b7f95b2fe18f04. Please reopen the issue if you find the logs with the node id 645c17ebe599451444b7f95b2fe18f04.


          People

            Assignee: umang.agrawal Umang
            Reporter: umang.agrawal Umang
            Votes: 0
            Watchers: 3
