Details
-
Bug
-
Resolution: Fixed
-
Major
-
Columnar 1.0.0
-
1.0.0-2216
-
Untriaged
-
0
-
Unknown
-
Analytics Sprint 46
Description
The workload is as follows:
Type | Number of collections | Items per collection (millions) | Total items (millions) |
---|---|---|---|
Remote | 80 | 75 | 6000 |
Standalone | 50 | 8 | 4000* |
Kafka | 30 | 33.5 | ~1000 |
The change from previous runs is an increase in the number of Kafka collections.
*Some standalone collections have 8 million items and others have multiples of 8 million. The total document count is 4000 million (4 billion) items.
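As a quick cross-check of the table, the totals can be recomputed from the per-collection figures. This is only a minimal sketch: the dictionary layout and variable names are illustrative, and the standalone total is taken from the footnote rather than computed, since those collection sizes vary.

```python
# Workload figures copied from the table above (items in millions).
workload = {
    "Remote": {"collections": 80, "items_per_collection": 75},
    "Kafka": {"collections": 30, "items_per_collection": 33.5},
}

# Standalone collections vary in size (multiples of 8 million), so the
# reported total of 4000 million is used directly instead of 50 * 8.
STANDALONE_TOTAL = 4000

totals = {
    name: w["collections"] * w["items_per_collection"]
    for name, w in workload.items()
}
totals["Standalone"] = STANDALONE_TOTAL

grand_total = sum(totals.values())
for name, total in totals.items():
    print(f"{name}: {total:g} million items")
print(f"All collections: ~{grand_total / 1000:.1f} billion items")
```

The Kafka product comes to 1005 million, which is consistent with the ~1000 shown in the table.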
Number of links = 6 (2 remote + 2 external + 2 Kafka). 1 remote link and 1 Kafka link are active.
After the scale-up operation (from 8 to 16 nodes), many S3 rate-limiting messages appeared in the logs.
Scaling was completed at 2024-07-18T11:14:40.146Z
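The completion time above is the `completedTime` of the analytics stage in the rebalance report that follows. The `timeTaken` values in that report can be cross-checked against its start and completion timestamps. A small sketch, with the timestamps copied from the report and a helper name (`elapsed_ms`) that is purely illustrative:

```python
from datetime import datetime, timedelta

def elapsed_ms(start: str, end: str) -> int:
    """Return whole milliseconds between two UTC timestamps ending in 'Z'."""
    fmt = "%Y-%m-%dT%H:%M:%S.%fZ"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    # Integer division by one millisecond avoids floating-point rounding.
    return delta // timedelta(milliseconds=1)

# Timestamps copied from the rebalance report.
overall = elapsed_ms("2024-07-18T10:35:08.525Z", "2024-07-18T11:14:40.172Z")
analytics = elapsed_ms("2024-07-18T10:35:08.753Z", "2024-07-18T11:14:40.146Z")
data = elapsed_ms("2024-07-18T10:35:08.527Z", "2024-07-18T10:35:08.753Z")

print(f"overall:   {overall} ms ({overall / 60000:.1f} minutes)")
print(f"analytics: {analytics} ms")
print(f"data:      {data} ms")
```

The computed values agree with the reported `timeTaken` fields (2371647, 2371393 and 226 ms), so nearly all of the roughly 39.5 minute rebalance was spent in the analytics stage.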
Rebalance report
{"stageInfo":{"analytics":{"totalProgress":100,"perNodeProgress":{"ns_1@svc-da-node-016.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com":1,"ns_1@svc-da-node-015.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com":1,"ns_1@svc-da-node-014.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com":1,"ns_1@svc-da-node-013.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com":1,"ns_1@svc-da-node-012.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com":1,"ns_1@svc-da-node-011.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com":1,"ns_1@svc-da-node-010.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com":1,"ns_1@svc-da-node-009.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com":1,"ns_1@svc-da-node-008.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com":1,"ns_1@svc-da-node-007.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com":1,"ns_1@svc-da-node-006.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com":1,"ns_1@svc-da-node-005.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com":1,"ns_1@svc-da-node-004.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com":1,"ns_1@svc-da-node-003.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com":1,"ns_1@svc-da-node-002.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com":1,"ns_1@svc-da-node-001.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com":1},"startTime":"2024-07-18T10:35:08.753Z","completedTime":"2024-07-18T11:14:40.146Z","timeTaken":2371393},"data":{"totalProgress":100,"perNodeProgress":{"ns_1@svc-da-node-016.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com":1,"ns_1@svc-da-node-015.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com":1,"ns_1@svc-da-node-014.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com":1,"ns_1@svc-da-node-013.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com":1,"ns_1@svc-da-node-012.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com":1,"ns_1@svc-da-node-011.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com":1,"ns_1@svc-da-node-010.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com":1,"ns_1@svc-da-node-009.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com":1,"ns_1@svc-da-node-008.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com":1,"ns_1@svc-da-node-007.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com":1,"ns_1@svc-da-node-006.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com":1,"ns_1@svc-da-node-005.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com":1,"ns_1@svc-da-node-004.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com":1,"ns_1@svc-da-node-003.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com":1,"ns_1@svc-da-node-002.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com":1,"ns_1@svc-da-node-001.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com":1},"startTime":"2024-07-18T10:35:08.527Z","completedTime":"2024-07-18T10:35:08.753Z","timeTaken":226}},"rebalanceId":"615d2f4728addeb7455daf6301c60a39","nodesInfo":{"active_nodes":["ns_1@svc-da-node-001.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com","ns_1@svc-da-node-002.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com","ns_1@svc-da-node-003.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com","ns_1@svc-da-node-004.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com","ns_1@svc-da-node-005.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com","ns_1@svc-da-node-006.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com","ns_1@svc-da-node-007.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com","ns_1@svc-da-node-008.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com","ns_1@svc-da-node-009.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com","ns_1@svc-da-node-010.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com","ns_1@svc-da-node-011.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com","ns_1@svc-da-node-012.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com","ns_1@svc-da-node-013.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com","ns_1@svc-da-node-014.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com","ns_1@svc-da-node-015.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com","ns_1@svc-da-node-016.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com"],"keep_nodes":["ns_1@svc-da-node-001.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com","ns_1@svc-da-node-002.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com","ns_1@svc-da-node-003.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com","ns_1@svc-da-node-004.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com","ns_1@svc-da-node-005.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com","ns_1@svc-da-node-006.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com","ns_1@svc-da-node-007.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com","ns_1@svc-da-node-008.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com","ns_1@svc-da-node-009.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com","ns_1@svc-da-node-010.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com","ns_1@svc-da-node-011.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com","ns_1@svc-da-node-012.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com","ns_1@svc-da-node-013.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com","ns_1@svc-da-node-014.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com","ns_1@svc-da-node-015.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com","ns_1@svc-da-node-016.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com"],"eject_nodes":[],"delta_nodes":[],"failed_nodes":[]},"masterNode":"ns_1@svc-da-node-002.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com","startTime":"2024-07-18T10:35:08.525Z","completedTime":"2024-07-18T11:14:40.172Z","timeTaken":2371647,"completionMessage":"Rebalance completed successfully."}
The messages started appearing during the scaling operation:
2024-07-18T11:01:52.077+00:00 ERRO CBAS.impls.LSMHarness [Executor-312:9f94db4ff041223ad40d69a2cb21456b] MERGE operation.afterFinalize failed on {"dir" : "/var/cb-cache/@analytics/v_iodevice_10/storage/partition_26/Database8PrNChAFZ/scope0NbrxYVRQ/remotedatasetmBjsMIbj/0/remotedatasetmBjsMIbj", "memory" : [{"state":"INACTIVE", "writers":0, "readers":0, "pendingFlushes":0, "id":"[349,349]", "index":{"class": "BTree", "file": "storage/partition_26/Database8PrNChAFZ/scope0NbrxYVRQ/remotedatasetmBjsMIbj/0/remotedatasetmBjsMIbj_virtual_0"}}, {"state":"READABLE_WRITABLE", "writers":0, "readers":0, "pendingFlushes":0, "id":"[350,350]", "index":{"class": "BTree", "file": "storage/partition_26/Database8PrNChAFZ/scope0NbrxYVRQ/remotedatasetmBjsMIbj/0/remotedatasetmBjsMIbj_virtual_1"}}], "disk" : 5, "num-scheduled-flushes":0, "current-memory-component":1}
software.amazon.awssdk.services.s3.model.S3Exception: Please reduce your request rate. (Service: S3, Status Code: 503, Request ID: R3SP10WN1SEC03Z4, Extended Request ID: VCrDwu18le+EIIBPrmtfMaoNmZdsgxLEdB/xVBFw2RjJXzQ64AF7AZfmwXVBgLRR4GdSJn46O4A=)
They continued until the JVM halted at 2024-07-18T13:37:54:
Suppressed: software.amazon.awssdk.core.exception.SdkClientException: Request attempt 1 failure: Please reduce your request rate. (Service: S3, Status Code: 503, Request ID: 39AJP84GDAGJEAAF, Extended Request ID: S/be7JzGZevzVV6OgR/pIfSaoNaPZJRnnBPLXAzY2it+1Kp5bd8af6U4YNZCuJwPpi/i63nYKr0=)
Suppressed: software.amazon.awssdk.core.exception.SdkClientException: Request attempt 2 failure: Please reduce your request rate. (Service: S3, Status Code: 503, Request ID: 39AK2VGR2DW1AK6Y, Extended Request ID: r9dxi+N8senMxw70MYo3Bkuk8PXitkjOFogXQEDOKJemVeKqUGTDaL7Ir0K2BQLAmaTg/zV8Lbs=)
Suppressed: software.amazon.awssdk.core.exception.SdkClientException: Request attempt 3 failure: Please reduce your request rate. (Service: S3, Status Code: 503, Request ID: 6AYXX1AMEY7VAXVS, Extended Request ID: y7IoivgNWfE0uHum/f3VtZgh5tYMRkADNCT7oonBlvYRfobd5sLZ8r/MUljhSk3pJtffuKLZ5Qg=)
2024-07-18T13:37:54.328+00:00 FATA CBAS.util.ExitUtil [Executor-80:9f94db4ff041223ad40d69a2cb21456b] JVM halting with status 88 (halting thread Thread[Executor-80:9f94db4ff041223ad40d69a2cb21456b,5,main], interrupted false)
2024-07-18T13:37:54.724+00:00 FATA CBAS.util.ExitUtil [pool-2-thread-1] Thread dump at halt:
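The log above shows three consecutive request attempts rejected with 503 "Please reduce your request rate" before the JVM halted. A common way for a client to respond to this kind of throttling is capped exponential backoff with jitter. The sketch below only illustrates that general technique. It is not the fix applied for this issue and not the AWS SDK's own retry policy, and `ThrottledError`, `call_with_backoff` and `flaky_put` are hypothetical names.

```python
import random
import time

class ThrottledError(Exception):
    """Hypothetical stand-in for a 503 'reduce your request rate' response."""

def call_with_backoff(operation, max_attempts=5, base_delay=0.1,
                      max_delay=5.0, sleep=time.sleep, rng=random.random):
    """Call operation(), retrying throttled calls with full-jitter backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except ThrottledError:
            if attempt == max_attempts:
                raise  # Retry budget exhausted; surface the error.
            # Exponential delay, capped, then scaled by a random factor
            # so that many clients do not retry in lockstep.
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(rng() * cap)

# Usage: an operation that is throttled twice and then succeeds.
attempts = []

def flaky_put():
    attempts.append(1)
    if len(attempts) < 3:
        raise ThrottledError("Please reduce your request rate.")
    return "ok"

delays = []  # Record the delays instead of actually sleeping.
result = call_with_backoff(flaky_put, sleep=delays.append)
print(result, len(attempts), delays)
```

Whether a longer or slower retry schedule would have helped here depends on the root cause, which this report leaves open.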
The cluster became unusable. Many of the following messages appear from 2024-07-18T11:18:44 until 2024-07-18T13:49:
2024-07-18T11:18:44.253+00:00 WARN CBAS.server.QueryServiceServlet [HttpExecutor(port:18095)-1] handleException: ASX0032: Cannot execute request, cluster is UNUSABLE: uuid=null, clientContextID=4a99209e-2b1c-431e-b267-5e59cbb116f3
2024-07-18T11:18:45.150+00:00 INFO CBAS.server.QueryServiceServlet [HttpExecutor(port:18095)-2] handleRequest: uuid=259f9295-c90e-4c36-beb6-aa0b7176bc97, clientContextID=null, {"host":"cb.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com:18095","path":"/analytics/service","statement":"<ud>completed_requests()</ud>","pretty":false,"mode":"immediate","clientContextID":null,"clientType":"ASTERIX","dataverse":null,"format":"CLEAN_JSON","timeout":9223372036854775807,"maxResultReads":1,"planFormat":"JSON","expressionTree":false,"rewrittenExpressionTree":false,"logicalPlan":false,"optimizedLogicalPlan":false,"job":false,"profile":"counts","signature":true,"multiStatement":true,"parseOnly":false,"readOnly":false,"maxWarnings":0,"sqlCompat":false,"source":null,"scanConsistency":null,"scanWait":null}

2024-07-18T13:49:32.363+00:00 WARN CBAS.server.QueryServiceServlet [HttpExecutor(port:9110)-15] handleException: ASX0032: Cannot execute request, cluster is UNUSABLE: uuid=null, clientContextID=66ce7c04-e156-4ea9-b1d0-2cc626d43f87
2024-07-18T13:50:33.952+00:00 WARN CBAS.cbas request to proxy /analytics/node/diagnostics to svc-da-node-009.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com:9110 failed: Get "https://svc-da-node-009.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com:9110/analytics/node/diagnostics": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
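To quantify how long the cluster reported itself as UNUSABLE, the ASX0032 lines can be filtered and the first and last timestamps compared. A minimal sketch over the two ASX0032 lines quoted above (the variable names are illustrative; in practice the lines would come from the analytics logs in the cbcollect archives):

```python
import re
from datetime import datetime

# The first and last ASX0032 lines quoted in this report.
LOG_LINES = [
    "2024-07-18T11:18:44.253+00:00 WARN CBAS.server.QueryServiceServlet "
    "[HttpExecutor(port:18095)-1] handleException: ASX0032: Cannot execute "
    "request, cluster is UNUSABLE: uuid=null, "
    "clientContextID=4a99209e-2b1c-431e-b267-5e59cbb116f3",
    "2024-07-18T13:49:32.363+00:00 WARN CBAS.server.QueryServiceServlet "
    "[HttpExecutor(port:9110)-15] handleException: ASX0032: Cannot execute "
    "request, cluster is UNUSABLE: uuid=null, "
    "clientContextID=66ce7c04-e156-4ea9-b1d0-2cc626d43f87",
]

# Capture the leading timestamp of any line that carries error code ASX0032.
PATTERN = re.compile(r"^(\S+) \w+ .*\bASX0032\b")

stamps = [
    datetime.fromisoformat(m.group(1))
    for m in map(PATTERN.match, LOG_LINES)
    if m
]
span = max(stamps) - min(stamps)
print(f"{len(stamps)} ASX0032 lines spanning {span}")
```

For the two lines shown, the span is about 2 hours 31 minutes.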
It is not clear whether the root cause is S3 rate limiting or something else, but the crux of the problem is that the cluster is unusable.
cbcollect logs:
https://cb-engineering.s3.amazonaws.com/SysTestColumnar2216Jul17/collectinfo-2024-07-18T150026-ns_1%40svc-da-node-001.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com.zip
https://cb-engineering.s3.amazonaws.com/SysTestColumnar2216Jul17/collectinfo-2024-07-18T150026-ns_1%40svc-da-node-002.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com.zip
https://cb-engineering.s3.amazonaws.com/SysTestColumnar2216Jul17/collectinfo-2024-07-18T150026-ns_1%40svc-da-node-003.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com.zip
https://cb-engineering.s3.amazonaws.com/SysTestColumnar2216Jul17/collectinfo-2024-07-18T150026-ns_1%40svc-da-node-004.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com.zip
https://cb-engineering.s3.amazonaws.com/SysTestColumnar2216Jul17/collectinfo-2024-07-18T150026-ns_1%40svc-da-node-005.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com.zip
https://cb-engineering.s3.amazonaws.com/SysTestColumnar2216Jul17/collectinfo-2024-07-18T150026-ns_1%40svc-da-node-006.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com.zip
https://cb-engineering.s3.amazonaws.com/SysTestColumnar2216Jul17/collectinfo-2024-07-18T150026-ns_1%40svc-da-node-007.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com.zip
https://cb-engineering.s3.amazonaws.com/SysTestColumnar2216Jul17/collectinfo-2024-07-18T150026-ns_1%40svc-da-node-008.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com.zip
https://cb-engineering.s3.amazonaws.com/SysTestColumnar2216Jul17/collectinfo-2024-07-18T150026-ns_1%40svc-da-node-009.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com.zip
https://cb-engineering.s3.amazonaws.com/SysTestColumnar2216Jul17/collectinfo-2024-07-18T150026-ns_1%40svc-da-node-010.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com.zip
https://cb-engineering.s3.amazonaws.com/SysTestColumnar2216Jul17/collectinfo-2024-07-18T150026-ns_1%40svc-da-node-011.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com.zip
https://cb-engineering.s3.amazonaws.com/SysTestColumnar2216Jul17/collectinfo-2024-07-18T150026-ns_1%40svc-da-node-012.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com.zip
https://cb-engineering.s3.amazonaws.com/SysTestColumnar2216Jul17/collectinfo-2024-07-18T150026-ns_1%40svc-da-node-013.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com.zip
https://cb-engineering.s3.amazonaws.com/SysTestColumnar2216Jul17/collectinfo-2024-07-18T150026-ns_1%40svc-da-node-014.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com.zip
https://cb-engineering.s3.amazonaws.com/SysTestColumnar2216Jul17/collectinfo-2024-07-18T150026-ns_1%40svc-da-node-015.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com.zip
https://cb-engineering.s3.amazonaws.com/SysTestColumnar2216Jul17/collectinfo-2024-07-18T150026-ns_1%40svc-da-node-016.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com.zip