Details
- Type: Bug
- Resolution: Unresolved
- Priority: Major
- Affects Version: Columnar 1.0.0 (build 1.0.0-2203)
- Triage: Untriaged
Description
After scaling down the cluster, it was unusable for brief periods.
Case 1
The rebalance completed at 2024-07-10T10:37:54.032Z, but we see "cluster is UNUSABLE" messages from 2024-07-10T10:50:44.396 until 2024-07-10T10:59:44.
Seen on node 008:
2024-07-10T10:50:44.396+00:00 WARN CBAS.server.QueryServiceServlet [HttpExecutor(port:18095)-1] handleException: ASX0032: Cannot execute request, cluster is UNUSABLE: uuid=null, clientContextID=24880163-e6dc-4b83-9e8c-f06afbd42af1
2024-07-10T10:54:20.453+00:00 INFO CBAS.cbas updating
...
2024-07-10T10:59:44.488+00:00 WARN CBAS.server.QueryServiceServlet [HttpExecutor(port:18095)-6] handleException: ASX0032: Cannot execute request, cluster is UNUSABLE: uuid=null, clientContextID=6c41781f-b462-4cc0-ad49-e4de37251372
2024-07-10T10:59:45.784+00:00 INFO CBAS.messaging.NCMessageBroker [Worker:e71ff082812e12d4c853d428a9a7f9c2] Received message: StorageCleanupRequestMessage
2024-07-10T10:59:46.062+00:00 INFO CBAS
During this time, we see these messages on node 004:
2024-07-10T10:57:22.457+00:00 INFO CBAS.server.AbstractServlet [HttpExecutor(port:9111)-1] sendError: status=503 Service Unavailable, message={"errors":[{"code":23000,"msg":"Analytics Service is temporarily unavailable","retriable":true}],"status":"errors"}
We also see these messages on node 022:
2024-07-10T10:56:51.536+00:00 INFO CBAS.server.AbstractServlet [HttpExecutor(port:9111)-0] sendError: status=503 Service Unavailable, message={"errors":[{"code":23000,"msg":"Analytics Service is temporarily unavailable","retriable":true}],"status":"errors"}
2024-07-10T10:57:22.457+00:00 INFO CBAS.server.AbstractServlet [HttpExecutor(port:9111)-1] sendError: status=503 Service Unavailable, message={"errors":[{"code":23000,"msg":"Analytics Service is temporarily unavailable","retriable":true}],"status":"errors"}
2024-07-10T10:57:52.305+00:00 INFO CBAS.server.AbstractServlet [HttpExecutor(port:9111)-2] sendError: status=503 Service Unavailable, message={"errors":[{"code":23000,"msg":"Analytics Service is temporarily unavailable","retriable":true}],"status":"errors"}
2024-07-10T10:58:22.288+00:00 INFO CBAS.server.AbstractServlet [HttpExecutor(port:9111)-3] sendError: status=503 Service Unavailable, message={"errors":[{"code":23000,"msg":"Analytics Service is temporarily unavailable","retriable":true}],"status":"errors"}
2024-07-10T10:58:52.306+00:00 INFO CBAS.server.AbstractServlet [HttpExecutor(port:9111)-4] sendError: status=503 Service Unavailable, message={"errors":[{"code":23000,"msg":"Analytics Service is temporarily unavailable","retriable":true}],"status":"errors"}
2024-07-10T10:59:34.820+00:00 INFO CBAS.server.AbstractServlet [HttpExecutor(port:9111)-5] sendError: status=503 Service Unavailable, message={"errors":[{"code":23000,"msg":"Analytics Service is temporarily unavailable","retriable":true}],"status":"errors"}
2024-07-10T10:59:34.938+00:00 INFO CBAS.server.AbstractServlet [HttpExecutor(port:9111)-6] sendError: status=503 Service Unavailable, message={"errors":[{"code":23000,"msg":"Analytics Service is temporarily unavailable","retriable":true}],"status":"errors"}
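The 503 responses above carry "retriable":true, so clients are expected to treat them as transient and retry. As a minimal sketch (not part of any Couchbase SDK; the function name and policy are illustrative), a client could decide whether to retry based on the status code and that flag in the error payload:

```python
import json

def should_retry(status: int, body: str) -> bool:
    """Return True when the response indicates a transient, retriable condition.

    Hypothetical helper: it mirrors the payload shape seen in the logs,
    {"errors":[{"code":23000,...,"retriable":true}],"status":"errors"}.
    """
    if status != 503:
        return False
    try:
        errors = json.loads(body).get("errors", [])
    except (ValueError, TypeError, AttributeError):
        return False
    # Retry only if every reported error is marked retriable by the server.
    return bool(errors) and all(e.get("retriable", False) for e in errors)

# Payload copied from the node 022 log excerpt above.
payload = ('{"errors":[{"code":23000,'
           '"msg":"Analytics Service is temporarily unavailable",'
           '"retriable":true}],"status":"errors"}')
```

A caller would typically combine this with a bounded backoff loop rather than retrying indefinitely.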
Case 2 (a very brief period, which may not indicate anything wrong; it could just be transient network problems). The cluster scaled down from 8 to 4 nodes and the rebalance completed at 2024-07-10T13:14:10.555.
Seen on node 016:
2024-07-10T13:45:21.521+00:00 WARN CBAS.server.QueryServiceServlet [HttpExecutor(port:9110)-11] handleException: ASX0032: Cannot execute request, cluster is UNUSABLE: uuid=null, clientContextID=ddebf36d-7d4d-4ebe-a95e-4df23ba8009a
2024-07-10T13:45:21.532+00:00 INFO CBAS.server.QueryServiceServlet [HttpExecutor(port:9110)-12] handleRequest: uuid=dbebb985-0042-4157-bed3-c84f74238b26, clientContextID=null,
...
2024-07-10T13:45:21.561+00:00 WARN CBAS.server.QueryServiceServlet [HttpExecutor(port:9110)-15] handleException: ASX0032: Cannot execute request, cluster is UNUSABLE: uuid=null, clientContextID=79e823ab-b1ff-4903-9e85-92814ce962a6
The first case is perhaps worth looking into, since the cluster was unusable for a period of roughly 10 minutes. Both cases occurred about 20-30 minutes after the rebalance operations completed (it is not clear whether the rebalance is related).
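The duration estimate above can be checked directly from the first and last UNUSABLE timestamps in the case 1 log excerpt, which comes out to just over nine minutes (a quick triage sketch, using only the timestamps quoted above):

```python
from datetime import datetime

# Timestamps copied from the node 008 log excerpt for case 1.
first = datetime.fromisoformat("2024-07-10T10:50:44.396+00:00")
last = datetime.fromisoformat("2024-07-10T10:59:44.488+00:00")

window = last - first
print(f"unusable window: {window}")  # just over 9 minutes
```

The same arithmetic against the rebalance completion time (10:37:54.032Z) puts the first UNUSABLE message about 13 minutes after the rebalance finished.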
cbcollect for case 1:
https://cb-engineering.s3.amazonaws.com/SysTestColumnarJul9/collectinfo-2024-07-10T112856-ns_1%40svc-da-node-004.adewm3olqtoa4vfw.sandbox.nonprod-project-avengers.com.zip
https://cb-engineering.s3.amazonaws.com/SysTestColumnarJul9/collectinfo-2024-07-10T112856-ns_1%40svc-da-node-006.adewm3olqtoa4vfw.sandbox.nonprod-project-avengers.com.zip
https://cb-engineering.s3.amazonaws.com/SysTestColumnarJul9/collectinfo-2024-07-10T112856-ns_1%40svc-da-node-008.adewm3olqtoa4vfw.sandbox.nonprod-project-avengers.com.zip
https://cb-engineering.s3.amazonaws.com/SysTestColumnarJul9/collectinfo-2024-07-10T112856-ns_1%40svc-da-node-014.adewm3olqtoa4vfw.sandbox.nonprod-project-avengers.com.zip
https://cb-engineering.s3.amazonaws.com/SysTestColumnarJul9/collectinfo-2024-07-10T112856-ns_1%40svc-da-node-016.adewm3olqtoa4vfw.sandbox.nonprod-project-avengers.com.zip
https://cb-engineering.s3.amazonaws.com/SysTestColumnarJul9/collectinfo-2024-07-10T112856-ns_1%40svc-da-node-017.adewm3olqtoa4vfw.sandbox.nonprod-project-avengers.com.zip
https://cb-engineering.s3.amazonaws.com/SysTestColumnarJul9/collectinfo-2024-07-10T112856-ns_1%40svc-da-node-022.adewm3olqtoa4vfw.sandbox.nonprod-project-avengers.com.zip
https://cb-engineering.s3.amazonaws.com/SysTestColumnarJul9/collectinfo-2024-07-10T112856-ns_1%40svc-da-node-029.adewm3olqtoa4vfw.sandbox.nonprod-project-avengers.com.zip
cbcollect for case 2:
https://cb-engineering.s3.amazonaws.com/SysTestColumnarJul9/collectinfo-2024-07-10T145129-ns_1%40svc-da-node-006.adewm3olqtoa4vfw.sandbox.nonprod-project-avengers.com.zip
https://cb-engineering.s3.amazonaws.com/SysTestColumnarJul9/collectinfo-2024-07-10T145129-ns_1%40svc-da-node-008.adewm3olqtoa4vfw.sandbox.nonprod-project-avengers.com.zip
https://cb-engineering.s3.amazonaws.com/SysTestColumnarJul9/collectinfo-2024-07-10T145129-ns_1%40svc-da-node-016.adewm3olqtoa4vfw.sandbox.nonprod-project-avengers.com.zip
https://cb-engineering.s3.amazonaws.com/SysTestColumnarJul9/collectinfo-2024-07-10T145129-ns_1%40svc-da-node-022.adewm3olqtoa4vfw.sandbox.nonprod-project-avengers.com.zip
Marking this as Major because the cluster recovered and was able to serve requests later. If the RCA determines that the priority should be lower or higher, please adjust accordingly.