Details
Type: Bug
Resolution: Unresolved
Priority: Major
Affects Version: Columnar 1.0.0
Build: 1.0.0-2203
Triage: Untriaged
Description
After scaling the cluster down, it was briefly unusable.
Case 1
The rebalance completed at 2024-07-10T10:37:54.032Z, but we see "cluster is UNUSABLE" messages from 2024-07-10T10:50:44.396 until 2024-07-10T10:59:44.
Seen on node 008:
2024-07-10T10:50:44.396+00:00 WARN CBAS.server.QueryServiceServlet [HttpExecutor(port:18095)-1] handleException: ASX0032: Cannot execute request, cluster is UNUSABLE: uuid=null, clientContextID=24880163-e6dc-4b83-9e8c-f06afbd42af1
2024-07-10T10:54:20.453+00:00 INFO CBAS.cbas updating
...
2024-07-10T10:59:44.488+00:00 WARN CBAS.server.QueryServiceServlet [HttpExecutor(port:18095)-6] handleException: ASX0032: Cannot execute request, cluster is UNUSABLE: uuid=null, clientContextID=6c41781f-b462-4cc0-ad49-e4de37251372
2024-07-10T10:59:45.784+00:00 INFO CBAS.messaging.NCMessageBroker [Worker:e71ff082812e12d4c853d428a9a7f9c2] Received message: StorageCleanupRequestMessage
2024-07-10T10:59:46.062+00:00 INFO CBAS
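From the first and last UNUSABLE warnings above, the outage window can be measured directly. A quick sketch (the timestamp strings are copied from the log lines above; the function name is just illustrative):

```python
from datetime import datetime

# Timestamp format used by the CBAS log lines quoted above.
FMT = "%Y-%m-%dT%H:%M:%S.%f%z"

def unusable_window(first_ts: str, last_ts: str) -> float:
    """Return the span between two log timestamps, in minutes."""
    start = datetime.strptime(first_ts, FMT)
    end = datetime.strptime(last_ts, FMT)
    return (end - start).total_seconds() / 60.0

# First and last UNUSABLE warnings seen on node 008 in case 1.
minutes = unusable_window("2024-07-10T10:50:44.396+00:00",
                          "2024-07-10T10:59:44.488+00:00")
print(f"cluster reported UNUSABLE for ~{minutes:.1f} minutes")
```

By these two warnings the window is roughly nine minutes, consistent with the ~10 minutes mentioned later in the description.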
During this time, we see these messages on node 004:
2024-07-10T10:57:22.457+00:00 INFO CBAS.server.AbstractServlet [HttpExecutor(port:9111)-1] sendError: status=503 Service Unavailable, message={"errors":[{"code":23000,"msg":"Analytics Service is temporarily unavailable","retriable":true}],"status":"errors"}
We also see these messages on node 022:
2024-07-10T10:56:51.536+00:00 INFO CBAS.server.AbstractServlet [HttpExecutor(port:9111)-0] sendError: status=503 Service Unavailable, message={"errors":[{"code":23000,"msg":"Analytics Service is temporarily unavailable","retriable":true}],"status":"errors"}
2024-07-10T10:57:22.457+00:00 INFO CBAS.server.AbstractServlet [HttpExecutor(port:9111)-1] sendError: status=503 Service Unavailable, message={"errors":[{"code":23000,"msg":"Analytics Service is temporarily unavailable","retriable":true}],"status":"errors"}
2024-07-10T10:57:52.305+00:00 INFO CBAS.server.AbstractServlet [HttpExecutor(port:9111)-2] sendError: status=503 Service Unavailable, message={"errors":[{"code":23000,"msg":"Analytics Service is temporarily unavailable","retriable":true}],"status":"errors"}
2024-07-10T10:58:22.288+00:00 INFO CBAS.server.AbstractServlet [HttpExecutor(port:9111)-3] sendError: status=503 Service Unavailable, message={"errors":[{"code":23000,"msg":"Analytics Service is temporarily unavailable","retriable":true}],"status":"errors"}
2024-07-10T10:58:52.306+00:00 INFO CBAS.server.AbstractServlet [HttpExecutor(port:9111)-4] sendError: status=503 Service Unavailable, message={"errors":[{"code":23000,"msg":"Analytics Service is temporarily unavailable","retriable":true}],"status":"errors"}
2024-07-10T10:59:34.820+00:00 INFO CBAS.server.AbstractServlet [HttpExecutor(port:9111)-5] sendError: status=503 Service Unavailable, message={"errors":[{"code":23000,"msg":"Analytics Service is temporarily unavailable","retriable":true}],"status":"errors"}
2024-07-10T10:59:34.938+00:00 INFO CBAS.server.AbstractServlet [HttpExecutor(port:9111)-6] sendError: status=503 Service Unavailable, message={"errors":[{"code":23000,"msg":"Analytics Service is temporarily unavailable","retriable":true}],"status":"errors"}
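The 503 payload marks these errors with "retriable":true, so a test client can back off and retry rather than counting the request as failed outright. A minimal sketch of that handling (the submit callable, attempt count, and delay values are illustrative assumptions, not part of any service contract):

```python
import json
import time

def should_retry(body: str) -> bool:
    """True if every error in a CBAS-style error payload is marked retriable."""
    errors = json.loads(body).get("errors", [])
    return bool(errors) and all(e.get("retriable", False) for e in errors)

def run_with_retry(submit, attempts: int = 5, base_delay: float = 2.0):
    """Call submit() -> (status, body); retry retriable 503s with backoff."""
    status, body = submit()
    for attempt in range(attempts - 1):
        if status != 503 or not should_retry(body):
            break
        time.sleep(base_delay * (2 ** attempt))  # exponential backoff
        status, body = submit()
    return status, body
```

Whether the test harness already retries on code 23000 would determine how much of the unusable window was actually visible to clients.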
Case 2 (a very brief period; it may not indicate anything wrong and could just be transient network problems). The cluster scaled down from 8 to 4 nodes, and the rebalance completed at 2024-07-10T13:14:10.555.
Seen on node 016:
2024-07-10T13:45:21.521+00:00 WARN CBAS.server.QueryServiceServlet [HttpExecutor(port:9110)-11] handleException: ASX0032: Cannot execute request, cluster is UNUSABLE: uuid=null, clientContextID=ddebf36d-7d4d-4ebe-a95e-4df23ba8009a
2024-07-10T13:45:21.532+00:00 INFO CBAS.server.QueryServiceServlet [HttpExecutor(port:9110)-12] handleRequest: uuid=dbebb985-0042-4157-bed3-c84f74238b26, clientContextID=null,
...
2024-07-10T13:45:21.561+00:00 WARN CBAS.server.QueryServiceServlet [HttpExecutor(port:9110)-15] handleException: ASX0032: Cannot execute request, cluster is UNUSABLE: uuid=null, clientContextID=79e823ab-b1ff-4903-9e85-92814ce962a6
The first case is probably worth looking into, since it lasted ~10 minutes. Note that both cases occurred about 20-30 minutes after rebalance operations (it is not clear whether the rebalance is related).
cbcollect logs for case 1:
https://cb-engineering.s3.amazonaws.com/SysTestColumnarJul9/collectinfo-2024-07-10T112856-ns_1%40svc-da-node-004.adewm3olqtoa4vfw.sandbox.nonprod-project-avengers.com.zip
https://cb-engineering.s3.amazonaws.com/SysTestColumnarJul9/collectinfo-2024-07-10T112856-ns_1%40svc-da-node-006.adewm3olqtoa4vfw.sandbox.nonprod-project-avengers.com.zip
https://cb-engineering.s3.amazonaws.com/SysTestColumnarJul9/collectinfo-2024-07-10T112856-ns_1%40svc-da-node-008.adewm3olqtoa4vfw.sandbox.nonprod-project-avengers.com.zip
https://cb-engineering.s3.amazonaws.com/SysTestColumnarJul9/collectinfo-2024-07-10T112856-ns_1%40svc-da-node-014.adewm3olqtoa4vfw.sandbox.nonprod-project-avengers.com.zip
https://cb-engineering.s3.amazonaws.com/SysTestColumnarJul9/collectinfo-2024-07-10T112856-ns_1%40svc-da-node-016.adewm3olqtoa4vfw.sandbox.nonprod-project-avengers.com.zip
https://cb-engineering.s3.amazonaws.com/SysTestColumnarJul9/collectinfo-2024-07-10T112856-ns_1%40svc-da-node-017.adewm3olqtoa4vfw.sandbox.nonprod-project-avengers.com.zip
https://cb-engineering.s3.amazonaws.com/SysTestColumnarJul9/collectinfo-2024-07-10T112856-ns_1%40svc-da-node-022.adewm3olqtoa4vfw.sandbox.nonprod-project-avengers.com.zip
https://cb-engineering.s3.amazonaws.com/SysTestColumnarJul9/collectinfo-2024-07-10T112856-ns_1%40svc-da-node-029.adewm3olqtoa4vfw.sandbox.nonprod-project-avengers.com.zip
cbcollect logs for case 2:
https://cb-engineering.s3.amazonaws.com/SysTestColumnarJul9/collectinfo-2024-07-10T145129-ns_1%40svc-da-node-006.adewm3olqtoa4vfw.sandbox.nonprod-project-avengers.com.zip
https://cb-engineering.s3.amazonaws.com/SysTestColumnarJul9/collectinfo-2024-07-10T145129-ns_1%40svc-da-node-008.adewm3olqtoa4vfw.sandbox.nonprod-project-avengers.com.zip
https://cb-engineering.s3.amazonaws.com/SysTestColumnarJul9/collectinfo-2024-07-10T145129-ns_1%40svc-da-node-016.adewm3olqtoa4vfw.sandbox.nonprod-project-avengers.com.zip
https://cb-engineering.s3.amazonaws.com/SysTestColumnarJul9/collectinfo-2024-07-10T145129-ns_1%40svc-da-node-022.adewm3olqtoa4vfw.sandbox.nonprod-project-avengers.com.zip
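To triage the archives above, one can scan the extracted analytics logs for the two messages quoted earlier (ASX0032 and the retriable 503). A rough helper sketch, assuming the archives have already been extracted locally; the glob pattern is an assumption about the archive layout, not a guarantee:

```python
import re
from pathlib import Path

# Matches the UNUSABLE warning and the retriable-503 lines quoted above,
# capturing the leading ISO-8601 timestamp.
PATTERN = re.compile(
    r"^(?P<ts>\S+) .*(ASX0032|Service is temporarily unavailable)")

def scan_log(text: str):
    """Yield (timestamp, line) for each unavailability message in a log."""
    for line in text.splitlines():
        m = PATTERN.match(line)
        if m:
            yield m.group("ts"), line.strip()

def scan_dir(root: str, glob: str = "**/analytics_*.log"):
    """Print node/timestamp for every hit under an extracted cbcollect tree."""
    for path in Path(root).glob(glob):
        for ts, _line in scan_log(path.read_text(errors="replace")):
            print(path.name, ts)
```

Sorting the printed timestamps across nodes would show whether the unavailability windows on 004, 008, and 022 overlap or are staggered.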
Marking this as Major because the cluster recovered and was able to serve requests afterwards. If the RCA determines it should be lower or higher, please adjust accordingly.
Attachments
Issue Links
Gerrit Reviews
For Gerrit Dashboard: MB-62680

# | Subject | Branch | Project | Status | CR | V
---|---|---|---|---|---|---
212797,2 | MB-62680: += ccNodeName to cluster state api | master | cbas-core | MERGED | +2 | +1
212883,3 | MB-62680: fix expected test result | master | cbas-core | MERGED | +2 | +1