MB-62680 (Couchbase Server)

[System Test] [Intermittent] Cluster unusable for a brief period after scale down operations

Details

    • Bug
    • Resolution: Unresolved
    • Major
    • Ionic
    • Columnar 1.0.0
    • analytics
    • 1.0.0-2203
    • Untriaged
    • 0
    • Unknown

    Description

      After scaling down the cluster, the cluster was unusable for brief periods.

      Case 1

      The rebalance completed at 2024-07-10T10:37:54.032Z, but we see "cluster is UNUSABLE" messages from 2024-07-10T10:50:44.396 until 2024-07-10T10:59:44.

      Seen on node 008:

      2024-07-10T10:50:44.396+00:00 WARN CBAS.server.QueryServiceServlet [HttpExecutor(port:18095)-1] handleException: ASX0032: Cannot execute request, cluster is UNUSABLE: uuid=null, clientContextID=24880163-e6dc-4b83-9e8c-f06afbd42af1
      2024-07-10T10:54:20.453+00:00 INFO CBAS.cbas updating 
      .....
      ....
       
      2024-07-10T10:59:44.488+00:00 WARN CBAS.server.QueryServiceServlet [HttpExecutor(port:18095)-6] handleException: ASX0032: Cannot execute request, cluster is UNUSABLE: uuid=null, clientContextID=6c41781f-b462-4cc0-ad49-e4de37251372
      2024-07-10T10:59:45.784+00:00 INFO CBAS.messaging.NCMessageBroker [Worker:e71ff082812e12d4c853d428a9a7f9c2] Received message: StorageCleanupRequestMessage
      2024-07-10T10:59:46.062+00:00 INFO CBAS
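
The window quoted above can be recovered from a log excerpt by collecting the timestamps of the ASX0032 lines. The following is a minimal sketch, not part of the original report: `unusable_window` is a hypothetical helper, and the sample text contains only the complete lines from the node 008 excerpt (the truncated lines are left out).

```python
from datetime import datetime

# Complete lines copied from the node 008 excerpt above.
LOG = """\
2024-07-10T10:50:44.396+00:00 WARN CBAS.server.QueryServiceServlet [HttpExecutor(port:18095)-1] handleException: ASX0032: Cannot execute request, cluster is UNUSABLE: uuid=null, clientContextID=24880163-e6dc-4b83-9e8c-f06afbd42af1
2024-07-10T10:59:44.488+00:00 WARN CBAS.server.QueryServiceServlet [HttpExecutor(port:18095)-6] handleException: ASX0032: Cannot execute request, cluster is UNUSABLE: uuid=null, clientContextID=6c41781f-b462-4cc0-ad49-e4de37251372
2024-07-10T10:59:45.784+00:00 INFO CBAS.messaging.NCMessageBroker [Worker:e71ff082812e12d4c853d428a9a7f9c2] Received message: StorageCleanupRequestMessage
"""


def unusable_window(text):
    """Return (first, last) timestamps of ASX0032 lines, or None if none are found."""
    stamps = []
    for line in text.splitlines():
        if "ASX0032" not in line:
            continue
        # The timestamp is the first space-separated field of each log line.
        stamps.append(datetime.fromisoformat(line.split(" ", 1)[0]))
    if not stamps:
        return None
    return min(stamps), max(stamps)


window = unusable_window(LOG)
if window is not None:
    first, last = window
    print("first:", first.isoformat(timespec="milliseconds"))
    print("last: ", last.isoformat(timespec="milliseconds"))
    print("span: ", last - first)
```

Running the same function over the full analytics logs from the cbcollect archives would show whether other nodes reported UNUSABLE in the same window.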
      

      During this time, we see these messages on node 004:

      2024-07-10T10:57:22.457+00:00 INFO CBAS.server.AbstractServlet [HttpExecutor(port:9111)-1] sendError: status=503 Service Unavailable, message={"errors":[{"code":23000,"msg":"Analytics Service is temporarily unavailable","retriable":true}],"status":"errors"}
      

      We also see these messages on node 022:

       
      2024-07-10T10:56:51.536+00:00 INFO CBAS.server.AbstractServlet [HttpExecutor(port:9111)-0] sendError: status=503 Service Unavailable, message={"errors":[{"code":23000,"msg":"Analytics Service is temporarily unavailable","retriable":true}],"status":"errors"}
      2024-07-10T10:57:22.457+00:00 INFO CBAS.server.AbstractServlet [HttpExecutor(port:9111)-1] sendError: status=503 Service Unavailable, message={"errors":[{"code":23000,"msg":"Analytics Service is temporarily unavailable","retriable":true}],"status":"errors"}
      2024-07-10T10:57:52.305+00:00 INFO CBAS.server.AbstractServlet [HttpExecutor(port:9111)-2] sendError: status=503 Service Unavailable, message={"errors":[{"code":23000,"msg":"Analytics Service is temporarily unavailable","retriable":true}],"status":"errors"}
      2024-07-10T10:58:22.288+00:00 INFO CBAS.server.AbstractServlet [HttpExecutor(port:9111)-3] sendError: status=503 Service Unavailable, message={"errors":[{"code":23000,"msg":"Analytics Service is temporarily unavailable","retriable":true}],"status":"errors"}
      2024-07-10T10:58:52.306+00:00 INFO CBAS.server.AbstractServlet [HttpExecutor(port:9111)-4] sendError: status=503 Service Unavailable, message={"errors":[{"code":23000,"msg":"Analytics Service is temporarily unavailable","retriable":true}],"status":"errors"}
      2024-07-10T10:59:34.820+00:00 INFO CBAS.server.AbstractServlet [HttpExecutor(port:9111)-5] sendError: status=503 Service Unavailable, message={"errors":[{"code":23000,"msg":"Analytics Service is temporarily unavailable","retriable":true}],"status":"errors"}
      2024-07-10T10:59:34.938+00:00 INFO CBAS.server.AbstractServlet [HttpExecutor(port:9111)-6] sendError: status=503 Service Unavailable, message={"errors":[{"code":23000,"msg":"Analytics Service is temporarily unavailable","retriable":true}],"status":"errors"}
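
Each sendError line carries a JSON payload after `message=`, and that payload marks the error as retriable. As a minimal sketch (the helper name `parse_send_error` is hypothetical and not part of any Couchbase tooling), the payload can be pulled out of a log line like this:

```python
import json

# One line copied from the node 022 excerpt above.
LINE = (
    '2024-07-10T10:56:51.536+00:00 INFO CBAS.server.AbstractServlet '
    '[HttpExecutor(port:9111)-0] sendError: status=503 Service Unavailable, '
    'message={"errors":[{"code":23000,"msg":"Analytics Service is temporarily '
    'unavailable","retriable":true}],"status":"errors"}'
)


def parse_send_error(line):
    """Return the JSON payload that follows 'message=', or None if it is absent."""
    marker = "message="
    idx = line.find(marker)
    if idx == -1:
        return None
    # Assumes the payload runs to the end of the line, as in the excerpt.
    return json.loads(line[idx + len(marker):].strip())


payload = parse_send_error(LINE)
if payload is not None:
    for err in payload["errors"]:
        print(err["code"], err["retriable"], err["msg"])
```

The `retriable` flag suggests that a client could retry these requests; whether the test client did so is not stated in this report.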
      

      Case 2 (a very brief period that might not indicate anything wrong; it could just be transient network problems). The cluster was scaled down from 8 to 4 nodes, and the rebalance completed at 2024-07-10T13:14:10.555.

      Seen on node 016:

      2024-07-10T13:45:21.521+00:00 WARN CBAS.server.QueryServiceServlet [HttpExecutor(port:9110)-11] handleException: ASX0032: Cannot execute request, cluster is UNUSABLE: uuid=null, clientContextID=ddebf36d-7d4d-4ebe-a95e-4df23ba8009a
      2024-07-10T13:45:21.532+00:00 INFO CBAS.server.QueryServiceServlet [HttpExecutor(port:9110)-12] handleRequest: uuid=dbebb985-0042-4157-bed3-c84f74238b26, clientContextID=null, 
       
      .......
       
      2024-07-10T13:45:21.561+00:00 WARN CBAS.server.QueryServiceServlet [HttpExecutor(port:9110)-15] handleException: ASX0032: Cannot execute request, cluster is UNUSABLE: uuid=null, clientContextID=79e823ab-b1ff-4903-9e85-92814ce962a6
      

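      The first case is perhaps worth looking into, since the cluster was unusable for roughly 9 minutes. Going by the timestamps above, the UNUSABLE messages started about 13 minutes (case 1) and about 31 minutes (case 2) after the rebalance completed. It is not clear whether the rebalance has anything to do with this.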
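
The timing for both cases can be checked directly from the timestamps quoted in this report. The sketch below is illustrative only: the helper names are hypothetical, and it assumes every timestamp is in UTC, including the case 2 rebalance time, which is quoted without an offset.

```python
from datetime import datetime, timezone


def parse_utc(ts):
    """Parse an ISO-8601 timestamp; a missing offset is assumed to mean UTC."""
    dt = datetime.fromisoformat(ts.replace("Z", "+00:00"))
    return dt if dt.tzinfo else dt.replace(tzinfo=timezone.utc)


# Timestamps copied from the two cases above.
CASES = {
    "case 1": {
        "rebalance_done": "2024-07-10T10:37:54.032Z",
        "first_unusable": "2024-07-10T10:50:44.396+00:00",
        "last_unusable": "2024-07-10T10:59:44.488+00:00",
    },
    "case 2": {
        "rebalance_done": "2024-07-10T13:14:10.555",
        "first_unusable": "2024-07-10T13:45:21.521+00:00",
        "last_unusable": "2024-07-10T13:45:21.561+00:00",
    },
}


def summarize(case):
    """Return (delay after rebalance, length of the UNUSABLE window) as timedeltas."""
    done = parse_utc(case["rebalance_done"])
    first = parse_utc(case["first_unusable"])
    last = parse_utc(case["last_unusable"])
    return first - done, last - first


for name, case in CASES.items():
    delay, window = summarize(case)
    print(f"{name}: first UNUSABLE {delay} after rebalance, window {window}")
```

The last UNUSABLE timestamp in each excerpt is only a lower bound on the end of the window, since the excerpts are elided.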

      cbcollect logs for case 1:

      https://cb-engineering.s3.amazonaws.com/SysTestColumnarJul9/collectinfo-2024-07-10T112856-ns_1%40svc-da-node-004.adewm3olqtoa4vfw.sandbox.nonprod-project-avengers.com.zip
      https://cb-engineering.s3.amazonaws.com/SysTestColumnarJul9/collectinfo-2024-07-10T112856-ns_1%40svc-da-node-006.adewm3olqtoa4vfw.sandbox.nonprod-project-avengers.com.zip
      https://cb-engineering.s3.amazonaws.com/SysTestColumnarJul9/collectinfo-2024-07-10T112856-ns_1%40svc-da-node-008.adewm3olqtoa4vfw.sandbox.nonprod-project-avengers.com.zip
      https://cb-engineering.s3.amazonaws.com/SysTestColumnarJul9/collectinfo-2024-07-10T112856-ns_1%40svc-da-node-014.adewm3olqtoa4vfw.sandbox.nonprod-project-avengers.com.zip
      https://cb-engineering.s3.amazonaws.com/SysTestColumnarJul9/collectinfo-2024-07-10T112856-ns_1%40svc-da-node-016.adewm3olqtoa4vfw.sandbox.nonprod-project-avengers.com.zip
      https://cb-engineering.s3.amazonaws.com/SysTestColumnarJul9/collectinfo-2024-07-10T112856-ns_1%40svc-da-node-017.adewm3olqtoa4vfw.sandbox.nonprod-project-avengers.com.zip
      https://cb-engineering.s3.amazonaws.com/SysTestColumnarJul9/collectinfo-2024-07-10T112856-ns_1%40svc-da-node-022.adewm3olqtoa4vfw.sandbox.nonprod-project-avengers.com.zip
      https://cb-engineering.s3.amazonaws.com/SysTestColumnarJul9/collectinfo-2024-07-10T112856-ns_1%40svc-da-node-029.adewm3olqtoa4vfw.sandbox.nonprod-project-avengers.com.zip

      cbcollect logs for case 2:

      https://cb-engineering.s3.amazonaws.com/SysTestColumnarJul9/collectinfo-2024-07-10T145129-ns_1%40svc-da-node-006.adewm3olqtoa4vfw.sandbox.nonprod-project-avengers.com.zip
      https://cb-engineering.s3.amazonaws.com/SysTestColumnarJul9/collectinfo-2024-07-10T145129-ns_1%40svc-da-node-008.adewm3olqtoa4vfw.sandbox.nonprod-project-avengers.com.zip
      https://cb-engineering.s3.amazonaws.com/SysTestColumnarJul9/collectinfo-2024-07-10T145129-ns_1%40svc-da-node-016.adewm3olqtoa4vfw.sandbox.nonprod-project-avengers.com.zip
      https://cb-engineering.s3.amazonaws.com/SysTestColumnarJul9/collectinfo-2024-07-10T145129-ns_1%40svc-da-node-022.adewm3olqtoa4vfw.sandbox.nonprod-project-avengers.com.zip

      Marking it as Major because the cluster recovered and was able to serve requests later. If RCA determines that it needs to be lower or higher, please change accordingly.

            People

              murtadha.hubail Murtadha Hubail
              pavan.pb Pavan PB