Couchbase Server / MB-62795

[System Test] Cluster unusable post scaling operations


Details

    • Bug
    • Resolution: Fixed
    • Major
    • Columnar 1.0.0
    • Columnar 1.0.0
    • analytics
    • 1.0.0-2216
    • Untriaged
    • 0
    • Unknown
    • Analytics Sprint 46

    Description

      The workload is as follows

      Type        Number of collections   Number of items (millions)   Total count (millions)
      Remote      80                      75                           6000
      Standalone  50                      8                            4000*
      Kafka       30                      33.5                         ~1000

      The change from the previous runs has been the increase in the number of Kafka collections.

      *Some standalone collections have 8 million items and some have multiples of 8 million items. The total doc count is 4000 million (4 billion) items.
      Number of links = 6 (2 remote + 2 external + 2 Kafka). 1 remote link and 1 Kafka link are active.

      During and after the scale-up operation (from 8 to 16 nodes), a large number of S3 rate-limiting messages were seen.

      Scaling was completed at 2024-07-18T11:14:40.146Z

      Rebalance report

      {"stageInfo":{"analytics":{"totalProgress":100,"perNodeProgress":{"ns_1@svc-da-node-016.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com":1,"ns_1@svc-da-node-015.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com":1,"ns_1@svc-da-node-014.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com":1,"ns_1@svc-da-node-013.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com":1,"ns_1@svc-da-node-012.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com":1,"ns_1@svc-da-node-011.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com":1,"ns_1@svc-da-node-010.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com":1,"ns_1@svc-da-node-009.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com":1,"ns_1@svc-da-node-008.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com":1,"ns_1@svc-da-node-007.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com":1,"ns_1@svc-da-node-006.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com":1,"ns_1@svc-da-node-005.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com":1,"ns_1@svc-da-node-004.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com":1,"ns_1@svc-da-node-003.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com":1,"ns_1@svc-da-node-002.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com":1,"ns_1@svc-da-node-001.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com":1},"startTime":"2024-07-18T10:35:08.753Z","completedTime":"2024-07-18T11:14:40.146Z","timeTaken":2371393},"data":{"totalProgress":100,"perNodeProgress":{"ns_1@svc-da-node-016.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com":1,"ns_1@svc-da-node-015.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com":1,"ns_1@svc-da-node-014.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com":1,"ns_1@svc-da-node-013.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com":1,"ns_1@svc-da-node-012.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com":1,"ns_1@svc-da-node-011.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com":1,"ns_1@svc-da-node-010.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com":1,"ns_1@svc-da-node-009.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com":1,"ns_1@svc-da-node-008.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com":1,"ns_1@svc-da-node-007.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com":1,"ns_1@svc-da-node-006.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com":1,"ns_1@svc-da-node-005.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com":1,"ns_1@svc-da-node-004.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com":1,"ns_1@svc-da-node-003.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com":1,"ns_1@svc-da-node-002.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com":1,"ns_1@svc-da-node-001.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com":1},"startTime":"2024-07-18T10:35:08.527Z","completedTime":"2024-07-18T10:35:08.753Z","timeTaken":226}},"rebalanceId":"615d2f4728addeb7455daf6301c60a39","nodesInfo":{"active_nodes":["ns_1@svc-da-node-001.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com","ns_1@svc-da-node-002.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com","ns_1@svc-da-node-003.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com","ns_1@svc-da-node-004.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com","ns_1@svc-da-node-005.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com","ns_1@svc-da-node-006.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com","ns_1@svc-da-node-007.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com","ns_1@svc-da-node-008.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com","ns_1@svc-da-node-009.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com","ns_1@svc-da-node-010.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com","ns_1@svc-da-node-011.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com","ns_1@svc-da-node-012.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com","ns_1@svc-da-node-013.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com","ns_1@svc-da-node-014.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com","ns_1@svc-da-node-015.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com","ns_1@svc-da-node-016.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com"],"keep_nodes":["ns_1@svc-da-node-001.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com","ns_1@svc-da-node-002.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com","ns_1@svc-da-node-003.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com","ns_1@svc-da-node-004.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com","ns_1@svc-da-node-005.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com","ns_1@svc-da-node-006.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com","ns_1@svc-da-node-007.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com","ns_1@svc-da-node-008.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com","ns_1@svc-da-node-009.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com","ns_1@svc-da-node-010.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com","ns_1@svc-da-node-011.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com","ns_1@svc-da-node-012.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com","ns_1@svc-da-node-013.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com","ns_1@svc-da-node-014.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com","ns_1@svc-da-node-015.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com","ns_1@svc-da-node-016.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com"],"eject_nodes":[],"delta_nodes":[],"failed_nodes":[]},"masterNode":"ns_1@svc-da-node-002.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com","startTime":"2024-07-18T10:35:08.525Z","completedTime":"2024-07-18T11:14:40.172Z","timeTaken":2371647,"completionMessage":"Rebalance completed successfully."}
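      The stage timings in the report can be cross-checked against its own timestamps. The following is a minimal sketch, assuming timeTaken is in milliseconds and that the timestamps are ISO 8601 with a trailing Z; the function name stage_durations_ms is hypothetical, and the embedded JSON is a trimmed copy of the report (perNodeProgress and nodesInfo omitted).

```python
import json
from datetime import datetime, timedelta


def stage_durations_ms(report: dict) -> dict:
    """Return {stage: (reported_ms, computed_ms)} for each stage in stageInfo."""
    result = {}
    for stage, info in report["stageInfo"].items():
        # fromisoformat() on older Pythons does not accept a trailing "Z".
        start = datetime.fromisoformat(info["startTime"].replace("Z", "+00:00"))
        end = datetime.fromisoformat(info["completedTime"].replace("Z", "+00:00"))
        # Integer division by one millisecond avoids float rounding.
        computed = (end - start) // timedelta(milliseconds=1)
        result[stage] = (info["timeTaken"], computed)
    return result


# Trimmed copy of the rebalance report (only the timing fields are kept).
report = json.loads("""
{"stageInfo": {
  "analytics": {"startTime": "2024-07-18T10:35:08.753Z",
                "completedTime": "2024-07-18T11:14:40.146Z",
                "timeTaken": 2371393},
  "data": {"startTime": "2024-07-18T10:35:08.527Z",
           "completedTime": "2024-07-18T10:35:08.753Z",
           "timeTaken": 226}}}
""")

for stage, (reported, computed) in stage_durations_ms(report).items():
    print(f"{stage}: reported={reported} ms, computed={computed} ms")
```

      Under these assumptions the analytics stage accounts for almost all of the rebalance time (about 39.5 minutes), while the data stage is effectively a no-op.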
      

      The messages started appearing during the scaling operation:

      2024-07-18T11:01:52.077+00:00 ERRO CBAS.impls.LSMHarness [Executor-312:9f94db4ff041223ad40d69a2cb21456b] MERGE operation.afterFinalize failed on {"dir" : "/var/cb-cache/@analytics/v_iodevice_10/storage/partition_26/Database8PrNChAFZ/scope0NbrxYVRQ/remotedatasetmBjsMIbj/0/remotedatasetmBjsMIbj", "memory" : [{"state":"INACTIVE", "writers":0, "readers":0, "pendingFlushes":0, "id":"[349,349]", "index":{"class": "BTree", "file": "storage/partition_26/Database8PrNChAFZ/scope0NbrxYVRQ/remotedatasetmBjsMIbj/0/remotedatasetmBjsMIbj_virtual_0"}}, {"state":"READABLE_WRITABLE", "writers":0, "readers":0, "pendingFlushes":0, "id":"[350,350]", "index":{"class": "BTree", "file": "storage/partition_26/Database8PrNChAFZ/scope0NbrxYVRQ/remotedatasetmBjsMIbj/0/remotedatasetmBjsMIbj_virtual_1"}}], "disk" : 5, "num-scheduled-flushes":0, "current-memory-component":1}
      software.amazon.awssdk.services.s3.model.S3Exception: Please reduce your request rate. (Service: S3, Status Code: 503, Request ID: R3SP10WN1SEC03Z4, Extended Request ID: VCrDwu18le+EIIBPrmtfMaoNmZdsgxLEdB/xVBFw2RjJXzQ64AF7AZfmwXVBgLRR4GdSJn46O4A=)
      

      They continued until the JVM halted at 2024-07-18T13:37:54:

      Suppressed: software.amazon.awssdk.core.exception.SdkClientException: Request attempt 1 failure: Please reduce your request rate. (Service: S3, Status Code: 503, Request ID: 39AJP84GDAGJEAAF, Extended Request ID: S/be7JzGZevzVV6OgR/pIfSaoNaPZJRnnBPLXAzY2it+1Kp5bd8af6U4YNZCuJwPpi/i63nYKr0=)
      	Suppressed: software.amazon.awssdk.core.exception.SdkClientException: Request attempt 2 failure: Please reduce your request rate. (Service: S3, Status Code: 503, Request ID: 39AK2VGR2DW1AK6Y, Extended Request ID: r9dxi+N8senMxw70MYo3Bkuk8PXitkjOFogXQEDOKJemVeKqUGTDaL7Ir0K2BQLAmaTg/zV8Lbs=)
      	Suppressed: software.amazon.awssdk.core.exception.SdkClientException: Request attempt 3 failure: Please reduce your request rate. (Service: S3, Status Code: 503, Request ID: 6AYXX1AMEY7VAXVS, Extended Request ID: y7IoivgNWfE0uHum/f3VtZgh5tYMRkADNCT7oonBlvYRfobd5sLZ8r/MUljhSk3pJtffuKLZ5Qg=)
      2024-07-18T13:37:54.328+00:00 FATA CBAS.util.ExitUtil [Executor-80:9f94db4ff041223ad40d69a2cb21456b] JVM halting with status 88 (halting thread Thread[Executor-80:9f94db4ff041223ad40d69a2cb21456b,5,main], interrupted false)
      2024-07-18T13:37:54.724+00:00 FATA CBAS.util.ExitUtil [pool-2-thread-1] Thread dump at halt: 
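      To see how the throttling evolved over time, the S3 503 responses can be bucketed per minute. This is a rough sketch, not part of any Couchbase tooling: the function name throttles_per_minute is hypothetical, and it assumes that exception lines without a timestamp belong to the most recent timestamped log line before them. The sample lines are trimmed copies of the excerpt above.

```python
import re
from collections import Counter

# Matches the leading timestamp of a log line and captures it up to the minute.
TS = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}):\d{2}\.\d{3}[+-]\d{2}:\d{2}\s")
THROTTLE = "Please reduce your request rate"


def throttles_per_minute(lines):
    """Count S3 503 throttling messages per minute of log time."""
    counts = Counter()
    minute = None  # minute of the most recent timestamped line
    for line in lines:
        match = TS.match(line.lstrip())
        if match:
            minute = match.group(1)
        if minute is not None and THROTTLE in line and "Status Code: 503" in line:
            counts[minute] += 1
    return counts


# Trimmed sample taken from the log excerpt in this report.
sample = [
    "2024-07-18T11:01:52.077+00:00 ERRO CBAS.impls.LSMHarness MERGE operation.afterFinalize failed",
    "software.amazon.awssdk.services.s3.model.S3Exception: Please reduce your "
    "request rate. (Service: S3, Status Code: 503, Request ID: R3SP10WN1SEC03Z4)",
]

for minute, count in sorted(throttles_per_minute(sample).items()):
    print(minute, count)
```

      Run over the full analytics logs from the cbcollect archives, a count like this would show whether throttling was confined to the rebalance window or persisted until the halt.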
      

      The cluster became unusable. Many messages like the following appear from 2024-07-18T11:18:44 until 2024-07-18T13:49:

      2024-07-18T11:18:44.253+00:00 WARN CBAS.server.QueryServiceServlet [HttpExecutor(port:18095)-1] handleException: ASX0032: Cannot execute request, cluster is UNUSABLE: uuid=null, clientContextID=4a99209e-2b1c-431e-b267-5e59cbb116f3
      2024-07-18T11:18:45.150+00:00 INFO CBAS.server.QueryServiceServlet [HttpExecutor(port:18095)-2] handleRequest: uuid=259f9295-c90e-4c36-beb6-aa0b7176bc97, clientContextID=null, {"host":"cb.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com:18095","path":"/analytics/service","statement":"<ud>completed_requests()</ud>","pretty":false,"mode":"immediate","clientContextID":null,"clientType":"ASTERIX","dataverse":null,"format":"CLEAN_JSON","timeout":9223372036854775807,"maxResultReads":1,"planFormat":"JSON","expressionTree":false,"rewrittenExpressionTree":false,"logicalPlan":false,"optimizedLogicalPlan":false,"job":false,"profile":"counts","signature":true,"multiStatement":true,"parseOnly":false,"readOnly":false,"maxWarnings":0,"sqlCompat":false,"source":null,"scanConsistency":null,"scanWait":null}
       
      2024-07-18T13:49:32.363+00:00 WARN CBAS.server.QueryServiceServlet [HttpExecutor(port:9110)-15] handleException: ASX0032: Cannot execute request, cluster is UNUSABLE: uuid=null, clientContextID=66ce7c04-e156-4ea9-b1d0-2cc626d43f87
      2024-07-18T13:50:33.952+00:00 WARN CBAS.cbas request to proxy /analytics/node/diagnostics to svc-da-node-009.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com:9110 failed: Get "https://svc-da-node-009.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com:9110/analytics/node/diagnostics": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
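      The length of the unusable window can be derived from the first and last ASX0032 lines. A minimal sketch, assuming every matching line starts with an ISO 8601 timestamp as in the excerpt above; unusable_window is a hypothetical helper name.

```python
from datetime import datetime


def unusable_window(lines, marker="ASX0032"):
    """Return (first, last) timestamps of lines containing marker, or None."""
    stamps = []
    for line in lines:
        if marker not in line:
            continue
        token = line.split(maxsplit=1)[0]
        try:
            stamps.append(datetime.fromisoformat(token))
        except ValueError:
            # Line does not start with a timestamp; skip it.
            continue
    if not stamps:
        return None
    return min(stamps), max(stamps)


# The first and last ASX0032 lines quoted in this report.
sample = [
    "2024-07-18T11:18:44.253+00:00 WARN CBAS.server.QueryServiceServlet "
    "[HttpExecutor(port:18095)-1] handleException: ASX0032: Cannot execute "
    "request, cluster is UNUSABLE: uuid=null, "
    "clientContextID=4a99209e-2b1c-431e-b267-5e59cbb116f3",
    "2024-07-18T13:49:32.363+00:00 WARN CBAS.server.QueryServiceServlet "
    "[HttpExecutor(port:9110)-15] handleException: ASX0032: Cannot execute "
    "request, cluster is UNUSABLE: uuid=null, "
    "clientContextID=66ce7c04-e156-4ea9-b1d0-2cc626d43f87",
]

first, last = unusable_window(sample)
print("unusable for", last - first)
```

      For the two lines quoted here, the window is roughly two and a half hours.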
      

      It is not clear whether the root cause is S3 rate limiting or something else. Either way, the crux of the problem is that the cluster is unusable.

      cbcollect logs:

      https://cb-engineering.s3.amazonaws.com/SysTestColumnar2216Jul17/collectinfo-2024-07-18T150026-ns_1%40svc-da-node-001.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com.zip
      https://cb-engineering.s3.amazonaws.com/SysTestColumnar2216Jul17/collectinfo-2024-07-18T150026-ns_1%40svc-da-node-002.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com.zip
      https://cb-engineering.s3.amazonaws.com/SysTestColumnar2216Jul17/collectinfo-2024-07-18T150026-ns_1%40svc-da-node-003.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com.zip
      https://cb-engineering.s3.amazonaws.com/SysTestColumnar2216Jul17/collectinfo-2024-07-18T150026-ns_1%40svc-da-node-004.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com.zip
      https://cb-engineering.s3.amazonaws.com/SysTestColumnar2216Jul17/collectinfo-2024-07-18T150026-ns_1%40svc-da-node-005.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com.zip
      https://cb-engineering.s3.amazonaws.com/SysTestColumnar2216Jul17/collectinfo-2024-07-18T150026-ns_1%40svc-da-node-006.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com.zip
      https://cb-engineering.s3.amazonaws.com/SysTestColumnar2216Jul17/collectinfo-2024-07-18T150026-ns_1%40svc-da-node-007.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com.zip
      https://cb-engineering.s3.amazonaws.com/SysTestColumnar2216Jul17/collectinfo-2024-07-18T150026-ns_1%40svc-da-node-008.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com.zip
      https://cb-engineering.s3.amazonaws.com/SysTestColumnar2216Jul17/collectinfo-2024-07-18T150026-ns_1%40svc-da-node-009.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com.zip
      https://cb-engineering.s3.amazonaws.com/SysTestColumnar2216Jul17/collectinfo-2024-07-18T150026-ns_1%40svc-da-node-010.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com.zip
      https://cb-engineering.s3.amazonaws.com/SysTestColumnar2216Jul17/collectinfo-2024-07-18T150026-ns_1%40svc-da-node-011.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com.zip
      https://cb-engineering.s3.amazonaws.com/SysTestColumnar2216Jul17/collectinfo-2024-07-18T150026-ns_1%40svc-da-node-012.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com.zip
      https://cb-engineering.s3.amazonaws.com/SysTestColumnar2216Jul17/collectinfo-2024-07-18T150026-ns_1%40svc-da-node-013.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com.zip
      https://cb-engineering.s3.amazonaws.com/SysTestColumnar2216Jul17/collectinfo-2024-07-18T150026-ns_1%40svc-da-node-014.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com.zip
      https://cb-engineering.s3.amazonaws.com/SysTestColumnar2216Jul17/collectinfo-2024-07-18T150026-ns_1%40svc-da-node-015.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com.zip
      https://cb-engineering.s3.amazonaws.com/SysTestColumnar2216Jul17/collectinfo-2024-07-18T150026-ns_1%40svc-da-node-016.mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com.zip
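      The 16 archive URLs follow a single pattern, so they can be generated rather than copied by hand when scripting a download. A small sketch; the constant and function names are hypothetical, and only the URL pattern comes from the list above.

```python
BASE = "https://cb-engineering.s3.amazonaws.com/SysTestColumnar2216Jul17"
DOMAIN = "mkrn3nailcfo0w-b.sandbox.nonprod-project-avengers.com"
STAMP = "2024-07-18T150026"


def cbcollect_urls(node_count=16):
    """Build the cbcollect archive URL for each node, numbered from 001."""
    return [
        f"{BASE}/collectinfo-{STAMP}-ns_1%40svc-da-node-{n:03d}.{DOMAIN}.zip"
        for n in range(1, node_count + 1)
    ]


for url in cbcollect_urls():
    print(url)
```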

          People

            pavan.pb Pavan PB
            pavan.pb Pavan PB