Couchbase Server / MB-60551

FTS Rebalance stuck (or very slow) | Vector search | 7.6.0-2056

Details

    • Bug
    • Resolution: Not a Bug
    • Major
    • 7.6.0
    • 7.6.0
    • fts
    • Untriaged
    • 0
    • Unknown

    Description

      The following test was run on Capella with 21 million documents, of which 1 million were vector documents.

      AWS cluster with AMI: couchbase-cloud-server-7.6.0-2056-x86_64-v1.0.27

      Initial service group configuration: all 6 services colocated on 3 nodes with 16 cores and 32 GB RAM.

      Tried to scale to: 5 nodes with 8 cores and 32 GB RAM each, with all services colocated.
      The cluster went into a seemingly endless rebalance, and the scaling has been stuck at FTS for 13 hours.
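
For context, the arithmetic behind this topology change can be sketched as follows. This is only an illustration: it assumes the initial 32 GB figure is per node (the report says "each" only for the target configuration), and the `Topology` class and its field names are hypothetical, not part of any Couchbase tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Topology:
    """Hypothetical helper describing a homogeneous group of nodes."""
    nodes: int
    cores_per_node: int
    ram_gb_per_node: int  # assumption: RAM figure applies to each node

    @property
    def total_cores(self) -> int:
        return self.nodes * self.cores_per_node

    @property
    def total_ram_gb(self) -> int:
        return self.nodes * self.ram_gb_per_node


# Figures taken from the report; per-node RAM for "before" is assumed.
before = Topology(nodes=3, cores_per_node=16, ram_gb_per_node=32)
after = Topology(nodes=5, cores_per_node=8, ram_gb_per_node=32)

core_delta = after.total_cores - before.total_cores
ram_delta_gb = after.total_ram_gb - before.total_ram_gb

print(f"cores: {before.total_cores} -> {after.total_cores} ({core_delta:+d})")
print(f"RAM GB: {before.total_ram_gb} -> {after.total_ram_gb} ({ram_delta_gb:+d})")
```

Under that assumption, the target topology has more nodes and more aggregate RAM (160 GB vs 96 GB) but fewer cores in total (40 vs 48), which may be worth keeping in mind when reading the rebalance timings.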

      Workload:
      Bucket: Magma bucket with 10 GB of available RAM, holding 21 million documents (1 million of them vector documents) with a total size of 26 GB.

      Service-wise workload:

      110 FTS indexes, of which 10 were vector indexes; 109 GSI indexes; and 31 dataverses.

       

      DD logs - 
      https://cb-engineering.s3.amazonaws.com/aman/collectinfo-2024-01-27T071305-ns_1%40svc-dqisea-node-007.zmhlrqrvzgek2dd.sandbox.nonprod-project-avengers.com.zip
      https://cb-engineering.s3.amazonaws.com/aman/collectinfo-2024-01-27T071305-ns_1%40svc-dqisea-node-008.zmhlrqrvzgek2dd.sandbox.nonprod-project-avengers.com.zip
      https://cb-engineering.s3.amazonaws.com/aman/collectinfo-2024-01-27T071305-ns_1%40svc-dqisea-node-009.zmhlrqrvzgek2dd.sandbox.nonprod-project-avengers.com.zip
      https://cb-engineering.s3.amazonaws.com/aman/collectinfo-2024-01-27T071305-ns_1%40svc-dqisea-node-010.zmhlrqrvzgek2dd.sandbox.nonprod-project-avengers.com.zip
      https://cb-engineering.s3.amazonaws.com/aman/collectinfo-2024-01-27T071305-ns_1%40svc-dqisea-node-011.zmhlrqrvzgek2dd.sandbox.nonprod-project-avengers.com.zip
       

      The cluster is live and can be found here:
      https://ui.qe-9.sandbox.nonprod-project-avengers.com/database/settings/activity?oid=4f91031a-7d04-4965-aa06-2f9afc837093&pid=466e9a5b-fa4d-41fe-9dbe-efc61940fba2&dbid=a13b3eb1-9ec9-4869-a245-1c23a616097b

       

      NOTE: unlike some of the previous vector search bugs, CPU utilisation never crossed the 80% threshold on any node.

      However, the RAM used by Search was high for most of the test, at around 25 GiB.

      Attachments

        1. image-2024-01-27-12-54-05-564.png
          85 kB
          Aman Srivastava
        2. image-2024-01-27-12-57-42-834.png
          60 kB
          Aman Srivastava
        3. image-2024-01-27-13-00-13-715.png
          111 kB
          Aman Srivastava
        4. image-2024-01-27-13-05-58-831.png
          135 kB
          Aman Srivastava
        5. image-2024-01-27-13-07-32-709.png
          69 kB
          Aman Srivastava
        6. image-2024-01-27-13-14-16-222.png
          49 kB
          Aman Srivastava
        7. image-2024-01-27-13-15-06-826.png
          49 kB
          Aman Srivastava
        8. image-2024-01-27-13-15-38-963.png
          71 kB
          Aman Srivastava
        9. image-2024-01-27-13-16-39-472.png
          68 kB
          Aman Srivastava
        10. image-2024-01-29-11-29-13-930.png
          41 kB
          Aman Srivastava

        Activity

          People

            aman.srivastava Aman Srivastava
            aman.srivastava Aman Srivastava