Couchbase Server / MB-60609

Investigate variation in vector search throughput (as reported on showfast)



    • Type: Task
    • Resolution: Unresolved
    • Priority: Critical
    • Fix Version/s: None
    • Affects Version/s: 7.6.0
    • Component/s: fts


      We have some results for throughput with Q1 queries, which are pure KNN queries:
      Vector Search, OPEN_AI dataset, pure KNN, wiki 500K x 1536 dim, 3 nodes, 1 bucket, 1s, 1c, FTS

      Setup:

      jts_instances.7, test_query_workers.25 — that means 7 main processes, each with 25 threads,
      i.e. 7 * 25 = 175 concurrent requests
      index_partitions = 3

      For the same build, 7.6.0-2054,
      each re-run produces a different throughput number.

      Let's take two cases (from the above link):

      Case 1: throughput = 10 q/sec
      logs :

      Case 2: throughput = 96 q/sec
      logs :

      Both tests have the same parameters: the same code path and the same perfrunner gerrit patches, everything identical. The only difference is that they were scheduled at different times.

      From the perfrunner side, the only thing that could differ is the choice (more specifically, the order) of queries from the query list.
      Our query list has 1000 queries, and throughout the test we pick queries from the list at random and execute them. The random choice is not seeded, so the first query of each test will not be the same item from the query list; the query list itself, however, is constant. And if our throughput is, say, 50 q/sec and the test phase runs for 600 sec, then in total we execute 30K queries, 30x the size of the query list, so it is unlikely that we are executing the same query all the time.
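One way to rule out query-order effects entirely would be to seed the random selection so that every run issues the same query sequence. A minimal sketch of the idea (the helper name is hypothetical, not actual perfrunner code):

```python
import random

def make_query_picker(query_list, seed=42):
    """Return a picker that draws queries in a reproducible order.

    Using a per-test random.Random with a fixed seed means two runs of
    the same test issue the identical query sequence, removing query
    order as a source of run-to-run variation.
    """
    rng = random.Random(seed)           # dedicated RNG, not the global one
    return lambda: rng.choice(query_list)

queries = [f"q{i}" for i in range(1000)]   # stand-in for the 1000-entry list
pick = make_query_picker(queries)
first_five = [pick() for _ in range(5)]

# A second picker built with the same seed reproduces the same order.
pick2 = make_query_picker(queries)
assert [pick2() for _ in range(5)] == first_five
```

If the variance persists even with a fixed seed, the query order can be excluded as the cause.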

      Looking at the cbmonitor graphs, we see that:

      1. CPU utilisation is continuously high (around 100%) in case 2, versus only about 10% in case 1.
      2. total_bytes_query_results is high in case 2, versus low in case 1.
      3. in_bytes_per_sec and out_bytes_per_sec are also very high in case 2.

      There are many other variations in the graphs as well.

      So, some questions:

      1. Why are CPU and the other metrics so low throughout the test in case 1?
      2. If we are hitting a good code path or a good indexing setup in one case and a bad one in the other, what are those cases and how do they get triggered?
      3. Why are some queries slower than others when the value of k is fixed?
      4. Is the index formed for one doc different from the index formed for another? In other words, will doc1 take more CPU when queried than doc2 if the required number of nearest neighbours (k) is the same?

        Our end goal here should be to stabilize the throughput numbers for the same build across re-runs; otherwise it will be impossible to benchmark throughput or detect regressions in the future.
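One way to approach question 3 is to time each query individually rather than looking only at aggregate throughput. A minimal sketch of such a measurement, using a pure-Python brute-force scan over tiny random vectors as a stand-in for the real 500K x 1536-dim index (none of this is FTS or perfrunner code):

```python
import random
import statistics
import time

def knn_brute_force(query, vectors, k):
    # Exact k-NN by squared Euclidean distance. A flat scan costs
    # O(N * dim) per query regardless of the query vector, so per-query
    # latency should be near-constant here for a fixed k.
    dists = [sum((a - b) ** 2 for a, b in zip(query, v)) for v in vectors]
    return sorted(range(len(dists)), key=dists.__getitem__)[:k]

rng = random.Random(0)
dim, n = 32, 500                      # tiny stand-ins for 1536 dim / 500K docs
data = [[rng.random() for _ in range(dim)] for _ in range(n)]

latencies = []
for _ in range(20):
    q = [rng.random() for _ in range(dim)]
    t0 = time.perf_counter()
    knn_brute_force(q, data, k=10)
    latencies.append(time.perf_counter() - t0)

print(statistics.mean(latencies), statistics.stdev(latencies))
```

With an exact flat scan the cost per query is essentially fixed; for a graph-based ANN index the traversal length depends on where the query vector lands relative to the graph, which is one plausible source of per-query variance even with k held constant.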

        I am uploading the query list as a txt file. Please ignore the first two entries in each line; the actual vector query starts from the third entry of each line. The file has a total of 1000 lines, representing 1000 queries.
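For anyone reproducing this, the file could be loaded with a small parser. A sketch under the assumptions that entries are whitespace-separated and that, per the description above, the first two entries of each line are metadata to skip (the exact delimiter in the uploaded file may differ):

```python
import os
import tempfile

def parse_query_list(path):
    """Read the querylist file: one query per line, skipping the first
    two entries of each line and parsing the rest as vector components."""
    queries = []
    with open(path) as f:
        for line in f:
            parts = line.split()          # assumption: whitespace-separated
            if len(parts) > 2:
                queries.append([float(x) for x in parts[2:]])
    return queries

# Demo with a two-line stand-in file in the assumed format.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("id1 meta1 0.1 0.2 0.3\nid2 meta2 0.4 0.5 0.6\n")
    tmp = f.name
qs = parse_query_list(tmp)
os.unlink(tmp)
assert qs == [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
```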





              Abhi Dangeti (abhinav)
              Devansh Srivastava (devansh.srivastava)


