Details
Description
We have some throughput results for Q1 queries, which are pure KNN queries.
http://showfast.sc.couchbase.com/#/timeline/Linux/jts/vector/Throughput
Vector Search, OPEN_AI dataset, pure KNN, wiki 500K x 1536 dim, 3 nodes, 1 bucket, 1s, 1c, FTS
Setup:
- jts_instances = 7, test_query_workers = 25, i.e. 7 main processes each with 25 threads, for 7 * 25 = 175 concurrent requests
- index_partitions = 3
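The concurrency arithmetic from the setup above can be sketched as follows (a minimal illustration; the variable names are mine, not perfrunner's):

```python
# Offered concurrency for this setup: each JTS main process runs a fixed
# number of query worker threads, and each worker holds one in-flight request.
jts_instances = 7           # main JTS processes
workers_per_instance = 25   # query worker threads per process

concurrent_requests = jts_instances * workers_per_instance
print(concurrent_requests)  # 175
```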
For the same build, 7.6.0-2054:
http://showfast.sc.couchbase.com/#/runs/jts_fts_throughput_vector_search_1s_1c_openAI_dataset_jts_throughput_atlas_setup_A_multiclient/7.6.0-2054
With each re-run I see variable throughput numbers.
Let's take two cases (from the link above):
Case 1: throughput = 10 q/sec
https://perf.jenkins.couchbase.com/job/atlas-multiclient/57/parameters/
Logs:
https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-atlas-multiclient-57/172.23.99.211.zip
https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-atlas-multiclient-57/172.23.99.39.zip
https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-atlas-multiclient-57/172.23.99.40.zip
Case 2: throughput = 96 q/sec
https://perf.jenkins.couchbase.com/job/atlas-multiclient/59/parameters/
Logs:
https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-atlas-multiclient-59/172.23.99.211.zip
https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-atlas-multiclient-59/172.23.99.39.zip
https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-atlas-multiclient-59/172.23.99.40.zip
Both tests have the same parameters: perfrunner code path, gerrit patches, everything is identical, except that they were scheduled at different times.
From the perfrunner side, the only possible difference is the choice (more specifically, the order) of queries picked from the query list.
Our query list has 1000 queries, and throughout the test we pick queries from the list at random and execute them. The random selection is not seeded, so the first query of each test will not be the same item from the query list, but the query list itself is constant. If our throughput is, say, 50 q/sec and the test phase ran for 600 sec, then in total we executed 30K queries, which is 30x the size of the query list, so it is unlikely we are executing the same query all the time.
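The selection behaviour described above can be sketched as follows (a hypothetical stand-in for perfrunner's loop, not its actual code; `pick_queries` and the seed handling are my assumptions):

```python
import random

# Stand-in for the 1000-line query list shipped with the test.
query_list = [f"query_{i}" for i in range(1000)]

def pick_queries(n, seed=None):
    # seed=None (the current test behaviour) draws from OS entropy, so the
    # query order differs on every run; a fixed seed would make it reproducible.
    rng = random.Random(seed)
    return [rng.choice(query_list) for _ in range(n)]

# Two runs with the same seed reproduce the same order; this is exactly
# what the unseeded setup does not guarantee across re-runs.
assert pick_queries(10, seed=42) == pick_queries(10, seed=42)

# At 50 q/sec over a 600 s test phase we issue 30,000 queries,
# i.e. 30x the size of the query list.
total_queries = 50 * 600
print(total_queries, total_queries // len(query_list))  # 30000 30
```

Seeding per run would remove query order as a variable when comparing re-runs of the same build.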
Looking at the cbmonitor graphs, we see that:
http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=fts_760-2054_run_test_be2a&label=case_1&snapshot=fts_760-2054_run_test_6bca&label=case_2
- CPU utilisation is continuously high (around 100%) in case 2, while in case 1 it stays at just 10%.
- total_bytes_query_results is much higher in case 2 than in case 1.
- in_bytes_per_sec and out_bytes_per_sec are also very high in case 2.
- There are many other variations across the graphs.
So, some questions:
- Why are CPU and the other metrics so low throughout the test in case 1?
- If we are hitting a good code path or a good indexing setup in one case and a bad one in the other, what are those cases and how are they triggered?
- Why are some queries slower than others when the value of k is fixed?
- Is the index built from one document different from the index built from another? In other words, will doc1 take more CPU when queried than doc2 if the required number of nearest neighbours (k) is the same?
Our end goal here is to stabilize the throughput numbers for the same build across re-runs; otherwise it will be impossible to benchmark throughput or find any regression in the future.
I am uploading the query list as a txt file. Please ignore the first two entries on each line; the actual vector query starts from the third entry of each line. The file has a total of 1000 lines, representing 1000 queries.
Attachments
Issue Links
- relates to MB-60565 Vector Search Throughput regressed. (Closed)