Details
Description
We have some throughput results for Q1 queries, which are pure KNN queries.
http://showfast.sc.couchbase.com/#/timeline/Linux/jts/vector/Throughput
Vector Search, OPEN_AI dataset, pure KNN, wiki 500K x 1536 dim, 3 nodes, 1 bucket, 1s, 1c, FTS
Setup:
- jts_instances = 7, test_query_workers = 25, i.e. 7 main processes each with 25 threads, for 7 * 25 = 175 concurrent requests
- index_partitions = 3
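The concurrency arithmetic from the setup above can be sketched as follows (a minimal illustration; the variable names are mine, not perfrunner's):

```python
# Offered concurrency for this setup: each JTS main process runs a fixed
# number of query worker threads, and each worker holds one in-flight request.
jts_instances = 7           # main JTS processes
workers_per_instance = 25   # query worker threads per process

concurrent_requests = jts_instances * workers_per_instance
print(concurrent_requests)  # 175
```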
For the same build, 7.6.0-2054:
http://showfast.sc.couchbase.com/#/runs/jts_fts_throughput_vector_search_1s_1c_openAI_dataset_jts_throughput_atlas_setup_A_multiclient/7.6.0-2054
With each re-run I see variable throughput numbers.
Let's take two cases (from the link above):
Case 1: throughput = 10 q/sec
https://perf.jenkins.couchbase.com/job/atlas-multiclient/57/parameters/
Logs:
https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-atlas-multiclient-57/172.23.99.211.zip
https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-atlas-multiclient-57/172.23.99.39.zip
https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-atlas-multiclient-57/172.23.99.40.zip
Case 2: throughput = 96 q/sec
https://perf.jenkins.couchbase.com/job/atlas-multiclient/59/parameters/
Logs:
https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-atlas-multiclient-59/172.23.99.211.zip
https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-atlas-multiclient-59/172.23.99.39.zip
https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-atlas-multiclient-59/172.23.99.40.zip
Both tests have the same parameters: perfrunner code path, gerrit patches, everything is identical, except that they were scheduled at different times.
From the perfrunner side, the only possible difference is the choice (more specifically, the order) of queries picked from the query list.
Our query list has 1000 queries, and throughout the test we pick queries from the list at random and execute them. The random selection is not seeded, so the first query of each test will not be the same item from the query list, but the query list itself is constant. If our throughput is, say, 50 q/sec and the test phase ran for 600 sec, then in total we executed 30K queries, which is 30x the size of the query list, so it is unlikely we are executing the same query all the time.
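The selection behaviour described above can be sketched as follows (a hypothetical stand-in for perfrunner's loop, not its actual code; `pick_queries` and the seed handling are my assumptions):

```python
import random

# Stand-in for the 1000-line query list shipped with the test.
query_list = [f"query_{i}" for i in range(1000)]

def pick_queries(n, seed=None):
    # seed=None (the current test behaviour) draws from OS entropy, so the
    # query order differs on every run; a fixed seed would make it reproducible.
    rng = random.Random(seed)
    return [rng.choice(query_list) for _ in range(n)]

# Two runs with the same seed reproduce the same order; this is exactly
# what the unseeded setup does not guarantee across re-runs.
assert pick_queries(10, seed=42) == pick_queries(10, seed=42)

# At 50 q/sec over a 600 s test phase we issue 30,000 queries,
# i.e. 30x the size of the query list.
total_queries = 50 * 600
print(total_queries, total_queries // len(query_list))  # 30000 30
```

Seeding per run would remove query order as a variable when comparing re-runs of the same build.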
Looking at the cbmonitor graphs, we see that:
http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=fts_760-2054_run_test_be2a&label=case_1&snapshot=fts_760-2054_run_test_6bca&label=case_2
- CPU utilisation is continuously high (around 100%) in case 2, while in case 1 it stays at just 10%.
- total_bytes_query_results is much higher in case 2 than in case 1.
- in_bytes_per_sec and out_bytes_per_sec are also very high in case 2.
- There are many other variations across the graphs.
So, some questions:
- Why are CPU and the other metrics so low throughout the test in case 1?
- If we are hitting a good code path or a good indexing setup in one case and a bad one in the other, what are those cases and how are they triggered?
- Why are some queries slower than others when the value of k is fixed?
- Is the index built from one document different from the index built from another? In other words, will doc1 take more CPU when queried than doc2 if the required number of nearest neighbours (k) is the same?
Our end goal here is to stabilize the throughput numbers for the same build across re-runs; otherwise it will be impossible to benchmark throughput or find any regression in the future.
I am uploading the query list as a txt file. Please ignore the first two entries on each line; the actual vector query starts from the third entry of each line. The file has a total of 1000 lines, representing 1000 queries.
Attachments
Issue Links
- relates to MB-60565 Vector Search Throughput regressed. (Closed)