Details
- Improvement
- Resolution: Unresolved
- Major
- 7.0.0
- 0
Description
TF-IDF scores a term using its term frequency and its inverse document frequency: it rewards terms that occur often within a document and penalizes terms that occur in many documents.
BM25 goes beyond this by saturating the term-frequency contribution and normalizing for document length - https://en.wikipedia.org/wiki/Okapi_BM25
Also a good read - https://opensourceconnections.com/blog/2015/10/16/bm25-the-next-generation-of-lucene-relevation/
From Elastic:
- https://www.elastic.co/blog/practical-bm25-part-1-how-shards-affect-relevance-scoring-in-elasticsearch
- https://www.elastic.co/blog/practical-bm25-part-2-the-bm25-algorithm-and-its-variables
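For reference, a minimal sketch of how the two scoring models differ on a single term. It uses the textbook formulas, not necessarily the exact variants implemented in bleve or Lucene. The function names, the toy corpus, and the k1 = 1.2 / b = 0.75 defaults are illustrative assumptions and not part of any Couchbase API.

```python
import math

# Toy corpus of pre-tokenized documents (illustrative data only).
CORPUS = [
    ["couchbase", "search"],
    ["couchbase"] * 10 + ["search"],
    ["vector", "search", "index"],
    ["full", "text", "index"],
]


def doc_freq(term, corpus):
    """Number of documents that contain the term at least once."""
    return sum(1 for doc in corpus if term in doc)


def tf_idf(term, doc, corpus):
    """Textbook TF-IDF: raw term frequency times log(N / df)."""
    tf = doc.count(term)
    df = doc_freq(term, corpus)
    if tf == 0 or df == 0:
        return 0.0
    return tf * math.log(len(corpus) / df)


def bm25(term, doc, corpus, k1=1.2, b=0.75):
    """Textbook Okapi BM25 for one term.

    k1 controls how quickly term frequency saturates;
    b controls how strongly document length is normalized.
    """
    tf = doc.count(term)
    if tf == 0:
        return 0.0
    n = len(corpus)
    df = doc_freq(term, corpus)
    avgdl = sum(len(d) for d in corpus) / n
    idf = math.log(1 + (n - df + 0.5) / (df + 0.5))
    length_norm = k1 * (1 - b + b * len(doc) / avgdl)
    return idf * tf * (k1 + 1) / (tf + length_norm)


if __name__ == "__main__":
    short_doc, long_doc = CORPUS[0], CORPUS[1]
    for name, score in (("tf-idf", tf_idf), ("bm25", bm25)):
        print(name,
              round(score("couchbase", short_doc, CORPUS), 3),
              round(score("couchbase", long_doc, CORPUS), 3))
```

In this sketch the TF-IDF score grows linearly with the number of occurrences, while the BM25 score for a term can never exceed idf * (k1 + 1), so repeating a term ten times helps far less under BM25.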
Todo:
- Weigh the pros and cons of TF-IDF vs. BM25 for Couchbase
- BM25 score is expected to work better alongside kNN distance (as it is a global score, as opposed to TF-IDF)
More details:
- This will be an index-creation-time setting; the user will be able to choose between TF-IDF and BM25
- This will entail a file format change to accommodate the new score metrics.
Attachments
Issue Links
- relates to
- MB-21636 [FTS] Custom Scoring
- Open