What is the most efficient way to use BM25 in Python as a search engine for a DataFrame?
I am looking to use BM25 for an NLP search engine. I followed the documentation and I am able to reproduce the expected outcome using my data. Using their methodology, I can use the following command:
doc_scores = bm25.get_scores(tokenized_query)
...to retrieve an array of scores for my corpus. If my corpus has length 1000, then doc_scores has the same length, where each entry is a 'similarity score' - the higher the number, the more similar.
My question: I plan to scale this to nearly 500,000 rows - but I am not sure of an efficient way to retrieve the top k scores from this model.
What I have tried:
First, I created a function that wraps the scores and corpus into a DataFrame, sorts it by score, and slices off the top k results:
def search_with_scores(query, corpus, k):
    # Tokenize the query and score every document in the corpus
    tokenized_query = query.split(" ")
    doc_scores = bm25.get_scores(tokenized_query)
    # Pair each score with its document text
    tmpDF = pd.DataFrame({
        "score": doc_scores,
        "text": corpus
    })
    # Sort all rows by score and keep the top k
    tmpDF = tmpDF.sort_values(by='score', ascending=False)
    return tmpDF[:k]

query = "Graphics cards are great"
search_with_scores(query, sentences, 3)
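As a smaller refinement within this same approach, pandas has a nlargest method that selects the top k rows without fully sorting all n rows. A minimal sketch with made-up scores standing in for bm25.get_scores output (the document names here are illustrative):

```python
import numpy as np
import pandas as pd

# Hypothetical scores for a corpus of 6 documents
# (stand-ins for the output of bm25.get_scores).
doc_scores = np.array([0.2, 1.5, 0.7, 3.1, 0.0, 2.4])
corpus = ["d0", "d1", "d2", "d3", "d4", "d5"]

df = pd.DataFrame({"score": doc_scores, "text": corpus})

# nlargest keeps only the k highest-scoring rows, avoiding a full sort.
top3 = df.nlargest(3, "score")
```

The result is equivalent to sort_values(ascending=False)[:k], but the selection avoids ordering all n rows.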
An alternative I saw mentioned in this article essentially uses the get_top_n function included in the package and then compares the returned strings to the corpus. My understanding is that comparing strings is a relatively expensive operation, so that may not be optimal either.
I worry that building a DataFrame from two long arrays, one of which contains a considerable amount of text, and then fully sorting it on every query will not scale. What can I do to improve the time/space complexity of my code / methodology?
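For context on what an answer might look like: one direction (a sketch under the assumption that NumPy is available; the scores and document names below are toy values, not output from the real model) is to select the top k directly on the score array with np.argpartition, which runs in O(n) rather than the O(n log n) of a full sort, and to index into the corpus only for the k winners:

```python
import numpy as np

def top_k(doc_scores, corpus, k):
    """Return the k highest-scoring (score, text) pairs.

    np.argpartition moves the indices of the k largest scores
    (in arbitrary order) to the end of the array in O(n);
    only those k entries are then sorted.
    """
    scores = np.asarray(doc_scores)
    idx = np.argpartition(scores, -k)[-k:]      # indices of top k, unordered
    idx = idx[np.argsort(scores[idx])[::-1]]    # order those k by score, descending
    return [(scores[i], corpus[i]) for i in idx]

# Toy example with made-up scores standing in for bm25.get_scores(...)
scores = [0.2, 1.5, 0.7, 3.1, 0.0, 2.4]
docs = ["d0", "d1", "d2", "d3", "d4", "d5"]
results = top_k(scores, docs, 3)
```

This avoids constructing a DataFrame per query entirely; the corpus list is touched only at the k selected indices.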
Thanks!
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
