What is the most efficient way to use BM25 in Python as a search engine for a DataFrame?

I am looking to use BM25 as the basis for an NLP search engine. I followed the documentation and was able to reproduce the expected outcome using my own data. Following their methodology, I can run the following command:

doc_scores = bm25.get_scores(tokenized_query)

...to retrieve an array of scores that correspond to the documents of my corpus. If my corpus were of length 1000, then doc_scores would also be of length 1000, where each entry is a 'similarity score': the higher the number, the more similar that document is to the query.

My question: I plan to scale this to nearly 500,000 rows, but I am not sure of an efficient way to retrieve the top k scores from this model.

What I have tried:

First, I created a function that essentially wraps the scores and corpus into a DataFrame, sorts it by score, and slices off the top k results:

import pandas as pd

def search_with_scores(query, corpus, k):
    tokenized_query = query.split(" ")
    # bm25 is a fitted BM25 object already in scope
    doc_scores = bm25.get_scores(tokenized_query)
    # pair every score with its document, then sort the entire frame
    tmpDF = pd.DataFrame({
        "score": doc_scores,
        "text": corpus
    })
    tmpDF = tmpDF.sort_values(by='score', ascending=False)
    return tmpDF[:k]
    
query = "Graphics cards are great"
search_with_scores(query, sentences, 3)

An alternative that I saw mentioned in this article essentially uses the get_top_n function included in the package, and then compares the returned strings against the corpus to recover their positions. My understanding is that comparing long strings is a relatively expensive operation, so that may not be optimal either.
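To illustrate the cost I mean: mapping each document returned by get_top_n back to its row requires a linear scan with string comparisons, something like this (the lists here are made up for illustration):

```python
# toy corpus and a hypothetical result of bm25.get_top_n
corpus = ["graphics cards are great", "cpus are fine", "hiking is fun"]
top_docs = ["graphics cards are great", "cpus are fine"]

# each list.index call compares full strings until it finds a match:
# O(n) string comparisons per returned document
top_indices = [corpus.index(doc) for doc in top_docs]
```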

I worry that building a DataFrame from two long arrays, one of which contains a considerable amount of text, and then fully sorting it on every query will not be scalable. What can I do to improve my code / methodology in terms of time and space complexity?

Thanks!



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow