'Inverted Index Speed Up

I've implemented a inverted index in python, which is essentially a dictionary, whose key is words in the corpus, value is the tuple containing document that the key occurs in together with its bm25 score.

{
"love": [(doc1, 12), (doc3, 7.9), (doc5, 6.5)],
"hate": [(doc2, 8.7), (doc4, 3.2)]
}

However, when I process a query, I find it's hard to benefit from the efficiency of inverted index, because I must iterate all words in the query in a for loop. Within this loop, I must further loop over the documents the word links and maintain a global score table for all documents.

I think this is not the optimal way. Some ideas to speed up? I think a batch dictionary which accepts multiple keys and returns multiple values in parallel would help.



Solution 1:[1]

It should be more efficient if you represent the inverted index as a matrix, particular a sparse matrix, where your rows are your corpus and the columns as each document.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 abckk