'Find the closest that a set of values occur in a list?
I am writing a search algorithm for text documents. The algorithm takes a set of keywords and returns those documents which closely match the provided set of keywords -- like a web search engine.
Right now I have implemented the TF-IDF metric for measuring document relevance, but that metric doesn't consider term concurrence, so it will rate a document that has the terms close to each other equal to one that has them at opposite sides of the document, even though the document with the terms being next to each other is probably the more relevant one.
So I want to find where the query terms occur the closest in a document. This is complicated, because there can be any number of terms involved, and the weighting factors that determine how good a particular placement is are somewhat arbitrary. As a first pass at the problem, here's a bruteforce solution for this:
Given a set of search terms {a, b, c} and a document [a1, _, _, b1, _, c1, b2, _, _, a2, _, a3]:
- Try placing the
aterm ata1,a2anda3; for each of those: - Try placing the
bterm atb1andb2; for each of those: - Try placing the
cterm atc1(and nowhere else, because that's the only occurrence). - For each placement of
{aN, bN, cN}, measuresum([abs(first, second) for first, second in itertools.product(positions, positions)]); lower score is better.
This works, but requires O(n^t * t^2) runtime for each document to be scored, where n is the number of terms in the document, and t is the number of search terms. This becomes infeasible rather quickly, and reminds me of NP-hard problems like the knapsack problem.
What options are there for getting the value of this metric without trying all the combinations? Approximate solutions are okay, as is storing some additional information per document (as long as it's not the precomputed bruteforce values).
Solution 1:[1]
There're standard feedback models that does this... the complexity is still linear in |V|, i.e., O(|VQ|) (V is the vocabulary of top-k documents, and |Q| is the number of query terms which is practically a small number). After you estimate the co-occurrence likelihoods, you simply rerank your initial retrieved list.
The standard feedback model to use is the relevance model.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Debasis |
