How to compute the TF-IDF for a given word in a document using Whoosh?
I have a dataframe including the following columns:
URLFullName Constituency Caucus Province SubjectOfBusiness Intervention time
where each row represents a document (column Intervention) and the other columns are metadata of this document.
I am using Whoosh to index the documents. My schema is:
from whoosh.analysis import CharsetFilter, StemmingAnalyzer
from whoosh.fields import ID, TEXT, Schema
from whoosh.formats import Frequency
from whoosh.support.charset import accent_map

def createSchema(indexRoot):
    # Stem tokens and fold accented characters
    stem_char_ana = StemmingAnalyzer() | CharsetFilter(accent_map)
    author = TEXT
    constituency = TEXT(stored=True, vector=Frequency, analyzer=stem_char_ana)
    caucus = TEXT(stored=True, vector=Frequency, analyzer=stem_char_ana)
    province = TEXT(stored=True, vector=Frequency, analyzer=stem_char_ana)
    intervention = TEXT(stored=True)
    stime = ID(stored=True, unique=True)
    schema = Schema(author=author, constituency=constituency, caucus=caucus, intervention=intervention, stime=stime, province=province)
    return schema
To create the index:
import os
from whoosh import index

dirname = "Path to index location"
# Checks for existing index path and creates one if not present
if not os.path.exists(dirname):
    os.mkdir(dirname)
print("Creating the Index")
schema = createSchema(dirname)
ix = index.create_in(dirname, schema)
To add documents to the index:
with ix.writer() as writer:
    for i in df.index:
        writer.add_document(
            author=str(df.loc[i, "URLFullName"]), constituency=str(df.loc[i, "Constituency"]),
            caucus=str(df.loc[i, "Caucus"]), intervention=str(df.loc[i, "Intervention"]),
            province=str(df.loc[i, "Province"]), stime=str(df.loc[i, "time"]))
Now my question: how can I get the TF-IDF of a given word in a given document?
What I tried:
from whoosh import index, qparser, scoring

def index_search(dirname, search_fields, search_query, verbose=True):
    ix = index.open_dir(dirname)
    schema = ix.schema
    # Create query parser that looks through designated fields in index
    og = qparser.OrGroup.factory(0.9)
    mp = qparser.MultifieldParser(search_fields, schema, group=og)
    # This is the user query
    q = mp.parse(search_query)
    # Actual searcher, prints top 10 hits
    with ix.searcher(weighting=scoring.TF_IDF()) as s:
        results = s.search(q, limit=10)
        print("Search Results: ")
        found = results.scored_length()
        if results.has_exact_length():
            print(f"Scored, {found}, of exactly, {len(results)}, documents")
        else:
            low = results.estimated_min_length()
            high = results.estimated_length()
            if verbose:
                print(f"Scored, {found}, of between, {low}, and, {high}, documents")
        for i in range(found):
            print(f"**File: {results[i]['stime']}")
            print(f"****score: {str(results[i].score)}")
            print(f"****************{results[i]}")
This function returns all the documents that contain search_query in their search_fields. It also returns the TF-IDF score of each result.
How can I limit the search to a single given document and a single query term? That is, I want to pass a word and a document to the function and get back the TF-IDF of that word in that document.
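One possible approach, sketched under assumptions: look up the document by its unique stime, read the word's frequency in that document from the term's posting list, and multiply it by the IDF. The helper names tf_idf and word_tf_idf are hypothetical. The smoothed IDF, log(N / (df + 1)) + 1, is assumed to match the default of Whoosh's weighting models, so it is worth checking against the installed version before relying on exact values.

```python
import math


def tf_idf(term_freq, doc_freq, doc_count):
    """TF-IDF with the smoothed IDF log(N / (df + 1)) + 1.

    Assumption: this mirrors the default idf() of Whoosh's weighting models.
    """
    if term_freq == 0:
        return 0.0
    return term_freq * (math.log(doc_count / (doc_freq + 1)) + 1)


def word_tf_idf(searcher, fieldname, word, **doc_key):
    """Return the TF-IDF of `word` in the single document matched by doc_key.

    `searcher` is an open Whoosh Searcher; doc_key identifies the document
    through a unique field, e.g. stime="<value stored at indexing time>".
    """
    docnum = searcher.document_number(**doc_key)
    if docnum is None:
        raise KeyError(f"no document matches {doc_key}")

    # Run the word through the field's analyzer so it matches the indexed
    # form (lowercased, stemmed, ...). Only the first token is used.
    terms = list(searcher.schema[fieldname].process_text(word))
    if not terms:
        return 0.0  # e.g. the word was removed as a stop word
    term = terms[0]

    reader = searcher.reader()
    if (fieldname, term) not in reader:
        return 0.0  # the term does not occur anywhere in this field

    # Advance the posting list to the target document and read the weight
    # (the term frequency, scaled by any field boost).
    matcher = reader.postings(fieldname, term)
    matcher.skip_to(docnum)
    found = matcher.is_active() and matcher.id() == docnum
    tf = matcher.weight() if found else 0.0

    return tf_idf(tf, reader.doc_frequency(fieldname, term), reader.doc_count_all())
```

Inside a `with ix.searcher() as s:` block this could be called as word_tf_idf(s, "intervention", "budget", stime=some_id), where "budget" and some_id are placeholders. An alternative that stays closer to index_search is to pass a filter query on stime to s.search and read the score of the single hit.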
Also, how can I stop words from being split during indexing? I have some words like Dan-Albas that I want indexed as a single token. Currently, dan and albas are indexed as two separate terms.
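For the hyphen question, one option is a tokenizer pattern that treats hyphen-joined words as a single token. The sketch below is an assumption about how this could be set up, not a confirmed fix: the names HYPHENATED_WORD, tokenize, and name_analyzer are hypothetical, and tokenize is only a plain-Python preview of what the pattern matches.

```python
import re

# One token: word characters, optionally joined by single hyphens.
HYPHENATED_WORD = r"\w+(?:-\w+)*"


def tokenize(text):
    """Plain-Python preview of the lowercased tokens the pattern yields."""
    return [m.group(0).lower() for m in re.finditer(HYPHENATED_WORD, text)]


def name_analyzer():
    """Build a Whoosh analyzer that keeps hyphenated names whole.

    Whoosh is imported here so the preview above runs without it.
    No stemming is applied, since stemming could alter proper names.
    """
    from whoosh.analysis import CharsetFilter, LowercaseFilter, RegexTokenizer
    from whoosh.support.charset import accent_map

    return RegexTokenizer(HYPHENATED_WORD) | LowercaseFilter() | CharsetFilter(accent_map)
```

The analyzer would be passed to the relevant field, for example TEXT(stored=True, analyzer=name_analyzer()), and the documents would need to be re-indexed, because the existing index was built with the old tokenization.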
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow