How to compute the TF-IDF for a given word in a document using Whoosh?

I have a dataframe including the following columns:

URLFullName, Constituency, Caucus, Province, SubjectOfBusiness, Intervention, time

where each row represents a document (column Intervention) and the other columns are metadata of this document.

I am using Whoosh to index the documents. My schema is:

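from whoosh.analysis import CharsetFilter, StemmingAnalyzer
from whoosh.fields import ID, TEXT, Schema
from whoosh.formats import Frequency
from whoosh.support.charset import accent_map

def createSchema(indexRoot):
    # Stem tokens, then fold accented characters to their unaccented form
    stem_char_ana = StemmingAnalyzer() | CharsetFilter(accent_map)

    author = TEXT
    constituency = TEXT(stored=True, vector=Frequency, analyzer=stem_char_ana)
    caucus = TEXT(stored=True, vector=Frequency, analyzer=stem_char_ana)
    province = TEXT(stored=True, vector=Frequency, analyzer=stem_char_ana)
    # The document text itself; uses the default analyzer
    intervention = TEXT(stored=True)
    # Timestamp used as the unique key of each document
    stime = ID(stored=True, unique=True)
    schema = Schema(author=author, constituency=constituency, caucus=caucus,
                    intervention=intervention, stime=stime, province=province)
    return schema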

To create the index:

import os
from whoosh import index

dirname = "Path to index location"
# Checks for existing index path and creates one if not present
if not os.path.exists(dirname):
    os.mkdir(dirname)
print("Creating the Index")

schema = createSchema(dirname)
ix = index.create_in(dirname, schema)

To add the documents to the index:

with ix.writer() as writer:
    for i in df.index:
        writer.add_document(
            author=str(df.loc[i, "URLFullName"]),
            constituency=str(df.loc[i, "Constituency"]),
            caucus=str(df.loc[i, "Caucus"]),
            intervention=str(df.loc[i, "Intervention"]),
            province=str(df.loc[i, "Province"]),
            stime=str(df.loc[i, "time"]),
        )

Now my question: how can I get the TF-IDF of a given word in a given document?

What I tried:

from whoosh import index, qparser, scoring

def index_search(dirname, search_fields, search_query, verbose=True):
    ix = index.open_dir(dirname)
    schema = ix.schema
    
    # Create query parser that looks through designated fields in index
    og = qparser.OrGroup.factory(0.9)
    mp = qparser.MultifieldParser(search_fields, schema, group = og)

    # This is the user query
    q = mp.parse(search_query)

    # Actual searcher, prints top 10 hits
    with ix.searcher(weighting=scoring.TF_IDF()) as s:
        results = s.search(q, limit = 10)
        print("Search Results: ")
        found = results.scored_length()
        
        if results.has_exact_length() :
            print(f"Scored, {found}, of exactly, {len(results)}, documents")
        else:
            low = results.estimated_min_length()
            high = results.estimated_length()
            if verbose:
                print(f"Scored, {found}, of between, {low}, and, {high}, documents")
        
        for i in range(found):
            print(f"**File: {results[i]['stime']}")
            print(f"****score: {str(results[i].score)}")
            print(f"****************{results[i]}")

This function returns all documents that contain search_query in any of the search_fields, along with the TF-IDF score of each result.

How can I restrict the search to a single given document and a single query word? That is, I want to pass a word and a document to this function and get back the TF-IDF of that word in that document.
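To make the target quantity concrete, here is a minimal sketch in plain Python (no Whoosh) of the TF-IDF of one word in one document. The names `tokenize`, `tf_idf` and the toy `docs` dictionary are hypothetical, and the smoothed IDF formula `log(N / (df + 1)) + 1` is an assumption; the value Whoosh reports under `scoring.TF_IDF()` may be computed differently.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Very simple analyzer (assumption): lowercase runs of word characters
    return re.findall(r"\w+", text.lower())

def tf_idf(word, doc_id, docs):
    """TF-IDF of `word` in docs[doc_id]; `docs` maps a document id to its text."""
    term = word.lower()
    # Term frequency: raw count of the term in the chosen document
    tf = Counter(tokenize(docs[doc_id]))[term]
    # Document frequency: number of documents that contain the term
    df = sum(1 for text in docs.values() if term in set(tokenize(text)))
    # Smoothed inverse document frequency (assumed formula)
    idf = math.log(len(docs) / (df + 1)) + 1
    return tf * idf

# Toy corpus keyed by a unique id, playing the role of the stime field
docs = {
    "t1": "the budget debate on the budget",
    "t2": "a question on housing",
    "t3": "the housing budget",
}
score = tf_idf("budget", "t1", docs)
print(score)
```

Inside Whoosh itself, one option that may work is to keep the term query as it is and pass a second query on the unique stime field through the `filter` argument of `searcher.search`, so that only the chosen document can match; that is a suggestion to verify against your index, not a confirmed answer.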

Also, how can I stop words from being split during indexing? For example, I want Dan-Albas to be indexed as a single token. At the moment, dan and albas are indexed separately.
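The split happens at tokenization time, when the token pattern stops at the hyphen. The sketch below uses only the standard `re` module to contrast a plain word pattern with one that allows internal hyphens; `PLAIN_PATTERN`, `HYPHEN_PATTERN` and `tokenize` are hypothetical names for illustration. Whoosh's `RegexTokenizer` and `StemmingAnalyzer` accept an `expression` argument, so a pattern like the second one could presumably be passed there, though how the stemmer then treats hyphenated tokens is not verified here.

```python
import re

# Plain pattern: a token is a run of word characters, so "-" ends a token
PLAIN_PATTERN = re.compile(r"\w+")
# Hyphen-aware pattern: word characters optionally joined by single hyphens
HYPHEN_PATTERN = re.compile(r"\w+(?:-\w+)*")

def tokenize(text, pattern):
    # Lowercase each match, mimicking a lowercasing analyzer step
    return [m.group(0).lower() for m in pattern.finditer(text)]

tokens_plain = tokenize("Dan-Albas spoke", PLAIN_PATTERN)
tokens_hyphen = tokenize("Dan-Albas spoke", HYPHEN_PATTERN)
print(tokens_plain)
print(tokens_hyphen)
```

The same analyzer would have to be used at query time as well, otherwise a query for Dan-Albas would still be split and would not match the single indexed token.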



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
