Count total number of modal verbs in text

I am trying to create a custom collection of words, organized into the following categories:

Modal    Tentative    Certainty    Generalizing
Can      Anyhow       Undoubtedly  Generally
May      anytime      Ofcourse     Overall
Might    anything     Definitely   On the Whole
Must     hazy         No doubt     In general
Shall    hope         Doubtless    All in all
ought to hoped        Never        Basically
will     uncertain    always       Essentially
need     undecidable  absolute     Most
Be to    occasional   assure       Every
Have to  somebody     certain      Some
Would    someone      clear        Often
Should   something    clearly      Rarely
Could    sort         inevitable   None
Used to  sorta        forever      Always

I am reading text from a CSV file row by row:

import nltk
import numpy as np
import pandas as pd
from collections import Counter, defaultdict
from nltk.tokenize import word_tokenize

count = defaultdict(int)
header_list = ["modal","Tentative","Certainity","Generalization"]
categorydf = pd.read_csv('Custom-Dictionary1.csv', names=header_list)
def analyze(file):
    df = pd.read_csv(file)
    modals = str(categorydf['modal'])
    tentative = str(categorydf['Tentative'])
    certainity = str(categorydf['Certainity'])
    generalization = str(categorydf['Generalization'])
    for text in df["Text"]:
        tokenize_text = text.split()
        for w in tokenize_text:          
            if w in modals:
                count[w] += 1
                       
analyze("test1.csv")
print(sum(count.values()))
print(count)

I want to count how many Modal/Tentative/Certainty words from the table above appear in each row of test1.csv, but I am not able to do so. My code instead generates a word-frequency count:

19
defaultdict(<class 'int'>, {'to': 7, 'an': 1, 'will': 2, 'a': 7, 'all': 2})

Note that 'an' and 'a' are not present in the table. I want to get: number of modal verbs = total modal verbs present in one row of test1.csv text.
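The stray matches come from `str(categorydf['modal'])`: converting the pandas column with `str()` produces its printed representation as one big string, so `if w in modals` does substring matching instead of word membership. A minimal sketch of the effect (toy Series values taken from the Modal column above):

```python
import pandas as pd

# str() on a Series gives its printed representation, one big string.
modals = str(pd.Series(["Can", "May", "will"]))

# Substring containment, not word membership:
print("a" in modals)    # True - 'a' occurs inside "Can" and "May"
print("an" in modals)   # True - 'an' occurs inside "Can"

# Membership in a set of lower-cased words behaves as intended:
modal_set = {w.lower() for w in ["Can", "May", "will"]}
print("a" in modal_set)     # False
print("will" in modal_set)  # True
```

This is why 'a', 'an', and 'to' show up in the counts even though they are not table entries.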

test1.csv:

"When LIWC was first developed, the goal was to devise an efficient will system"
"Within a few years, it became clear that there are two very broad categories of words"
"Content words are generally nouns, regular verbs, and many adjectives and adverbs."
"They convey the content of a communication."
"To go back to the phrase “It was a dark and stormy night” the content words are: “dark,” “stormy,” and “night.”"

I am stuck and not getting anywhere. How can I proceed?



Solution 1:[1]

I've solved your task for the initial CSV format; it could of course be adapted to XML input if needed.

I've done quite a fancy solution using NumPy, which is why it might look a bit complex, but it runs very fast and is suitable for large data, even gigabytes.

It uses a sorted table of words, also sorts the text's words, and does a sorted (binary) search in the table, hence it works in O(n log n) time.
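The sorted-search membership test at the core of the solution can be sketched in isolation (toy arrays here, not the real table):

```python
import numpy as np

table = np.sort(np.array(["can", "clear", "will"]))  # sorted table words
words = np.array(["a", "clear", "will", "zzz"])      # words from one text line

pos = np.searchsorted(table, words)  # binary search: insertion point per word
mask = pos < table.size              # candidates whose insertion point is in range
mask[mask] = table[pos[mask]] == words[mask]  # keep only exact matches

print(words[mask])  # -> ['clear' 'will'], the words present in the table
```

`np.searchsorted` only returns where each word *would* be inserted, so the follow-up equality check is what turns it into a membership test.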

For each input line it outputs the original text, then a Found line listing each word found in the table, in sorted order, with (Count, Category, (TableRow, TableCol)), then a Non-Found line listing the words not found in the table together with their Count (number of occurrences of that word in the text).

A much simpler (but slower) similar solution follows after the first one.

Try it online!

import io, pandas as pd, numpy as np

# Instead of io.StringIO(...) provide filename.
tab = pd.read_csv(io.StringIO("""
Modal,Tentative,Certainty,Generalizing
Can,Anyhow,Undoubtedly,Generally
May,anytime,Ofcourse,Overall
Might,anything,Definitely,On the Whole
Must,hazy,No doubt,In general
Shall,hope,Doubtless,All in all
ought to,hoped,Never,Basically
will,uncertain,always,Essentially
need,undecidable,absolute,Most
Be to,occasional,assure,Every
Have to,somebody,certain,Some
Would,someone,clear,Often
Should,something,clearly,Rarely
Could,sort,inevitable,None
Used to,sorta,forever,Always
"""))
tabc = np.array(tab.columns.values.tolist(), dtype = np.str_)  # category names
taba = tab.values.astype(np.str_)                              # table cells as strings
tabw = np.char.lower(taba.ravel())                             # flattened, lower-cased words
tabi = np.zeros([tabw.size, 2], dtype = np.int64)              # (row, col) of each word
tabi[:, 0], tabi[:, 1] = [e.ravel() for e in np.split(np.mgrid[:taba.shape[0], :taba.shape[1]], 2, axis = 0)]
t = np.argsort(tabw)                                           # sort words for binary search
tabw, tabi = tabw[t], tabi[t, :]

texts = pd.read_csv(io.StringIO("""
Text
"When LIWC was first developed, the goal was to devise an efficient will system"
"Within a few years, it became clear that there are two very broad categories of words"
"Content words are generally nouns, regular verbs, and many adjectives and adverbs."
They convey the content of a communication.
"To go back to the phrase “It was a dark and stormy night” the content words are: “dark,” “stormy,” and “night.”"
""")).values[:, 0].astype(np.str_)

for i, (a, text) in enumerate(zip(map(np.array, np.char.split(texts)), texts)):
    # Unique lower-cased words of this line and their counts.
    vs, cs = np.unique(np.char.lower(a), return_counts = True)
    # Binary-search each word in the sorted table of words.
    ps = np.searchsorted(tabw, vs)
    # A word is in the table iff its insertion point is in range
    # and the table word at that point matches exactly.
    psm = ps < tabw.size
    psm[psm] = tabw[ps[psm]] == vs[psm]
    print(
        i, ': Text:', text,
        '\nFound:',
        ', '.join([f'"{vs[i]}": ({cs[i]}, {tabc[tabi[ps[i], 1]]}, ({tabi[ps[i], 0]}, {tabi[ps[i], 1]}))'
            for i in np.flatnonzero(psm).tolist()]),
        '\nNon-Found:',
        ', '.join([f'"{vs[i]}": {cs[i]}'
            for i in np.flatnonzero(~psm).tolist()]),
        '\n',
    )

Outputs:

0 : Text: When LIWC was first developed, the goal was to devise an efficient will system
Found: "will": (1, Modal, (6, 0))
Non-Found: "an": 1, "developed,": 1, "devise": 1, "efficient": 1, "first": 1, "goal": 1, "liwc": 1, "system": 1, "the": 1, "to": 1, "was": 2, "when":
 1

1 : Text: Within a few years, it became clear that there are two very broad categories of words
Found: "clear": (1, Certainty, (10, 2))
Non-Found: "a": 1, "are": 1, "became": 1, "broad": 1, "categories": 1, "few": 1, "it": 1, "of": 1, "that": 1, "there": 1, "two": 1, "very": 1, "withi
n": 1, "words": 1, "years,": 1

2 : Text: Content words are generally nouns, regular verbs, and many adjectives and adverbs.
Found: "generally": (1, Generalizing, (0, 3))
Non-Found: "adjectives": 1, "adverbs.": 1, "and": 2, "are": 1, "content": 1, "many": 1, "nouns,": 1, "regular": 1, "verbs,": 1, "words": 1

3 : Text: They convey the content of a communication.
Found:
Non-Found: "a": 1, "communication.": 1, "content": 1, "convey": 1, "of": 1, "the": 1, "they": 1

4 : Text: To go back to the phrase “It was a dark and stormy night” the content words are: “dark,” “stormy,” and “night.”
Found:
Non-Found: "a": 1, "and": 2, "are:": 1, "back": 1, "content": 1, "dark": 1, "go": 1, "night”": 1, "phrase": 1, "stormy": 1, "the": 2, "to": 2, "was":
 1, "words": 1, "“dark,”": 1, "“it": 1, "“night.”": 1, "“stormy,”": 1

The second solution is implemented in pure Python for simplicity; only the standard modules io and csv are used.

Try it online!

import io, csv

# Instead of io.StringIO(...) just read from filename.
tab = csv.DictReader(io.StringIO("""Modal,Tentative,Certainty,Generalizing
Can,Anyhow,Undoubtedly,Generally
May,anytime,Ofcourse,Overall
Might,anything,Definitely,On the Whole
Must,hazy,No doubt,In general
Shall,hope,Doubtless,All in all
ought to,hoped,Never,Basically
will,uncertain,always,Essentially
need,undecidable,absolute,Most
Be to,occasional,assure,Every
Have to,somebody,certain,Some
Would,someone,clear,Often
Should,something,clearly,Rarely
Could,sort,inevitable,None
Used to,sorta,forever,Always
"""))

texts = csv.DictReader(io.StringIO("""
"When LIWC was first developed, the goal was to devise an efficient will system"
"Within a few years, it became clear that there are two very broad categories of words"
"Content words are generally nouns, regular verbs, and many adjectives and adverbs."
They convey the content of a communication.
"To go back to the phrase “It was a dark and stormy night” the content words are: “dark,” “stormy,” and “night.”"
"""), fieldnames = ['Text'])

# Map each lower-cased table word to its category (column) name.
tabi = dict(sorted([(v.lower(), k) for e in tab for k, v in e.items()]))
texts = [e['Text'] for e in texts]

for text in texts:
    cnt, mod = {}, {}
    for word in text.lower().split():
        if word in tabi:
            cnt[word], mod[word] = cnt.get(word, 0) + 1, tabi[word]
    print(', '.join([f"'{word}': ({cnt[word]}, {mod[word]})" for word, _ in sorted(cnt.items(), key = lambda e: e[0])]))

It outputs:

'will': (1, Modal)
'clear': (1, Certainty)
'generally': (1, Generalizing)
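If what you want is a per-row total per category (as the question asks), the same word-to-category dictionary can feed a Counter. A minimal sketch with a hand-built subset of `tabi` (build the full mapping from the CSV as above):

```python
from collections import Counter

# Small hand-built subset of the word -> category mapping from above.
tabi = {"will": "Modal", "clear": "Certainty", "generally": "Generalizing"}

def category_counts(text, tabi):
    # Count, per category, how many table words occur in this text row.
    return Counter(tabi[w] for w in text.lower().split() if w in tabi)

row = "it became clear that this will generally hold"
print(category_counts(row, tabi))
# each of the three categories appears once in this row
```

`category_counts(row, tabi)["Modal"]` then gives the number of modal verbs in that row.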

I'm reading the CSV content from StringIO purely for convenience, so that the code contains everything without needing extra files. In your case you'll want to read from files directly; for that you can do the same as in the next code and next link (named Try it online!):

Try it online!

import io, csv

tab = csv.DictReader(open('table.csv', 'r', encoding = 'utf-8-sig'))
texts = csv.DictReader(open('texts.csv', 'r', encoding = 'utf-8-sig'), fieldnames = ['Text'])

tabi = dict(sorted([(v.lower(), k) for e in tab for k, v in e.items()]))
texts = [e['Text'] for e in texts]

for text in texts:
    cnt, mod = {}, {}
    for word in text.lower().split():
        if word in tabi:
            cnt[word], mod[word] = cnt.get(word, 0) + 1, tabi[word]
    print(', '.join([f"'{word}': ({cnt[word]}, {mod[word]})" for word, _ in sorted(cnt.items(), key = lambda e: e[0])]))

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1