How to do a frequency test (occurrence rating) of each text in different paragraphs? (Python, pandas)

import numpy as np
import pandas as pd


d1 = {'keyword': ["apple", "orange", "banana", "strawberry", "pear"]}
df1 = pd.DataFrame(d1)
df1

d2 = {'text1': ["apple with apple with apple with orange", "orange with orange with strawberry", "banana and banana and banana and banana", "strawberry apple banana strawberry strawberry", "pear and pear"], 'chapter1_no': [1,2,3,4,5]}
df2 = pd.DataFrame(d2)
df2

d3 = {'text2': ["another banana with pear and banana", "orange and banana is orange", "strawberry with strawberry is double strawberry not banana", "apple and apple and samsung", "pear pair up fairly"], 'chapter2_no': ["a","b","c","d","e"]}
df3 = pd.DataFrame(d3)
df3

Say I have these three data frames and want to run a frequency test: how many times does each keyword appear in each chapter of both texts? Ultimately I want to see the correlation between the chapters of the two texts. For example, since "apple" appears most frequently in text1 chapter 1 and text2 chapter "d", we would conclude that chapter 1 and chapter "d" are correlated.

What is the most "pythonic" way to code the for-loop for this problem? I'd like to follow this framework:

First, compare the keywords with df2:

  1. For a keyword i, count how many times i appears in chapter1_no 1.
  2. Loop step 1 over all 5 chapters (from chapter1_no = 1 to chapter1_no = 5).
  3. Loop steps 1 to 2 over every keyword i.
  4. Get a result that looks something like {i1: 3, 0, 0, 1, 0}, {i2: ~} ...
  5. Store this as a data frame where the rows are i1, i2, i3, ..., the columns are chapter1_no, and the values are the frequencies.
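
The five steps above could be sketched roughly as follows. This is only an illustration, assuming that "appears" means a case-sensitive whole-word match on whitespace-separated tokens (no punctuation handling); the names `counts` and `freq1` are made up for the example.

```python
import pandas as pd

keywords = ["apple", "orange", "banana", "strawberry", "pear"]
df2 = pd.DataFrame({
    "text1": [
        "apple with apple with apple with orange",
        "orange with orange with strawberry",
        "banana and banana and banana and banana",
        "strawberry apple banana strawberry strawberry",
        "pear and pear",
    ],
    "chapter1_no": [1, 2, 3, 4, 5],
})

# Steps 1-3: for every keyword, loop over the chapters and count
# whole-word occurrences (assumption: words are separated by spaces).
counts = {}
for kw in keywords:
    row = []
    for text in df2["text1"]:
        row.append(text.split().count(kw))
    counts[kw] = row  # step 4: e.g. counts["apple"] is [3, 0, 0, 1, 0]

# Step 5: rows are keywords, columns are chapter numbers.
freq1 = pd.DataFrame.from_dict(
    counts, orient="index", columns=list(df2["chapter1_no"])
)
print(freq1)
```

The same loop would apply to df3 by swapping in `text2` and `chapter2_no`.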

Then do the same thing with the keywords and df3.

Finally, run the correlation test between the chapters in df2 and df3 (I don't know how to do this yet, other than manually looking at each chapter's most frequent keyword).
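
One possible way to compare the chapters is sketched below, under two assumptions that are not part of the original problem statement: (a) chapters are "matched" when they share the same most frequent keyword, and (b) as an alternative, the Pearson correlation between the keyword-count vectors of two chapters is used as the similarity score. The helper `keyword_counts` and the names `matches`, `corr`, and `best` are hypothetical.

```python
import pandas as pd

keywords = ["apple", "orange", "banana", "strawberry", "pear"]
text1 = pd.Series([
    "apple with apple with apple with orange",
    "orange with orange with strawberry",
    "banana and banana and banana and banana",
    "strawberry apple banana strawberry strawberry",
    "pear and pear",
], index=[1, 2, 3, 4, 5])
text2 = pd.Series([
    "another banana with pear and banana",
    "orange and banana is orange",
    "strawberry with strawberry is double strawberry not banana",
    "apple and apple and samsung",
    "pear pair up fairly",
], index=["a", "b", "c", "d", "e"])


def keyword_counts(texts, keywords):
    """Hypothetical helper: rows are keywords, columns are chapter labels."""
    return pd.DataFrame(
        {ch: [t.split().count(kw) for kw in keywords] for ch, t in texts.items()},
        index=keywords,
    )


freq1 = keyword_counts(text1, keywords)
freq2 = keyword_counts(text2, keywords)

# (a) Match chapters that share the same most frequent keyword.
top1 = freq1.idxmax()  # most frequent keyword per text1 chapter
top2 = freq2.idxmax()  # most frequent keyword per text2 chapter
matches = {
    c1: [c2 for c2 in top2.index if top2[c2] == top1[c1]]
    for c1 in top1.index
}

# (b) Pearson correlation between every pair of chapter count vectors.
corr = pd.DataFrame(index=freq1.columns, columns=freq2.columns, dtype=float)
for c1 in freq1.columns:
    for c2 in freq2.columns:
        corr.loc[c1, c2] = freq1[c1].corr(freq2[c2])

best = corr.idxmax(axis=1)  # best-correlated text2 chapter per text1 chapter
print(matches)
print(best)
```

With only five keywords the correlation values are very coarse, so they are better read as a rough ranking than as a statistical test.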

I would like to keep it simple in for-loop format, but if you have any suggestions that move away from the for-loop, feel free to share them!
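
For comparison, a loop-free variant might look like the sketch below. It assumes the same whole-word, whitespace-separated matching as described in the steps, and relies on `Series.str.split`, `Series.explode`, and `pd.crosstab`; the names `words` and `freq1` are illustrative only.

```python
import pandas as pd

keywords = ["apple", "orange", "banana", "strawberry", "pear"]
df2 = pd.DataFrame({
    "text1": [
        "apple with apple with apple with orange",
        "orange with orange with strawberry",
        "banana and banana and banana and banana",
        "strawberry apple banana strawberry strawberry",
        "pear and pear",
    ],
    "chapter1_no": [1, 2, 3, 4, 5],
})

# One row per (chapter, word) pair.
words = (
    df2.set_index("chapter1_no")["text1"]
    .str.split()
    .explode()
    .rename("keyword")
    .reset_index()
)

# Cross-tabulate words against chapters, then keep only the keywords
# (words such as "with" and "and" are dropped by the reindex).
freq1 = pd.crosstab(words["keyword"], words["chapter1_no"]).reindex(
    index=keywords, columns=list(df2["chapter1_no"]), fill_value=0
)
print(freq1)
```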

Thank you in advance.



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
