'Search values in a Pandas DataFrame with values from another DataFrame
I have 2 dataframes.
df_dora
| content | feature | id | |
|---|---|---|---|
| 1 | cyber hygien | risk management | 1 |
| 2 | cyber risk | risk management | 2 |
| ... | ... | ... ... | |
| 59 | intellig share | information sharing | 63 |
| 60 | inform share | information sharing | 64 |
df_corpus
| content | id | meta.name | meta._split_id | |
|---|---|---|---|---|
| 0 | market grow cyber attack... | 56a2a2e28954537131a4aa734f49e361 | 14_Group_AG_2021 | 0 |
| 1 | sec form file index | 7aedfd4df02687d3dff9897c925da508 | 14_Group_AG_2021 | 1 |
| ... | ... | ... | ... | |
| 213769 | cyber secur alert parent compani fina... | ab10325601597f203f3f0af7aa647112 | 17_La_Banque_2021 | 8581 |
| 213770 | intellig share statement parent compani fina... | 6af5687ac31849d19d2048e0b2ca472d | 17_La_Banque_2021 | 8582 |
I am trying to extract a count of each term listed in df_dora.content within df_corpus.content grouped by df_content.meta.name. I tried to use isin
df = df_corpus[df_corpus.content.isin(df_dora.content)]
len(df)
Returns only 17 rows
| content | id | meta.name | meta | |
|---|---|---|---|---|
| 41474 | incid | a4c478e0fad1b9775c05e01d871b3aaf | 3_Agricole_2021 | 10185 |
| 68690 | oper risk | 2e5139d82c242c89523110cc1110647a | 10_Banking_Group_PLC_2021 | 5525 |
| ... | ... | ... | ... | ... |
| 99259 | risk report | a84eefb9a4772d13eb67f2d6ae5215cb | 31_Building_Society_2021 | 4820 |
| 105662 | risk manag | e8050be841fedb6dd10599e8b4892a9f | 43_Bank_SA_2021 | 131 |
df_corpus.loc[df_corpus.content.isin(df_dora.content), 'content'].tolist()
also returns 17 rows
if I search for 2 of the terms that exist in df_dora directly in df_corpus
resiliency_term = df_corpus.loc[df_corpus['content'].str.contains("cyber risk|inform share", case=False)]
print(resiliency_term)
I get 243 rows (which matches what was in the original file.)
So given the above...my question is this how do I extract a count of each term listed in df_dora.content within df_corpus.content grouped by df_content.meta.name.
Thanks in advance for any help.
Solution 1:[1]
unique_vals = '|'.join(df_dora.content.unique())
df_corpus.groupby('meta.name').apply(lambda x: x.content.str.findall(unique_vals).explode().value_counts())
Output given your four lines of each:
17_La_Banque_2021 intellig share 1
Name: content, dtype: int64
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | BeRT2me |
