Identify strings having words from two different lists
I have a dataframe with three columns like this:
index string Result
1 The quick brown fox jumps over the lazy dog
2 fast and furious was a good movie
And I have two lists of words like this:
list1 = ["over", "dog", "movie"]
list2 = ["quick", "brown", "sun", "book"]
I want to identify strings that have at least one word from list1 AND at least one word from list2, such that the result will be as follows:
index string Result
1 The quick brown fox jumps over the lazy dog TRUE
2 fast and furious was a good movie FALSE
Explanation: The first sentence has words from both lists, so the result is TRUE. The second sentence has a word from list1 only and none from list2, so its result is FALSE.
Can we do that with Python? I used search techniques from NLTK, but I don't know how to combine the results from the two lists. Thanks.
Solution 1:[1]
Another option is to split the strings and use set.intersection with all in a list comprehension:
s_lists = [set(list1), set(list2)]
df['Result'] = [all(s_lst.intersection(s.split()) for s_lst in s_lists) for s in df['string'].tolist()]
Output:
index string Result
0 1 The quick brown fox jumps over the lazy dog True
1 2 fast and furious was a good movie False
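For anyone who wants to run this end to end, here is a minimal, self-contained sketch of the same set-intersection idea. The dataframe is rebuilt by hand from the question. The helper name has_word_from_each is hypothetical, and the lowercasing and regex tokenising are my own assumptions, not part of the answer above: s.split() splits on whitespace only and is case-sensitive, so a token like "dog." or "Dog" would not match "dog".

```python
import re

import pandas as pd

# Data rebuilt by hand from the question
df = pd.DataFrame({
    "index": [1, 2],
    "string": [
        "The quick brown fox jumps over the lazy dog",
        "fast and furious was a good movie",
    ],
})
list1 = ["over", "dog", "movie"]
list2 = ["quick", "brown", "sun", "book"]


def has_word_from_each(text, word_sets):
    """Return True if text shares at least one whole word with every set."""
    # Assumption: a "word" is a run of letters/digits/underscores,
    # compared case-insensitively.
    tokens = set(re.findall(r"\w+", text.lower()))
    return all(tokens & ws for ws in word_sets)


# Lowercase the lookup words once so they match the lowercased tokens
word_sets = [{w.lower() for w in lst} for lst in (list1, list2)]
df["Result"] = [has_word_from_each(s, word_sets) for s in df["string"]]
print(df)
```

If exact, case-sensitive matching on whitespace-separated tokens is what you want, the one-liner above is enough.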
Solution 2:[2]
If your dataframe (with the first two columns) is called df, you can do the following:
df['Result'] = (df['string'].str.contains('|'.join(list1))
& df['string'].str.contains('|'.join(list2)))
The result:
string Result
0 The quick brown fox jumps over the lazy dog True
1 fast and furious was a good movie False
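One caveat: str.contains treats the joined string as a regular expression and matches substrings, so "over" also matches "overnight" and "book" matches "notebook". If whole-word matching is needed, a possible variant is sketched below. The helper word_pattern and the third sample row are hypothetical, added only to show the difference.

```python
import re

import pandas as pd

df = pd.DataFrame({
    "string": [
        "The quick brown fox jumps over the lazy dog",
        "fast and furious was a good movie",
        "an overnight notebook",  # hypothetical row: substrings only
    ],
})
list1 = ["over", "dog", "movie"]
list2 = ["quick", "brown", "sun", "book"]


def word_pattern(words):
    """Build a regex matching any of the words as a whole word."""
    # re.escape guards against regex metacharacters in the words;
    # the non-capturing group avoids pandas' match-group warning.
    return r"\b(?:" + "|".join(re.escape(w) for w in words) + r")\b"


# Substring matching, as in the answer above
df["Substring"] = (df["string"].str.contains("|".join(list1))
                   & df["string"].str.contains("|".join(list2)))

# Whole-word matching
df["Result"] = (df["string"].str.contains(word_pattern(list1))
                & df["string"].str.contains(word_pattern(list2)))
print(df)
```

The third row comes out True under substring matching and False under whole-word matching.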
In response to your comment, perhaps the following does what you want:
words = set(list1).union(set(list2))
df['Result_2'] = [[*words.intersection(s.split())] for s in df['string'].tolist()]
The result:
... Result Result_2
... True [dog, quick, brown, over]
... False [movie]
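Since sets are unordered, the order of the words in Result_2 is arbitrary. If you would rather have the matched words in the order they appear in each sentence, something like the following sketch might work. It assumes the same whitespace splitting as above and keeps each matched word once.

```python
import pandas as pd

df = pd.DataFrame({
    "string": [
        "The quick brown fox jumps over the lazy dog",
        "fast and furious was a good movie",
    ],
})
list1 = ["over", "dog", "movie"]
list2 = ["quick", "brown", "sun", "book"]

words = set(list1) | set(list2)

# dict.fromkeys drops duplicates while preserving first-seen order
df["Result_2"] = [
    list(dict.fromkeys(w for w in s.split() if w in words))
    for s in df["string"]
]
print(df)
```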
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | |
