How to Extract Words Following a Key Word
I'm trying to extract the 4 words after "our", but I keep getting the words after "hour" and "your" as well.
Example text in the column: "my family will send an email in 2 hours when we arrive at."
What I want: nan (since there is no "our")
What I get: when we arrive at (because "hours" has "our" in it)
I tried the following code and still have no luck.
our = 'our\W+(?P<after>(?:\w+\W+){,4})'
Reviews_C['Review_for_Fam'] = Reviews_C.ReviewText2.str.extract(our, expand=True)
Can you please help?
Thank you!
Solution 1:[1]
You need to make sure "our" is preceded by whitespace or the start of the string, like this:
our = r'(?:^|\s)our\W+(?P<after>(?:\w+\W+){,4})'
The (?:^|\s) prefix is the part to adjust. This example only handles whitespace and the start of the string, so you may need to extend it to cover quotes or other special characters. The prefix is a non-capturing group so that str.extract returns only the "after" column.
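As an alternative to listing the allowed delimiters by hand, a regex word boundary (\b) also rules out "hours" and "your". Below is a minimal sketch using only Python's re module. The names OUR_PATTERN and words_after_our are hypothetical and not part of the original answer. The match is case-sensitive, so "Our" is not matched.

```python
import re

# Hypothetical pattern name. \b is a word boundary, so the "our" inside
# "hours" or "your" does not match; only the standalone word does.
OUR_PATTERN = re.compile(r"\bour\b\W+(?P<after>(?:\w+\W*){1,4})")


def words_after_our(text):
    """Return up to four words following the standalone word "our", or None."""
    match = OUR_PATTERN.search(text)
    if match is None:
        return None
    # Drop the whitespace captured after the last word.
    return match.group("after").strip()


print(words_after_our("my family will send an email in 2 hours when we arrive at."))
print(words_after_our("We love our new house by the lake."))
```

The first call prints None and the second prints "new house by the". Since the pattern has a single named group, the same pattern string should also be usable with str.extract, though that is not tested here.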
Solution 2:[2]
I'm surprised to see regex used for this, since it can add unneeded complexity. Could something like this work?
def extract_next_words(sentence):
    # split the sentence into words
    words = sentence.split()
    # "our" is absent: return None instead of letting index() raise ValueError
    if "our" not in words:
        return None
    # find the index of the first "our"
    index = words.index("our")
    # extract the next 4 words
    next_words = words[index+1:index+5]
    # join the words into a string
    return " ".join(next_words)
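A split-based lookup only matches the exact token "our", so "Our" at the start of a sentence or "our," followed by punctuation would be missed. The following is a minimal sketch of one way to tolerate both. The function name next_words_after and its key and n parameters are hypothetical. It assumes that splitting on non-word characters is acceptable, which also breaks "don't" into "don" and "t".

```python
import re


def next_words_after(sentence, key="our", n=4):
    """Return up to n words after the first standalone `key`, or None if absent."""
    # \w+ keeps only runs of letters, digits and underscores,
    # so "our," is tokenized as "our".
    words = re.findall(r"\w+", sentence)
    # Compare in lowercase so that "Our" matches as well.
    lowered = [word.lower() for word in words]
    if key.lower() not in lowered:
        return None
    index = lowered.index(key.lower())
    return " ".join(words[index + 1:index + 1 + n])


print(next_words_after("Our, family went to Paris in 2 hours"))
print(next_words_after("in 2 hours it is your turn"))
```

The first call prints "family went to Paris" and the second prints None, because "hours" and "your" are separate tokens that never equal "our".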
Solution 3:[3]
Here is generic code for finding the n words after a specific word x in a string. It accounts for multiple occurrences of x as well as for non-occurrence. An occurrence followed by fewer than n words is skipped.
def find_n_word_after_x(in_str, x, n):
    in_str_wrds = in_str.strip().split()
    x = x.strip()
    if x in in_str_wrds:
        out_lst = []
        for i, i_val in enumerate(in_str_wrds):
            if i_val == x:
                if i+n < len(in_str_wrds):
                    out_str = in_str_wrds[i+1:i+1+n]
                    out_lst.append(" ".join(out_str))
        return out_lst
    else:
        return []
str1 = "our w1 w2 w3 w4 w5 w6"
str2 = "our w1 w2 our w3 w4 w5 w6"
str3 = "w1 w2 w3 w4 our w5 w6"
str4 = "w1"
print(find_n_word_after_x(str1, 'our', 4))
print(find_n_word_after_x(str2, 'our', 4))
print(find_n_word_after_x(str3, 'our', 4))
print(find_n_word_after_x(str4, 'our', 4))
Generated Output:
['w1 w2 w3 w4']
['w1 w2 our w3', 'w3 w4 w5 w6']
[]
[]
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Antoine Baqain |
| Solution 2 | PCDSandwichMan |
| Solution 3 | Bhiman Kumar Baghel |
