Combining consecutive matches and separating non-consecutive matches with re.findall
I have a string of the format:
my_string = 'hello|foo world|foo how|bar are|bar you|bar today|foo'
I want to return a list where all consecutive words that are followed by foo are grouped together in the same string, but words with a '|bar' word in between are in separate strings. If I try a lookahead with repetition:
re.findall(r'(\w+(?=\|foo\b))+',my_string)
returns
['hello', 'world', 'today']
but what I would like returned is
['hello world', 'today']
Because 'hello' and 'world' are not separated by a non-foo word.
In my real problem, the number of times sequences of words followed by 'foo' will appear in the string being searched is unknown, and 'bar' could be several different patterns.
I could solve it with a couple of replacements: first replacing every run of non-foo patterns with a split marker and splitting on that, then removing the foos and stripping spaces:
bars_removed = re.sub(r'(\w+\|(?!foo)[a-z]+ )+', 'split_string', my_string)
only_foo_words = [re.sub(r'\|foo', '', x).strip() for x in bars_removed.split('split_string')]
which returns the desired result, but I feel like there's a way to do this using findall or maybe finditer that I'm missing.
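For comparison, here is a minimal sketch of what a findall-based version might look like. It assumes that words are separated by whitespace and that every tag follows a pipe, as in the sample string. The helper name `group_tagged_runs` and its `tag` parameter are hypothetical, introduced only for illustration. The idea is to match a whole run of consecutive `word|foo` tokens as a single match and strip the tags afterwards, rather than trying to repeat a lookahead.

```python
import re


def group_tagged_runs(text, tag='foo'):
    """Return each run of consecutive 'word|tag' tokens as one space-joined string.

    Hypothetical helper: assumes whitespace-separated tokens of the form word|tag.
    """
    tagged = r'\w+\|' + re.escape(tag) + r'\b'
    # One tagged word followed by any number of whitespace-separated tagged words.
    # The group is non-capturing so findall returns the whole match.
    run_pattern = tagged + r'(?:\s+' + tagged + r')*'
    runs = re.findall(run_pattern, text)
    # Strip the '|tag' suffix from every word in each run.
    return [re.sub(r'\|' + re.escape(tag) + r'\b', '', run) for run in runs]


my_string = 'hello|foo world|foo how|bar are|bar you|bar today|foo'
print(group_tagged_runs(my_string))  # ['hello world', 'today']
```

Because the pattern only ever describes the foo tokens, it does not need to know what the non-foo tags look like, which may help when 'bar' can be several different patterns.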
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
