Splitting words by whitespace without affecting bracketed content using regex
I'm trying to tokenize sentences using re in Python. Here is an example sentence:
I want a (hot chocolate)[food] and (two)[quantity] boxes of (crispy bacon)[food]
I want to split on whitespace without breaking up the bracketed groups. For example, I want the resulting list to be:
["I", "want", "a", "(hot chocolate)[food]", "and", "(two)[quantity]", "boxes", "of", "(crispy bacon)[food]"]
How do I write the re.split expression to achieve this?
Solution 1:[1]
You can do this with the regex pattern: \s(?!\w+\))
import re
s = """I want a (hot chocolate)[food] and (two)[quantity] boxes of (crispy bacon)[food]"""
print(re.split(r'\s(?!\w+\))',s))
# ['I', 'want', 'a', '(hot chocolate)[food]', 'and', '(two)[quantity]', 'boxes', 'of', '(crispy bacon)[food]']
The pattern \s(?!\w+\)) uses a negative lookahead: it will NOT match a space that is immediately followed by a single word and a closing ), i.e. the space inside a two-word ( ) group. Every other space is a split point.
Test regex here: https://regex101.com/r/SRHEXO/1
Test python here: https://ideone.com/reIIcU
EDIT: Answer to the question from your comment:
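Since your input has more than two words inside ( ), you can change the pattern to [\s,](?![\s\w]+\)). The lookahead now allows any run of words and spaces before the closing ), and the [\s,] part means commas are treated as separators too.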
Test regex here: https://regex101.com/r/Ea9XlY/1
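To see the difference between the two patterns, here is a small sketch that runs both on a sentence with a longer group. The sample sentence is made up for illustration; it is not from the question.

```python
import re

# Made-up sample with four words inside the first ( ) group.
s = "I want (two large hot chocolates)[food] and (crispy bacon)[food]"

# Original pattern: only protects a space followed by ONE word and ")".
single = re.split(r'\s(?!\w+\))', s)

# Edited pattern: protects a space followed by any run of words/spaces and ")".
multi = re.split(r'[\s,](?![\s\w]+\))', s)

print(single)
# ['I', 'want', '(two', 'large', 'hot chocolates)[food]', 'and', '(crispy bacon)[food]']
print(multi)
# ['I', 'want', '(two large hot chocolates)[food]', 'and', '(crispy bacon)[food]']
```

Only the second pattern keeps the long group in one piece. Both patterns assume the text inside ( ) contains nothing but word characters and spaces.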
Solution 2:[2]
Regular expressions, no matter how clever, are not always the right answer.
def split(s):
    result = []
    brace_depth = 0
    temp = ''
    for ch in s:
        if ch == ' ' and brace_depth == 0:
            result.append(temp)
            temp = ''
        elif ch == '(' or ch == '[':
            brace_depth += 1
            temp += ch
        elif ch == ']' or ch == ')':
            brace_depth -= 1
            temp += ch
        else:
            temp += ch
    if temp != '':
        result.append(temp)
    return result
>>> s="I want a (hot chocolate)[food] and (two)[quantity] boxes of (crispy bacon)[food]"
>>> split(s)
['I', 'want', 'a', '(hot chocolate)[food]', 'and', '(two)[quantity]', 'boxes', 'of', '(crispy bacon)[food]']
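One case where the depth counter helps is nested brackets, which the lookahead patterns do not handle. Below is a minimal sketch of the same idea. The name split_outside_brackets and the nested sample are made up for illustration, and unlike the function above this variant drops the empty tokens that repeated spaces would produce.

```python
def split_outside_brackets(text):
    """Split on spaces that are not enclosed in () or []."""
    tokens = []
    current = []
    depth = 0  # number of brackets currently open
    for ch in text:
        if ch == ' ' and depth == 0:
            if current:  # assumption: skip empty tokens from repeated spaces
                tokens.append(''.join(current))
                current = []
            continue
        if ch in '([':
            depth += 1
        elif ch in ')]':
            depth -= 1
        current.append(ch)
    if current:
        tokens.append(''.join(current))
    return tokens


# Made-up input with a nested group and a double space.
nested = "order (a (very hot) chocolate)[food]  now"
print(split_outside_brackets(nested))
# ['order', '(a (very hot) chocolate)[food]', 'now']
```

Like the original, this does not check that the brackets are balanced or correctly paired. It only counts opens and closes.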
Solution 3:[3]
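The regex for whitespace is \s. Using it with re.split:
import re
print(re.split(r"\s", "I want a (hot chocolate)[food] and (two)[quantity] boxes of (crispy bacon)[food]"))
The output is ['I', 'want', 'a', '(hot', 'chocolate)[food]', 'and', '(two)[quantity]', 'boxes', 'of', '(crispy', 'bacon)[food]']
Note that this also splits on the spaces inside the parentheses, so '(hot chocolate)[food]' and '(crispy bacon)[food]' are not kept as single tokens.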
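If splitting on every space is too blunt, another option is to match the tokens themselves with re.findall instead of matching the separators. This is a different technique from the re.split answers, and the pattern is only a sketch: it assumes every group looks like (text)[label] with no nesting.

```python
import re

# Hypothetical token pattern: a "(...)[...]" group, otherwise a run of non-space characters.
TOKEN = re.compile(r'\([^)]*\)\[[^\]]*\]|\S+')

s = "I want a (hot chocolate)[food] and (two)[quantity] boxes of (crispy bacon)[food]"
tokens = TOKEN.findall(s)
print(tokens)
# ['I', 'want', 'a', '(hot chocolate)[food]', 'and', '(two)[quantity]', 'boxes', 'of', '(crispy bacon)[food]']
```

The bracketed alternative comes first in the pattern so that it wins over \S+ when a token starts with (.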
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | Chris |
| Solution 3 | anosha_rehan |
