Splitting words by whitespace without affecting brackets content using regex

I'm trying to tokenize sentences using re in Python, such as the following example:

I want a (hot chocolate)[food] and (two)[quantity] boxes of (crispy bacon)[food]

I want to tokenize by splitting on whitespace, but without breaking up the bracketed groups. For example, I want the split list to be:

["I", "want", "a", "(hot chocolate)[food]", "and", "(two)[quantity]", "boxes", "of", "(crispy bacon)[food]"]

How do I write the re.split expression to achieve this?



Solution 1:[1]

You can do this with the regex pattern: \s(?!\w+\))

import re
s = """I want a (hot chocolate)[food] and (two)[quantity] boxes of (crispy bacon)[food]"""
print(re.split(r'\s(?!\w+\))',s))
# ['I', 'want', 'a', '(hot chocolate)[food]', 'and', '(two)[quantity]', 'boxes', 'of', '(crispy bacon)[food]']

\s(?!\w+\))
The above pattern will NOT match any space that is followed by a single word and a ), i.e. the space before the last word inside ( ). Every other space is matched and used as a split point.

Test regex here: https://regex101.com/r/SRHEXO/1

Test python here: https://ideone.com/reIIcU

EDIT: Answer to the question from your comment:

Since your input has multiple words inside ( ), you can change the pattern to [\s,](?![\s\w]+\)), which allows any mix of words and spaces between the separator and the closing ).

Test regex here: https://regex101.com/r/Ea9XlY/1
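To make the difference between the two patterns concrete, here is a small sketch that runs both on an input with a three-word phrase inside the parentheses. The input string and the variable names are made up for illustration; only the two patterns come from the answer.

```python
import re

# Hypothetical input: three words inside the first pair of parentheses.
text = "I want a (hot dark chocolate)[food] and (two)[quantity] boxes"

# Original pattern: only protects the space before the last word inside ( ).
single_word = re.split(r'\s(?!\w+\))', text)

# Edited pattern: protects every space that is followed only by
# words/spaces and then a closing ).
multi_word = re.split(r'[\s,](?![\s\w]+\))', text)

print(single_word)
# ['I', 'want', 'a', '(hot', 'dark chocolate)[food]', 'and', '(two)[quantity]', 'boxes']
print(multi_word)
# ['I', 'want', 'a', '(hot dark chocolate)[food]', 'and', '(two)[quantity]', 'boxes']
```

Note that [\s,] also treats a comma as a separator, so a comma followed by a space (as in "a, b") produces an empty string in the result.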

Solution 2:[2]

Regular expressions, no matter how clever, are not always the right answer.

def split(s):
    result = []
    brace_depth = 0  # number of currently open ( or [
    temp = ''        # token being built
    for ch in s:
        if ch == ' ' and brace_depth == 0:
            # A space outside all brackets ends the current token.
            result.append(temp)
            temp = ''
        elif ch in '([':
            brace_depth += 1
            temp += ch
        elif ch in ')]':
            brace_depth -= 1
            temp += ch
        else:
            temp += ch
    # Flush the last token, if any.
    if temp != '':
        result.append(temp)
    return result

>>> s="I want a (hot chocolate)[food] and (two)[quantity] boxes of (crispy bacon)[food]"
>>> split(s)
['I', 'want', 'a', '(hot chocolate)[food]', 'and', '(two)[quantity]', 'boxes', 'of', '(crispy bacon)[food]']
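Because this approach tracks bracket depth, the same idea carries over to nested brackets, which the lookahead patterns in Solution 1 are not designed for. The following is only a sketch of a variant: the helper name split_outside_brackets and the nested input are hypothetical, and unlike the function above it skips the empty tokens produced by repeated spaces and never lets the depth go below zero.

```python
def split_outside_brackets(text):
    """Split on spaces that are not inside () or [] (hypothetical helper)."""
    tokens = []
    current = []  # characters of the token being built
    depth = 0     # number of currently open ( or [
    for ch in text:
        if ch in "([":
            depth += 1
        elif ch in ")]":
            depth = max(depth - 1, 0)  # assumption: ignore unbalanced closers
        if ch == " " and depth == 0:
            if current:  # skip empty tokens caused by repeated spaces
                tokens.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        tokens.append("".join(current))
    return tokens


# Made-up input with nested parentheses and a double space.
nested = "add (a (very hot) chocolate)[food]  now"
print(split_outside_brackets(nested))
# ['add', '(a (very hot) chocolate)[food]', 'now']
```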

Solution 3:[3]

The regex for whitespace is \s. Using this with re.split:

import re
print(re.split(r"\s", "I want a (hot chocolate)[food] and (two)[quantity] boxes of (crispy bacon)[food]"))

The output is ['I', 'want', 'a', '(hot', 'chocolate)[food]', 'and', '(two)[quantity]', 'boxes', 'of', '(crispy', 'bacon)[food]']

Note that this also splits on the spaces inside the parentheses, so it does not produce the list asked for in the question.
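A possible alternative that keeps each (...)[...] group in one piece is to match the tokens with re.findall rather than split on whitespace. This is a sketch and not part of any of the answers; the pattern and the names token_pattern and tokens are assumptions, and it expects every parenthesised group to be followed directly by a square-bracket label with no nesting.

```python
import re

s = "I want a (hot chocolate)[food] and (two)[quantity] boxes of (crispy bacon)[food]"

# A token is either a "(...)[...]" group or a run of non-whitespace characters.
# The group alternative comes first so it wins over \S+ at an opening "(".
token_pattern = re.compile(r'\([^)]*\)\[[^\]]*\]|\S+')

tokens = token_pattern.findall(s)
print(tokens)
# ['I', 'want', 'a', '(hot chocolate)[food]', 'and', '(two)[quantity]', 'boxes', 'of', '(crispy bacon)[food]']
```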

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2 Chris
Solution 3 anosha_rehan