Regex Python based on internal structures if they exist
I have a structure like this
[[word test]] or [[word | word2 ]] or [[word test2 # word]] ...
I need to extract everything before the # or the |, if they exist, and ignore what comes after. If they don't exist, return everything between the brackets.
So the results for the examples above will be:
word test
word
word test2
I have
variable = re.findall(r'\[\[(.*?)\]\]', docs[doc], re.IGNORECASE)
but this returns everything between the brackets instead of only the part before # or |.
Solution 1:[1]
You can also try this regular expression:
r'\[+\s*(.*?)\s*(?:[#|].*?)?]+'
Here is a complete example:
import re

inputs = [
    '[[word test]]',
    '[[word | word2 ]]',
    '[[word test2 # word]]',
]

pattern = r'\[+\s*(.*?)\s*(?:[#|].*?)?]+'

# findall returns the captured group for each match; take the first one per string
output = [re.findall(pattern, item)[0] for item in inputs]
print(output)  # ['word test', 'word', 'word test2']
I hope this works.
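Since the question runs re.findall over a whole document rather than one item at a time, here is a minimal sketch of the same pattern applied to a longer string. The function name extract_targets and the sample text are made up for illustration; only the pattern comes from the answer.

```python
import re

# Same pattern as in the answer, compiled once for reuse.
LINK_RE = re.compile(r'\[+\s*(.*?)\s*(?:[#|].*?)?]+')


def extract_targets(text):
    """Return the text before '#' or '|' for every bracketed item in text."""
    # With a single capturing group, findall returns a list of that group's matches.
    return LINK_RE.findall(text)


# Hypothetical sample document, not taken from the question.
sample = 'See [[word test]] and [[word | word2 ]] plus [[word test2 # word]].'
print(extract_targets(sample))  # ['word test', 'word', 'word test2']
```

Note that '.' does not match newlines by default, so this assumes each bracketed item sits on a single line.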
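Explanation:
'\[+' and ']+'
Match the opening and closing brackets, one or more of each.
\s* (group_necessary) \s* (group_unnecessary)?
This is the overall shape of what sits between the brackets. Wrapping the second group in '(?:)' makes it non-capturing, so 'group_unnecessary' is left out of the results:
\s* (group_necessary) \s* (?:group_unnecessary)?
'(.*?)'
Captures the 'group_necessary' lazily, so it stops as soon as the rest of the pattern can match. The surrounding '\s*' keeps leading and trailing whitespace out of the capture.
'(?:[#|].*?)?'
Matches '#' or '|' followed by '.*?', which is the 'group_unnecessary'. The last '?' makes the whole group optional (zero or one time).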
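To make the pieces above easier to see, the same pattern can be written with re.VERBOSE, which lets each part carry its own comment. This is only a sketch: the name VERBOSE_RE and the results list are mine, and the closing bracket is written as '\]+', which is equivalent to ']+' here.

```python
import re

# In verbose mode, whitespace and '#' comments outside character classes are ignored;
# the '#' inside [#|] stays literal because it is in a character class.
VERBOSE_RE = re.compile(r"""
    \[+             # one or more opening brackets
    \s*             # optional leading whitespace
    (.*?)           # group 1: the wanted text, as short as possible
    \s*             # optional trailing whitespace
    (?:[#|].*?)?    # optional, non-capturing: a '#' or '|' and whatever follows
    \]+             # one or more closing brackets
""", re.VERBOSE)

items = ['[[word test]]', '[[word | word2 ]]', '[[word test2 # word]]']
results = [VERBOSE_RE.search(item).group(1) for item in items]
print(results)  # ['word test', 'word', 'word test2']
```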
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 |
