Regex Python based on internal structures if they exist

I have a structure like this

[[word test]] or [[word | word2 ]] or [[word test2 # word]] ...

I need to extract everything before the # or the |, if they exist, and ignore what comes after. If they don't exist, return everything between the brackets.

So the results for the examples above will be:

word test
word
word test2

I have

variable = re.findall(r'\[\[(.*?)\]\]', docs[doc], re.IGNORECASE)

but this returns everything between the brackets instead of only the part before # or |.



Solution 1:[1]

You can also try this regular expression (demo):

r'\[+\s*(.*?)\s*(?:[#|].*?)?]+'

A complete example:

import re

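# Sample inputs covering the three link forms from the question.
_input = [
    '[[word test]]',
    '[[word | word2 ]]',
    '[[word test2 # word]]',
]

# Capture the text before an optional '#' or '|' tail.
_re = r'\[+\s*(.*?)\s*(?:[#|].*?)?]+'

# Each input holds exactly one link, so take the first match.
output = [re.findall(_re, item)[0] for item in _input]
print(output)  # ['word test', 'word', 'word test2']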

I hope this works.
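The question applies the pattern to a whole document (docs[doc]) rather than to one link at a time. A minimal sketch of that use, assuming the document is a plain string; the names LINK_RE, extract_targets and sample are hypothetical and not part of the original answer:

```python
import re

# Same pattern as in the answer, compiled once for reuse.
LINK_RE = re.compile(r'\[+\s*(.*?)\s*(?:[#|].*?)?]+')


def extract_targets(text):
    """Return the part before '#' or '|' for every bracketed link in text."""
    return LINK_RE.findall(text)


# Hypothetical sample document containing all three link forms.
sample = 'See [[word test]] and [[word | word2 ]], then [[word test2 # word]].'
print(extract_targets(sample))  # ['word test', 'word', 'word test2']
```

Note that '\[+' and ']+' also accept single brackets such as [word]. If only double brackets should count as links, '\[\[' and '\]\]' would be the stricter choice.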

Explanation:

'\[+' and ']+'

These match the one or more opening and closing brackets that surround the link.

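\s* (group_necessary) \s* (group_unnecessary)?

Between the brackets, the pattern has this shape: optional whitespace, the part to keep, optional whitespace, and an optional part to discard. Wrapping the discarded part in '(?:...)' makes it a non-capturing group, so re.findall does not return it:

\s* (group_necessary) \s* (?:group_unnecessary)?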
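The effect of '(?:...)' on re.findall can be seen by running both variants on the same input. This is only an illustrative sketch; the variable names are hypothetical:

```python
import re

text = '[[word | word2 ]]'

# Capturing second group: findall returns one tuple per match.
capturing = re.findall(r'\[+\s*(.*?)\s*([#|].*?)?]+', text)

# Non-capturing second group: findall returns only the first group.
non_capturing = re.findall(r'\[+\s*(.*?)\s*(?:[#|].*?)?]+', text)

print(capturing)      # [('word', '| word2 ')]
print(non_capturing)  # ['word']
```

With two capturing groups, every result would need an extra indexing step to drop the unwanted part, which the non-capturing form avoids.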

-

'(.*?)'

This lazily captures the 'group_necessary': as few characters as possible, stopping as soon as the rest of the pattern can match.

'(?:[#|].*?)?'

This matches a '#' or '|' followed by '.*?', which is the 'group_unnecessary'. The trailing '?' makes the whole group optional (zero or one occurrence), so links without '#' or '|' still match.
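To check how the optional tail behaves, here is a small sketch with hypothetical inputs that contain no delimiter, one delimiter, and both delimiters. In each case the capture stops at the first '#' or '|':

```python
import re

pattern = re.compile(r'\[+\s*(.*?)\s*(?:[#|].*?)?]+')

# Hypothetical inputs: no delimiter, one delimiter, and both delimiters.
samples = ['[[plain]]', '[[word | a # b]]', '[[a # b | c]]']

results = {s: pattern.findall(s) for s in samples}
for link, captured in results.items():
    print(link, '->', captured)
# [[plain]] -> ['plain']
# [[word | a # b]] -> ['word']
# [[a # b | c]] -> ['a']
```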

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1