Regular expression: for the string '<a><b>c', which pattern outputs '<b>c'? The pattern '<.*?>c' doesn't work

Sorry. I am not good at English.

string : '<a><b>c'

Pattern 1: '<.*?>' --> expected and actual output are the same: '<a>' and '<b>'. OK.

Pattern 2: '<.*?>c' --> expected: '<b>c'. But the actual output is '<a><b>c'. Why?

I don't know which pattern outputs '<b>c'.

Please help me.

Note that I am trying to parse HTML with Python.



Solution 1:[1]

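Your pattern is lazy, but it still matches more than you expect, because the regex engine always starts at the leftmost position where a match is possible. Here that is the first '<'. From there, .*? expands one character at a time, and since . also matches '>' and '<', it keeps expanding until '>c' can follow, which gives '<a><b>c'. You can prevent this by limiting the chars that can be matched by the regex engine, like this: <[^<>]*>c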

import re
s = """<a><b>c"""
print(re.findall(r"<[^<>]*>c", s))
# ['<b>c']

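The pattern <[^<>]*>c matches a '<', then any number of chars that are not angle brackets, then '>' followed by 'c'. Because [^<>]* cannot cross the '>' of '<a>', no match can start at '<a>', so the engine moves on and matches '<b>c'.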
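To see the difference concretely, here is a small side-by-side comparison of the two patterns on the string from the question. It uses re.search so that the start and end offsets of each match are visible; the variable names are only illustrative.

```python
import re

s = "<a><b>c"

# Lazy quantifier: the match still begins at the leftmost "<",
# and ".*?" may step over ">" and "<" until ">c" can follow.
lazy = re.search(r"<.*?>c", s)
print(lazy.group(), lazy.span())      # <a><b>c (0, 7)

# Negated class: "[^<>]*" cannot cross an angle bracket, so the attempt
# starting at "<a>" fails and the engine retries further right, at "<b>".
strict = re.search(r"<[^<>]*>c", s)
print(strict.group(), strict.span())  # <b>c (3, 7)
```

The span of the second match starts at offset 3, which confirms that the first tag is skipped rather than consumed.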

That said, regular expressions are fragile for HTML. A dedicated parser, such as Beautiful Soup or another HTML parsing library, is usually the more reliable way to parse HTML in Python.
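Beautiful Soup is a third-party package, so as a dependency-free illustration of the parser approach, here is a minimal sketch using the standard library's html.parser module instead. The class name TagTextCollector and its attributes are hypothetical, made up for this example; it only records which tag was opened most recently when a piece of text appears.

```python
from html.parser import HTMLParser


class TagTextCollector(HTMLParser):
    """Hypothetical helper: record (most recently opened tag, text) pairs."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # stack of tags opened so far
        self.pairs = []      # (tag, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop only when the closing tag matches the innermost open tag.
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()

    def handle_data(self, data):
        # Ignore text that appears before any tag has been opened.
        if self.open_tags:
            self.pairs.append((self.open_tags[-1], data))


parser = TagTextCollector()
parser.feed("<a><b>c")
parser.close()  # flush any text still buffered by the parser
print(parser.pairs)  # [('b', 'c')]
```

This assumes the input looks like the string in the question; real HTML with attributes, void elements, or unclosed tags would need more careful handling, which is where a full library helps.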

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 anotherGatsby