Create list based on HTML tags and plain text
I have this string:
data = "<div> <p> Hi Man, <strong> how are you</strong> today? </p> </div>"
I would like to turn it into the following list:
['', '<div>', ' ', '<p>', ' Hi Man, ', '<strong>', ' how are you', '</strong>', ' today? ', '</p>', ' ', '</div>', '']
At the moment I have this code:
import re
# initializing string
data = "<div> <p> Hi Man, <strong> how are you</strong> today? </p> </div>"
res2 = re.split(r'([<>])', data)
but the HTML tags are split into separate pieces instead of being kept whole as described above. Here is my result:
['', '<', 'div', '>', ' ', '<', 'p', '>', ' Hi Man, ', '<', 'strong', '>', ' how are you', '<', '/strong', '>', ' today? ', '<', '/p', '>', ' ', '<', '/div', '>', '']
Could you help me fix it? Thanks in advance.
Solution 1:[1]
Assuming the order of the output does not matter, this would work.
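import re
s = "<div> <p> Hi Man, <strong> how are you</strong> today? </p> </div>"
# Match each tag, from '<' to the nearest '>' (non-greedy)
ptrn1 = re.compile(r"(<.*?>)")
# Capture the text between a '>' and the next '<';
# the first character must be a word character or whitespace
ptrn2 = re.compile(r">([\w\d\s].*?)<")
tags = re.findall(ptrn1, s)
text = re.findall(ptrn2, s)
print(tags)
print(text)
# Tags first, then text: the original interleaving is not preserved
print(tags+text)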
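Here's the output:
['<div>', '<p>', '<strong>', '</strong>', '</p>', '</div>']
[' ', ' Hi Man, ', ' how are you', ' today? ', ' ']
['<div>', '<p>', '<strong>', '</strong>', '</p>', '</div>', ' ', ' Hi Man, ', ' how are you', ' today? ', ' ']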
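If the order does matter, as in the list the question asks for, one possible approach is to split on whole tags rather than on the individual `<` and `>` characters. The sketch below is not part of the original answer. The helper name `split_tags` is hypothetical, and the pattern assumes that no `>` character appears inside a tag (for example in an attribute value). For arbitrary real-world HTML, a proper parser would likely be more robust than a regular expression.

```python
import re

# Hypothetical helper name, not from the original answer.
def split_tags(data):
    # The capturing group makes re.split keep each matched tag in the result.
    # Assumption: no '>' character appears inside a tag.
    return re.split(r"(<[^>]*>)", data)

data = "<div> <p> Hi Man, <strong> how are you</strong> today? </p> </div>"
result = split_tags(data)
print(result)
```

Because the string starts and ends with a tag, `re.split` returns an empty string as the first and last element, which matches the leading and trailing `''` in the list requested in the question.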
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Dhiwakar Ravikumar |

