Create list based on HTML tags and plain text

I have this string:

data = "<div> <p> Hi Man, <strong> how are you</strong> today? </p> </div>"

I would like to turn it into the following list:

['', '<div>', ' ', '<p>', ' Hi Man, ', '<strong>', ' how are you', '</strong>', ' today? ', '</p>', ' ', '</div>', '']

At the moment I have this code:

import re

# initializing string
data = "<div> <p> Hi Man, <strong> how are you</strong> today? </p> </div>"
res2 = re.split(r'([<>])', data)

but the HTML tags in the result are not kept whole as described above. Here is my result:

['', '<', 'div', '>', ' ', '<', 'p', '>', ' Hi Man, ', '<', 'strong', '>', ' how are you', '<', '/strong', '>', ' today? ', '<', '/p', '>', ' ', '<', '/div', '>', '']

Could you help me fix it? Thanks in advance.



Solution 1:[1]

If the order of the output does not matter, this works: it collects the tags and the text between them separately, then concatenates the two lists.

import re

s = "<div> <p> Hi Man, <strong> how are you</strong> today? </p> </div>"
# Match each tag, from "<" up to the nearest ">"
ptrn1 = re.compile(r"(<.*?>)")
# Match the text between a ">" and the next "<" (\w already covers digits)
ptrn2 = re.compile(r">([\w\s].*?)<")
tags = ptrn1.findall(s)
text = ptrn2.findall(s)
print(tags)
print(text)
print(tags + text)

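Here's the output:

['<div>', '<p>', '<strong>', '</strong>', '</p>', '</div>']
[' ', ' Hi Man, ', ' how are you', ' today? ', ' ']
['<div>', '<p>', '<strong>', '</strong>', '</p>', '</div>', ' ', ' Hi Man, ', ' how are you', ' today? ', ' ']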
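
If the order does matter, as in the list requested in the question, one option is to keep re.split but capture the whole tag instead of the individual angle brackets. The following is a minimal sketch, not part of the original answer. It assumes that no tag contains a literal ">" inside an attribute value, and the names tag_pattern and parts are illustrative only.

```python
import re

data = "<div> <p> Hi Man, <strong> how are you</strong> today? </p> </div>"

# Capture each whole tag ("<", any characters except ">", then ">").
# Because the pattern is a capturing group, re.split keeps the tags
# in the result, interleaved with the text between them.
tag_pattern = re.compile(r"(<[^>]*>)")
parts = tag_pattern.split(data)
print(parts)
```

The empty strings at both ends appear because the input starts and ends with a tag, which matches the list asked for in the question. For arbitrary real-world HTML a proper parser is likely to be more robust than a regular expression.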

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

[1] Solution 1: Dhiwakar Ravikumar