Using regular expressions to manipulate tags with Python's re module

Input text file contains:

<html>
<header>
<title>This is a title</title>
</header>
<body>
        <div>This is a div <div>This is a nested div</div></div>
</body>
</html>

and I want to write the following to another text file:

<l>
<r>
<e>This is a title</e>
</r>
<y>
        <v>This is a div <v>This is a nested div</v></v>
</y>
</l>

How do I do this using regex in Python? Update: I have tried the following for the opening tags:

import re
def run():
    with open('input.txt') as f:
        fout  = open('output.txt', 'w')
        count = 0
        for line in f:
            if not line:
                continue
            pat = re.findall('<[a-zA-Z]+>',line)
            for l in pat:
                y = re.sub('<[a-zA-Z]+>', '<{}>'.format(l[-2]), line, count=0, flags=0)
                fout.write(y)


Solution 1:[1]

I hope it's not too late to provide a possible solution to this. Here's my code:

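import re

def run():
    # Short sample input; replace it with the contents of your file.
    f = """<html>
<tag>bruh</tag>
<a><bro>text here</bro></a>
</html>
"""
    g = ""

    # Repeat the substitution until a pass changes nothing.
    while g != f:
        g = f
        # \1 = tag name without its last letter, \2 = last letter, \3 = content.
        # [\w\W] matches any character, including newlines.
        f = re.sub(r'<(.+?)(\w)>([\w\W]*)</\1\2>', r'<\2>\3</\2>', f)

    print(f)

run()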

Output (with f set to the HTML from the question instead of the short sample above):

<l>
<r>
<e>This is a title</e>
</r>
<y>
        <v>This is a div <v>This is a nested div</v></v>
</y>
</l>

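I keep applying the same substitution until no more substitutions are possible, i.e. until a pass leaves the text unchanged; that is what the loop condition g != f checks. Each pass only renames the outermost matching tag pairs it finds, so nested tags are picked up on later passes.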

Note: I'm primarily a Java user and have used Python maybe 5 times in the past. This isn't an excuse for a (most likely) wrong answer, but a warning that there might be errors in specific cases I'm unaware of.
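For the input in the question, the repeated substitution above can probably be collapsed into a single pass, since every tag can be rewritten without looking at its partner. The following is a minimal sketch and not part of the original answer: the names rename_tags and convert_file are hypothetical, the file names are borrowed from the question's attempt, and the pattern assumes tags made only of ASCII letters with no attributes.

```python
import re

# Matches <name> or </name>; group 1 is the optional slash,
# group 2 is the last letter of the tag name.
TAG = re.compile(r'<(/?)[a-zA-Z]*([a-zA-Z])>')


def rename_tags(text):
    """Rename every opening and closing tag to the last letter of its name."""
    return TAG.sub(r'<\1\2>', text)


def convert_file(src='input.txt', dst='output.txt'):
    # File names are assumptions taken from the question's attempt.
    with open(src) as fin, open(dst, 'w') as fout:
        fout.write(rename_tags(fin.read()))


sample = '<div>This is a div <div>This is a nested div</div></div>'
print(rename_tags(sample))
# prints: <v>This is a div <v>This is a nested div</v></v>
```

Unlike the loop, this does not check that opening and closing tags pair up; it simply rewrites every tag it sees, so malformed input passes through with its mistakes intact.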

Solution 2:[2]

from bs4 import BeautifulSoup

with open('xml-sample.html', 'r') as f:
    html = f.read()
# html.parser ships with Python and keeps the tags as written in the file
soup = BeautifulSoup(html, 'html.parser')

# find_all('div') also returns nested divs, so one loop renames them all
for tag in soup.find_all('div'):
    tag.name = 'v'

for tag in soup.find_all('html'):
    tag.name = 'l'

for tag in soup.find_all('header'):
    tag.name = 'r'

for tag in soup.find_all('title'):
    tag.name = 'e'

for tag in soup.find_all('body'):
    tag.name = 'y'
print(soup.prettify())
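The code above lists each tag name by hand. If the rule is simply "keep the last letter of the tag name", a parser-based version can apply it to any tag. The sketch below swaps BeautifulSoup for html.parser from the standard library so it runs without extra packages; the class and function names are hypothetical, and it assumes input like the question's: no attributes, comments, or doctype (all of which this sketch would drop).

```python
from html.parser import HTMLParser


class LastLetterRenamer(HTMLParser):
    """Re-emit the input, renaming each tag to the last letter of its name."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Attributes are ignored for brevity; the question's input has none.
        self.parts.append('<{}>'.format(tag[-1]))

    def handle_endtag(self, tag):
        self.parts.append('</{}>'.format(tag[-1]))

    def handle_data(self, data):
        # Text between tags, including whitespace and newlines, is kept as is.
        self.parts.append(data)


def rename_with_parser(html):
    parser = LastLetterRenamer()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return ''.join(parser.parts)


print(rename_with_parser('<body>\n<div>This is a div</div>\n</body>'))
```

Because the text is collected exactly as the parser reports it, the original indentation survives, which prettify() would otherwise rewrite.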

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2 Chris