Using regular expressions to manipulate tags with Python's re package
The input text file contains:
<html>
<header>
<title>This is a title</title>
</header>
<body>
<div>This is a div <div>This is a nested div</div></div>
</body>
</html>
and I want to write the following to another text file:
<l>
<r>
<e>This is a title</e>
</r>
<y>
<v>This is a div <v>This is a nested div</v></v>
</y>
</l>
Using regex in Python, how do I do this? Update: I have tried the following for the <> tags:
import re

def run():
    with open('input.txt') as f:
        fout = open('output.txt', 'w')
        count = 0
        for line in f:
            if not line:
                continue
            pat = re.findall('<[a-zA-Z]+>',line)
            for l in pat:
                y = re.sub('<[a-zA-Z]+>', '<{}>'.format(l[-2]), line, count=0, flags=0)
            fout.write(y)
Solution 1:[1]
I hope it's not too late to provide a possible solution to this. Here's my code:
import re

def run():
    f = """<html>
<tag>bruh</tag>
<a><bro>text here</bro></a>
</html>
"""
    g = ""
    # Repeat the substitution until a pass leaves the text unchanged.
    while g != f:
        g = f
        f = re.sub(r'<(.+?)(\w)>([\w\W\n\r]*)</\1\2>', r'<\2>\3</\2>', f)
    print(f)

run()
Output (with f set to the HTML from the question):
<l>
<r>
<e>This is a title</e>
</r>
<y>
<v>This is a div <v>This is a nested div</v></v>
</y>
</l>
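I keep applying the same substitution until no more substitutions are possible, which is what the loop condition g != f checks: the loop ends once the text after a pass is identical to the text before it. Several passes are needed because the greedy third group swallows everything up to the last matching closing tag, so each match renames only the outermost pair and leaves the tags nested inside it for a later pass.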
Note: I'm primarily a Java user, and have used Python maybe 5 times in the past. This isn't an excuse to justify a (most likely) wrong answer, but as a warning that there might be a few errors in specific cases I'm unaware of.
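Since every tag is simply renamed to its last letter, the repeated passes above are not strictly required. As an alternative sketch (a different technique from the loop in this answer), opening and closing tags can be rewritten independently in a single pass by giving re.sub a replacement function. The names TAG and rename_tags are made up for this example, and the pattern assumes tags consist of letters only and carry no attributes, as in the question's input.

```python
import re

# Matches an opening or closing tag without attributes, e.g. <div> or </div>.
# Assumption: tag names contain letters only, as in the question's input.
TAG = re.compile(r'<(/?)([a-zA-Z]+)>')

def rename_tags(text):
    """Replace every tag name with its last letter."""
    # group(1) is '/' for a closing tag and '' for an opening tag;
    # group(2) is the tag name, of which only the last letter is kept.
    return TAG.sub(lambda m: '<{}{}>'.format(m.group(1), m.group(2)[-1]), text)

sample = "<div>This is a div <div>This is a nested div</div></div>"
print(rename_tags(sample))
# prints: <v>This is a div <v>This is a nested div</v></v>
```

To process files as in the question, read the contents of input.txt, pass them to rename_tags, and write the result to output.txt. Tags with attributes are left untouched by this pattern.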
Solution 2:[2]
from bs4 import BeautifulSoup

with open('xml-sample.html', 'r') as f:
    html = f.read()

# The 'lxml' parser requires the lxml package; 'html.parser' is the built-in alternative.
soup = BeautifulSoup(html, 'lxml')

# find_all('div') also returns nested divs, so one loop renames every level.
# (Calling unwrap() on a nested div would remove its tag instead of renaming it.)
for tag in soup.find_all('div'):
    tag.name = 'v'
for tag in soup.find_all('html'):
    tag.name = 'l'
for tag in soup.find_all('header'):
    tag.name = 'r'
for tag in soup.find_all('title'):
    tag.name = 'e'
for tag in soup.find_all('body'):
    tag.name = 'y'

print(soup.prettify())
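The parser-based idea above can also be sketched with only the standard library, which avoids listing each tag name by hand. This is a swapped-in technique, not the Beautiful Soup code from this answer: it subclasses html.parser.HTMLParser and re-emits the document with every tag name cut down to its last letter. TagRenamer and rename_with_parser are hypothetical names, and the sketch assumes the input has no attributes, comments, or doctype, since those are dropped here.

```python
from html.parser import HTMLParser

class TagRenamer(HTMLParser):
    """Re-emit the document with each tag name reduced to its last letter."""

    def __init__(self):
        # convert_charrefs=False keeps entities such as &amp; as written.
        super().__init__(convert_charrefs=False)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Assumption: attributes are dropped; the question's input has none.
        self.parts.append('<{}>'.format(tag[-1]))

    def handle_endtag(self, tag):
        self.parts.append('</{}>'.format(tag[-1]))

    def handle_data(self, data):
        # Text between tags, including newlines, is copied unchanged.
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append('&{};'.format(name))

    def handle_charref(self, name):
        self.parts.append('&#{};'.format(name))

def rename_with_parser(html):
    parser = TagRenamer()
    parser.feed(html)
    parser.close()
    return ''.join(parser.parts)

print(rename_with_parser("<body>\n<div>This is a div</div>\n</body>"))
```

Unlike prettify(), this keeps the original line breaks and spacing, because the text between tags is passed through as it was read.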
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | Chris |
