How to parse multiple <ul><li> HTML with Python?
I have a document like this:
TEXT
TEXT
<ul>
<li>1</li>
<ul>
<li>2</li>
<li>3</li>
</ul>
<li>4</li>
</ul>
ANOTHER TEXT
What can I use to transform it into:
TEXT
TEXT
* 1
** 2
** 3
* 4
ANOTHER TEXT
I need to parse only the ul/li parts; TEXT (which contains no ul/li) should be left intact without any changes.
I wrote a parser:
import re

def uls(str):
    # Wrap the content of every <li> in a pseudo-tag so the tag scan below can see it
    str = re.sub(r'<li>(.*?)</li>', r"<li><!!\1></li>", str, flags=re.DOTALL)
    ret_text = []
    ul_level = 0
    text = ''
    pattern = re.compile(r'(<.*?>)')
    for tag in re.findall(pattern, str):
        if tag == '<ul>':
            ul_level += 1
        if tag == '</ul>':
            ul_level -= 1
            # An outermost list has just closed: store its converted text
            if ul_level == 0:
                ret_text.append(text)
                text = ''
        if re.search(r'<!!(.*?)>', tag, re.DOTALL):
            # flags must be passed by keyword: the 4th positional argument of re.sub is count
            text = text + ('*' * ul_level) + re.sub(r'<!!(.*?)>', r' \1\n', tag, flags=re.DOTALL)
    return ret_text
It produces the correct array, but how can I replace
- ...
Solution 1:[1]
First and foremost, don't parse HTML with regex; use a proper parser. Second, even with a proper parser, it's going to be difficult to get exactly your expected output. The following (admittedly somewhat hackish) should get you close enough:
from lxml import etree  # you'll have to read up on lxml/xpath...

ht = """<html>TEXT1
TEXT2
<ul>
<li>1</li>
<ul>
<li>2</li>
<li>3</li>
</ul>
<li>4</li>
</ul>
ANOTHER TEXT3
</html>"""
doc = etree.fromstring(ht)
tree = etree.ElementTree(doc)
txts = ['text', 'tail']
for elem in doc.xpath('//*'):
    for txt in txts:
        # elem.text / elem.tail can be None, so fall back to an empty string
        target = (getattr(elem, txt) or '').strip()
        if target:
            # the next line counts the number of tiers and prints the appropriate number of '*'s:
            print(tree.getelementpath(elem).count('/') * "*", target)
Output:
TEXT1
TEXT2
ANOTHER TEXT3
* 1
** 2
** 3
* 4
As I said, pretty close.
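If the text has to stay in document order (so that ANOTHER TEXT comes after the list, as in the question), one possible alternative is an event-based parser. The following is only a minimal sketch using the standard library's html.parser, not the approach above; the names ListToStars and convert are made up for illustration, and it assumes the surrounding text contains no markup other than ul/li, since any other tags would simply be dropped.

```python
from html.parser import HTMLParser


class ListToStars(HTMLParser):
    """Hypothetical helper: emit '*'-prefixed lines for <li> text, keep other text."""

    def __init__(self):
        super().__init__()
        self.ul_depth = 0   # how many <ul> elements are currently open
        self.li_depth = 0   # how many <li> elements are currently open
        self.lines = []     # output lines, in document order

    def handle_starttag(self, tag, attrs):
        if tag == "ul":
            self.ul_depth += 1
        elif tag == "li":
            self.li_depth += 1

    def handle_endtag(self, tag):
        if tag == "ul":
            self.ul_depth -= 1
        elif tag == "li":
            self.li_depth -= 1

    def handle_data(self, data):
        if self.li_depth > 0:
            # Text of a list item: one '*' per open <ul>
            item = data.strip()
            if item:
                self.lines.append("*" * self.ul_depth + " " + item)
        elif self.ul_depth == 0:
            # Text outside any list: keep each non-blank line unchanged
            self.lines.extend(line for line in data.splitlines() if line.strip())
        # Whitespace between tags inside a <ul> is ignored


def convert(html):
    parser = ListToStars()
    parser.feed(html)
    parser.close()  # flushes any text after the last tag
    return "\n".join(parser.lines)
```

Calling convert() on the document from the question should give the TEXT lines first, then the starred items, then ANOTHER TEXT. Blank lines outside the lists are dropped in this sketch, so adjust handle_data if they need to be preserved.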
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Jack Fleeting |
