Filter extracted text by HTML tag with Selenium Python
I have to work with an html structure that looks like this:
<div class='divClass'>
<h5>Article 1</h5>
<p>Paragraph one written in 2022</p>
<p>(1) This <sup>1</sup>paragraph <sup>2</sup>has footnotes.</p>
<p>This paragraph has a different <a class='footnotelink'>3</a>footnote.</p>
</div>
I need to extract the text from this div, but the footnotes have to be filtered out.
Here are some more details about the structure:
- There can be 0 or many <p> tags
- Each <p> tag may or may not contain footnotes of any type
- Each <p> tag can contain desirable numbers that should not be removed
- The <h5> can be replaced by <h4>
- Footnotes can be in <sup> tags or in <a> tags with class 'footnotelink'
If I use driver.find_element(By.CLASS_NAME, 'divClass').text I receive the unfiltered version which looks like this:
Article 1\nParagraph one written in 2022\n(1) This 1paragraph 2has footnotes.\nThis paragraph has a different 3footnote.
What I need is this:
Article 1\nParagraph one written in 2022\n(1) This paragraph has footnotes.\nThis paragraph has a different footnote.
I can't simply filter out numbers because they may appear in the text outside of a footnote.
This question is similar but filters out text of all text nodes instead of only specific ones.
Edit: Specified that <p> tags can contain desirable numbers
Solution 1:[1]
What you can do here is:
- Get the entire text.
- Get the texts of the p elements containing sup, or a with the footnotelink class name.
- From the former remove numbers.
- From the entire text replace texts received in step 2 by texts received in step 3, as follows:
from selenium.webdriver.common.by import By

entire_text = driver.find_element(By.CLASS_NAME, 'divClass').text
# find_elements (plural) returns a list that can be iterated over
psups = driver.find_elements(By.XPATH, "//div[@class='divClass']//p[.//sup]")
pas = driver.find_elements(By.XPATH, "//div[@class='divClass']//p[.//a[@class='footnotelink']]")
sub_texts = []
for ps in psups:
    sub_texts.append(ps.text)
for pa in pas:
    sub_texts.append(pa.text)
sub_texts_cleaned = []
for sub in sub_texts:
    # Drops every digit in the paragraph, not only the footnote markers
    res = ''.join([i for i in sub if not i.isdigit()])
    sub_texts_cleaned.append(res)
for i in range(len(sub_texts)):
    # str.replace returns a new string, so assign the result back
    entire_text = entire_text.replace(sub_texts[i], sub_texts_cleaned[i])
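Because step 3 strips every digit from the affected paragraphs, a desirable number such as the (1) in the sample would be lost as well. Below is a minimal sketch of a tag-based variant, assuming the div's markup is available as a string and is parsed with Python's standard html.parser. FootnoteStripper and strip_footnotes are hypothetical names introduced for illustration and are not part of Selenium. It also assumes all visible text sits inside h4, h5 or p tags.

```python
from html.parser import HTMLParser

# Assumption: these are the only block-level tags that hold text in the div
BLOCK_TAGS = {"h4", "h5", "p"}


class FootnoteStripper(HTMLParser):
    """Collect text, skipping <sup> and <a class="footnotelink"> content."""

    def __init__(self):
        super().__init__()
        self.lines = []        # finished block-level lines
        self.current = []      # text pieces of the block being read
        self.skip_tag = None   # name of the footnote tag being skipped
        self.skip_depth = 0    # nesting depth of that tag

    def handle_starttag(self, tag, attrs):
        if self.skip_tag is not None:
            if tag == self.skip_tag:
                self.skip_depth += 1
            return
        classes = (dict(attrs).get("class") or "").split()
        if tag == "sup" or (tag == "a" and "footnotelink" in classes):
            self.skip_tag, self.skip_depth = tag, 1

    def handle_endtag(self, tag):
        if self.skip_tag is not None:
            if tag == self.skip_tag:
                self.skip_depth -= 1
                if self.skip_depth == 0:
                    self.skip_tag = None
            return
        if tag in BLOCK_TAGS:
            self.flush_block()

    def handle_data(self, data):
        if self.skip_tag is None:
            self.current.append(data)

    def flush_block(self):
        # Collapse whitespace runs, roughly as a browser would when rendering
        text = " ".join("".join(self.current).split())
        if text:
            self.lines.append(text)
        self.current = []


def strip_footnotes(html):
    """Return the text of the markup, one line per block, without footnotes."""
    parser = FootnoteStripper()
    parser.feed(html)
    parser.close()
    parser.flush_block()
    return "\n".join(parser.lines)
```

With Selenium, the markup could be read with driver.find_element(By.CLASS_NAME, 'divClass').get_attribute('outerHTML') and passed to strip_footnotes. Numbers outside footnote tags, such as 2022 or (1), are left untouched because only the tag decides what is dropped.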
Solution 2:[2]
With regex
import re
article = driver.find_element(By.CLASS_NAME, 'divClass').text
article = re.sub(r'\d{1,}footnote', 'footnote', article)
print(article)
\d{1,}footnote matches one or more digits immediately before the literal word footnote.
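As a quick check of what this pattern does, here is a small sketch that applies it to the unfiltered sample text from the question. The string is hard-coded here rather than read from a page, and the variable names are only for illustration. The pattern only touches digits that are directly followed by the word footnote, so markers in front of other words stay as they are.

```python
import re

# Unfiltered text as shown in the question (hard-coded for this sketch)
sample = (
    "Article 1\n"
    "Paragraph one written in 2022\n"
    "(1) This 1paragraph 2has footnotes.\n"
    "This paragraph has a different 3footnote."
)

# Same pattern as in the answer: digits directly before the word "footnote"
cleaned = re.sub(r'\d{1,}footnote', 'footnote', sample)
print(cleaned)
```

On this sample, only the 3 in front of footnote is removed. The 1 and 2 in front of paragraph and has remain, and 2022 and (1) are preserved.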
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Prophet |
| Solution 2 | Max Daroshchanka |
