Filter extracted text by HTML tag with Selenium in Python

I have to work with an html structure that looks like this:

<div class='divClass'>
    <h5>Article 1</h5>
    <p>Paragraph one written in 2022</p>
    <p>(1) This <sup>1</sup>paragraph <sup>2</sup>has footnotes.</p>
    <p>This paragraph has a different <a class='footnotelink'>3</a>footnote.</p>
</div>

I need to extract the text from this div, but the footnotes have to be filtered out.

Here are some more details about the structure:

  • There can be 0 or many <p> tags
  • Each <p> tag may or may not contain footnotes of any type
  • Each <p> tag can contain desirable numbers that should not be removed
  • The <h5> can be replaced by <h4>
  • Footnotes can be in <sup> tags or in <a> tags with class 'footnotelink'

If I use driver.find_element(By.CLASS_NAME, 'divClass').text, I get the unfiltered version, which looks like this:

Article 1\nParagraph one written in 2022\n(1) This 1paragraph 2has footnotes.\nThis paragraph has a different 3footnote.

What I need is this:

Article 1\nParagraph one written in 2022\n(1) This paragraph has footnotes.\nThis paragraph has a different footnote.

I can't simply filter out numbers because they may appear in the text outside of a footnote.
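To illustrate the problem, here is a small sketch that strips every digit from the unfiltered text shown above. The variable names are only illustrative.

```python
# Unfiltered text, as returned by .text in the example above
unfiltered = (
    "Article 1\n"
    "Paragraph one written in 2022\n"
    "(1) This 1paragraph 2has footnotes.\n"
    "This paragraph has a different 3footnote."
)

# Naive approach: drop every digit, whether or not it is a footnote
naive = "".join(ch for ch in unfiltered if not ch.isdigit())
print(naive)
```

The footnote markers disappear, but so do the article number, the year 2022, and the digit inside "(1)", which becomes "()".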

This question is similar but filters out text of all text nodes instead of only specific ones.

Edit: Specified that <p> tags can contain desirable numbers



Solution 1:[1]

What you can do here is:

  1. Get the entire text of the div.
  2. Get the text of each p element that contains a sup, or an a with the footnotelink class name.
  3. Remove the digits from each of the texts received in step 2.
  4. In the entire text, replace each text received in step 2 with its cleaned version from step 3, as follows:
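from selenium.webdriver.common.by import By

entire_text = driver.find_element(By.CLASS_NAME, 'divClass').text
# find_elements (plural) returns a list that can be iterated over
psups = driver.find_elements(By.XPATH, "//div[@class='divClass']//p[.//sup]")
pas = driver.find_elements(By.XPATH, "//div[@class='divClass']//p[.//a[@class='footnotelink']]")
# Texts of the paragraphs that contain footnotes
sub_texts = []
for ps in psups:
    sub_texts.append(ps.text)
for pa in pas:
    sub_texts.append(pa.text)
# The same texts with every digit removed
# (this also drops wanted digits in those paragraphs, such as "(1)")
sub_texts_cleaned = []
for sub in sub_texts:
    res = ''.join([i for i in sub if not i.isdigit()])
    sub_texts_cleaned.append(res)
# str.replace returns a new string, so assign the result back
for i in range(len(sub_texts)):
    entire_text = entire_text.replace(sub_texts[i], sub_texts_cleaned[i])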
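As an alternative to removing digits from the extracted text, the footnotes can be dropped by tag before the text is assembled. What follows is only a minimal sketch, not part of the original answer: it assumes the div's markup is read with get_attribute('outerHTML') and parsed with Python's standard-library html.parser. The names FootnoteFilter and strip_footnotes are hypothetical helpers, and the sample markup is the one from the question.

```python
from html.parser import HTMLParser


class FootnoteFilter(HTMLParser):
    """Collects text per block element, skipping <sup> and <a class='footnotelink'>."""

    BLOCK_TAGS = {"h4", "h5", "p"}  # assumption: only these hold wanted text

    def __init__(self):
        super().__init__()
        self.lines = []        # one finished line per block element
        self.current = []      # text pieces of the block being read
        self.skip_tag = None   # name of the footnote tag being skipped
        self.skip_depth = 0    # nesting depth of that tag

    @staticmethod
    def _is_footnote(tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        return tag == "sup" or (tag == "a" and "footnotelink" in classes)

    def handle_starttag(self, tag, attrs):
        if self.skip_tag is None:
            if self._is_footnote(tag, attrs):
                self.skip_tag, self.skip_depth = tag, 1
        elif tag == self.skip_tag:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if self.skip_tag is not None:
            if tag == self.skip_tag:
                self.skip_depth -= 1
                if self.skip_depth == 0:
                    self.skip_tag = None
        elif tag in self.BLOCK_TAGS:
            self.lines.append("".join(self.current).strip())
            self.current = []

    def handle_data(self, data):
        if self.skip_tag is None:
            self.current.append(data)


def strip_footnotes(html):
    """Return the text of h4/h5/p elements, one per line, without footnotes."""
    parser = FootnoteFilter()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.lines)


# With Selenium, the markup could be obtained like this (needs a live driver):
# html = driver.find_element(By.CLASS_NAME, 'divClass').get_attribute('outerHTML')
sample_html = """<div class='divClass'>
    <h5>Article 1</h5>
    <p>Paragraph one written in 2022</p>
    <p>(1) This <sup>1</sup>paragraph <sup>2</sup>has footnotes.</p>
    <p>This paragraph has a different <a class='footnotelink'>3</a>footnote.</p>
</div>"""

print(strip_footnotes(sample_html))
```

Because the filter works on tags rather than on digits, "2022" and "(1)" are kept. Note that this sketch only collects text inside h4, h5 and p elements, and it does not reproduce the whitespace normalization that Selenium's .text applies.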

Solution 2:[2]

With regex

import re

article = driver.find_element(By.CLASS_NAME, 'divClass').text
article = re.sub(r'\d{1,}footnote', 'footnote', article)
print(article)

\d{1,}footnote matches one or more digits immediately before the word "footnote", so only those digits are removed.
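A quick check of this pattern against two lines from the question's example output shows what it does and does not match. The variable names here are only illustrative.

```python
import re

pattern = r"\d{1,}footnote"

line_a = "This paragraph has a different 3footnote."
line_b = "(1) This 1paragraph 2has footnotes."

# The digit sits directly before "footnote", so it is removed
cleaned_a = re.sub(pattern, "footnote", line_a)
# No digit sits directly before "footnote" here, so nothing changes
cleaned_b = re.sub(pattern, "footnote", line_b)

print(cleaned_a)
print(cleaned_b)
```

The pattern depends on the literal word that follows the footnote number, so the markers before "paragraph" and "has" in the second line are left in place.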

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Prophet
Solution 2 Max Daroshchanka