Tag contains text, but also another tag with text. How do I get the text, but not the text within the extra tag, with BeautifulSoup in Python?

I have the following tag:

<div class="example_class">
 <b>
 <img src="image_source"/>
          extra information
          <a href="reference">
           extra information
          </a>
 </b>
         .
         <br/>
 <br/>
         This is the text I want to get.
         <lt>
 </lt>
         br /
         <gt>
</gt>
<lt>
</lt>
         br /
         <gt>
</gt>
<lt>
</lt>
         br /
         <gt>
</gt>
         This is the rest of the text.
</div>

I want to get the text 'This is the text I want to get. This is the rest of the text.', but I don't know how. When I try the following:

soup_result = soup.find('div',{'class': 'example_class'})
result = soup_result.get_text()

I get:

'\n\n\n         extra information\n         \n          extra information\n         \n\n        .\n        \n\n        This is the text I want to get.\n        \n\n        br /\n        \n\n\n\n        br /\n        \n\n\n\n        br /\n        \n\n        This is the rest of the text.'

How do I make sure that 'extra information' and the newlines with all the whitespace in between are not in the result?



Solution 1:[1]

I'm assuming the br / fragments are <br /> tags. You can .extract() the tags you don't want in the result before calling .get_text():

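# `soup` is the BeautifulSoup object built from the HTML in the question.
div = soup.find(class_="example_class")

# Remove every <b> tag, along with everything nested inside it, from the tree.
for b in div.find_all("b"):
    b.extract()

# Strip each remaining string and join the non-empty ones with newlines.
text = div.get_text(strip=True, separator="\n")
print(text)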

Prints:

.
This is the text I want to get.
This is the rest of the text.
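
If modifying the tree is undesirable, a possible alternative is to collect only the strings that are direct children of the div, which skips the text nested inside <b> without removing the tag. The following is a minimal sketch, not part of the original answer: direct_text is a hypothetical helper name, the HTML is a simplified copy of the question's markup that assumes the br / fragments are real <br/> tags, and dropping the lone "." is an assumption based on the result the question asks for.

```python
from bs4 import BeautifulSoup

# Simplified copy of the question's markup, assuming "br /" means <br/> tags.
html = """
<div class="example_class">
 <b>
  <img src="image_source"/>
  extra information
  <a href="reference">extra information</a>
 </b>
 .
 <br/><br/>
 This is the text I want to get.
 <br/><br/><br/>
 This is the rest of the text.
</div>
"""


def direct_text(tag):
    """Hypothetical helper: return the stripped, non-empty strings that are
    direct children of `tag`, skipping text inside nested tags."""
    strings = tag.find_all(string=True, recursive=False)
    return [s.strip() for s in strings if s.strip()]


soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", class_="example_class")

parts = direct_text(div)
print(parts)

# Assumption: the lone "." is unwanted, as in the question's desired result.
wanted = " ".join(p for p in parts if p != ".")
print(wanted)
```

Unlike .extract(), this leaves the <b> tag in place, so the same soup can still be queried for it afterwards.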

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Andrej Kesely