Tag contains text, but also another tag with text. How do I get the text, but not the text within the extra tag, with BeautifulSoup in Python?
I have the following tag:
<div class="example_class">
<b>
<img src="image_source"/>
extra information
<a href="reference">
extra information
</a>
</b>
.
<br/>
<br/>
This is the text I want to get.
<lt>
</lt>
br /
<gt>
</gt>
<lt>
</lt>
br /
<gt>
</gt>
<lt>
</lt>
br /
<gt>
</gt>
This is the rest of the text.
</div>
I want to get the text 'This is the text I want to get. This is the rest of the text.', but I don't know how. When I try the following:
soup_result = soup.find('div',{'class': 'example_class'})
result = soup_result.get_text()
I get:
'\n\n\n extra information\n \n extra information\n \n\n .\n \n\n This is the text I want to get.\n \n\n br /\n \n\n\n\n br /\n \n\n\n\n br /\n \n\n This is the rest of the text.'
How do I keep 'extra information', the newlines, and the long runs of whitespace between them out of the result?
Solution 1:[1]
I'm assuming the "br /" fragments are really <br /> tags. You can call .extract() on the tags you don't want in the result before calling .get_text():
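div = soup.find(class_="example_class")
# Remove every <b> tag, and everything nested inside it, from the tree
for b in div.find_all("b"):
    b.extract()
# Strip each remaining string and join the pieces with newlines
text = div.get_text(strip=True, separator="\n")
print(text)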
Prints:
.
This is the text I want to get.
This is the rest of the text.
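If modifying the tree is undesirable, one possible alternative is to read only the strings that are direct children of the div. The following is a minimal sketch, not part of the original answer: the sample markup is a cleaned-up stand-in for the question's HTML (again assuming the "br /" fragments are real <br/> tags), and the filter that drops the lone "." is an assumption specific to this sample.

```python
from bs4 import BeautifulSoup, NavigableString

# Sample markup modelled on the question; assumes "br /" are real <br/> tags.
html = """
<div class="example_class">
  <b><img src="image_source"/>extra information
    <a href="reference">extra information</a>
  </b>
  .
  <br/><br/>
  This is the text I want to get.
  <br/><br/><br/>
  This is the rest of the text.
</div>
"""

soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", class_="example_class")

# Collect only the strings that are direct children of the div,
# so text nested inside <b> or <a> is never visited.
parts = []
for child in div.children:
    if isinstance(child, NavigableString):
        text = child.strip()
        # Skip whitespace-only strings and the lone "." (specific to this sample).
        if text and text != ".":
            parts.append(text)

result = " ".join(parts)
print(result)  # This is the text I want to get. This is the rest of the text.
```

Unlike .extract(), this leaves the <b> tag in place, so the same soup can still be queried for it afterwards.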
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Andrej Kesely |
