Python BeautifulSoup: get the text before a certain tag
I have the following HTML that I want to parse with Python's BeautifulSoup:
<html>
<head>
<script> ... </script>
<title> ... </title>
<style> ... </style>
</head>
<body onload="nextHit()">
S. <a name="hit1"></a><span style="background-color: #FFFF00">NO</span>. 178 H. <a name="hit2"></a><span style="background-color: #FFFF00">NO</span>. 1323 / 46 OG <a name="hit3"></a><span style="background-color: #FFFF00">No</span>. 12, 5977 (December, 1950)
<center>
<h2>...</h2>
<h3>...</h3>
</center>
<br>
....Lines omitted for brevity (more brs, divs, prs)...
</body>
I only want the text at the start of the body tag, just before the first center tag, like so:
S. NO. 178 H. NO. 1323 / 46 OG No. 12, 5977 (December, 1950)
I have tried:
ogsourcing = soup.find('center').previousSibling
But that returns only the last part, like so:
. 12, 5977 (December, 1950)
Solution 1:[1]
Version 2; based on OP's comment
- find() the <center> element
- Use previous_siblings to get an iterator over all the preceding siblings
- Loop over them, appending each .text to a list
- Reverse the list, since we're looping from bottom to top
- ''.join() the list to get the desired string
from bs4 import BeautifulSoup
html = """
<html>
<head>
<script></script>
<title></title>
<style></style>
</head>
<body onload="nextHit()">
S. <a name="hit1"></a><span style="background-color: #FFFF00">NO</span>. 178 H. <a name="hit2"></a><span style="background-color: #FFFF00">NO</span>. 1323 / 46 OG <a name="hit3"></a><span style="background-color: #FFFF00">No</span>. 12, 5977 (December, 1950)
<center>
<h2>foo</h2>
<h3>bar</h3>
</center>
<br>
<em>test</em>
<div>
<em>test</em>
</div>
</body>
</html>
"""
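res = []
soup = BeautifulSoup(html, 'html.parser')
# previous_siblings walks backwards from <center>, nearest node first
for sibling in soup.find('center').previous_siblings:
    res.append(sibling.text)
# Restore document order before joining
res.reverse()
res = ''.join(res)
print(res)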
The above print() will output:
S. NO. 178 H. NO. 1323 / 46 OG No. 12, 5977 (December, 1950)
You might want to add a .strip() to remove any leading and trailing whitespace and newlines.
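As a rough sketch of the same idea with .strip() applied, the steps could be wrapped in a small helper. The function name text_before and the shortened sample markup are made up for illustration and are not part of the original answer. The isinstance check is only there because text nodes (NavigableString) are already str subclasses, so get_text() is needed only on tags.

```python
from bs4 import BeautifulSoup


def text_before(soup, tag_name):
    """Return the stripped text of all siblings preceding the first <tag_name>."""
    target = soup.find(tag_name)
    if target is None:
        return ""
    # previous_siblings walks backwards from the target, so restore document order.
    nodes = reversed(list(target.previous_siblings))
    # Text nodes are str subclasses; only Tag nodes need get_text().
    parts = [node if isinstance(node, str) else node.get_text() for node in nodes]
    return "".join(parts).strip()


# Shortened, hypothetical sample in the same shape as the question's markup.
sample = (
    '<body>\nS. <a name="hit1"></a><span>NO</span>. 178 H. '
    '<a name="hit2"></a><span>NO</span>. 1323\n'
    "<center><h2>foo</h2></center></body>"
)

result = text_before(BeautifulSoup(sample, "html.parser"), "center")
print(result)
```

Returning an empty string when the tag is missing is a design choice for this sketch; raising an error instead may suit stricter scraping code better.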
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 |
