Python BeautifulSoup: get the text before a certain tag

I have the following HTML that I want to parse with Python's BeautifulSoup:

<html>
<head>  
<script> ... </script>
<title> ... </title>
<style> ... </style>
</head>

<body onload="nextHit()">
S. <a name="hit1"></a><span style="background-color: #FFFF00">NO</span>. 178 H. <a name="hit2"></a><span style="background-color: #FFFF00">NO</span>. 1323 / 46 OG <a name="hit3"></a><span style="background-color: #FFFF00">No</span>. 12, 5977 (December, 1950)
<center>
<h2>...</h2>
<h3>...</h3>    
</center>
<br>
....Lines omitted for brevity (more brs, divs, prs)...
</body> 

I only want the text at the beginning of the body tag, just before the first center tag, like so:

S. NO. 178 H. NO. 1323 / 46 OG No. 12, 5977 (December, 1950)
        

I have tried:

ogsourcing = soup.find('center').previousSibling

However, this returns only the last part:

. 12, 5977 (December, 1950)


Solution 1:[1]

Version 2; based on OP's comment


  1. find() the <center> element
  2. Use previous_siblings to get an iterator over all the siblings that precede it
  3. Loop over them, appending each .text to a list
  4. Reverse the list, since we're looping from bottom to top
  5. ''.join() the list to get the desired string

from bs4 import BeautifulSoup

html = """
<html>
    <head>
        <script></script>
        <title></title>
        <style></style>
    </head>

    <body onload="nextHit()">
        S. <a name="hit1"></a><span style="background-color: #FFFF00">NO</span>. 178 H. <a name="hit2"></a><span style="background-color: #FFFF00">NO</span>. 1323 / 46 OG <a name="hit3"></a><span style="background-color: #FFFF00">No</span>. 12, 5977 (December, 1950)
        <center>
        <h2>foo</h2>
        <h3>bar</h3>
        </center>
        <br>
        <em>test</em>
        <div>
            <em>test</em>
        </div>
    </body>
</html>
"""

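soup = BeautifulSoup(html, 'html.parser')

# previous_siblings walks backwards from <center>, nearest node first
res = []
for sibling in soup.find('center').previous_siblings:
    res.append(sibling.text)

# Restore document order, then join the pieces into one string
res.reverse()
res = ''.join(res)

print(res)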

The above print() will output:

S. NO. 178 H. NO. 1323 / 46 OG No. 12, 5977 (December, 1950)

The text nodes keep the indentation and line breaks of the source, so the result has leading and trailing whitespace. Call .strip() on it to remove that.
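
As a rough sketch of how the whitespace cleanup could be folded into the same approach, the helper below collects the preceding siblings, restores document order, and collapses whitespace. The function name text_before and the shortened HTML are made up for illustration and are not part of the original answer. It handles plain text nodes with str() rather than .text, on the assumption that older bs4 releases may not expose .text on them.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Shortened, hypothetical stand-in for the page in the question.
html = """
<body onload="nextHit()">
S. <a name="hit1"></a><span>NO</span>. 178 H. <a name="hit2"></a><span>NO</span>. 1323 / 46 OG <a name="hit3"></a><span>No</span>. 12, 5977 (December, 1950)
<center><h2>foo</h2></center>
<br>
</body>
"""


def text_before(soup, tag_name):
    """Return the text of all siblings that precede the first <tag_name>."""
    target = soup.find(tag_name)
    if target is None:
        return ""
    # previous_siblings yields the nearest node first, so reverse it
    # to get the nodes back in document order.
    nodes = reversed(list(target.previous_siblings))
    # Text nodes are converted with str(); tags contribute their inner text.
    parts = [
        str(node) if isinstance(node, NavigableString) else node.get_text()
        for node in nodes
    ]
    # split() + join collapses runs of whitespace and trims both ends.
    return " ".join("".join(parts).split())


soup = BeautifulSoup(html, "html.parser")
print(text_before(soup, "center"))
```

Unlike a bare .strip(), the split-and-join step also collapses whitespace inside the string. Use .strip() instead if the inner spacing needs to be preserved exactly.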

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1