BS4, getting exact match between unclosed <br>

from bs4 import BeautifulSoup


html = '''<tbody id="plaintiff-body">
   <tr>
      <td><img id="plaimg0001" src="/CaseInformationOnline/images/minus.png" onclick="showhide('pladetail0001','','plaimg0001')"></td>
      <td>JENEE BENNETT</td>
      <td></td>
      <td>COURTNEY L HANNA</td>
   </tr>
   <tr id="pladetail0001" style="" valign="top">
      <td></td>
      <td>2348 WOODBROOK CIR N<br>UNIT D<br>COLUMBUS, OH 43223</td>
      <td></td>
      <td>JOSEPH &amp; JOSEPH CO LPA   <br>SUITE 200<br>155 W MAIN ST<br>COLUMBUS, OH 43215<br>(614) 449-8282<br><br>DEBORAH L MCNINCH<br>JOSEPH &amp; JOSEPH CO LPA   <br>THE WATERFORD, SUITE 200 <br>155 W MAIN ST<br>COLUMBUS, OH 43215<br>(614) 449-8282<br><br>S K DODDERER<br>155 W MAIN STREET<br>#200<br>COLUMBUS, OH 43215<br>(614) 449-8282</td>
   </tr>
</tbody>'''

soup = BeautifulSoup(html, 'lxml')
att = [x.get_text(strip=True, separator=' ') for x in soup.select(
    '#plaintiff-body tr:first-child > td:nth-child(4), #plaintiff-body tr:nth-child(2) > td:last-child')]
print(att)

Current output:

['COURTNEY L HANNA', 'JOSEPH & JOSEPH CO LPA SUITE 200 155 W MAIN ST COLUMBUS, OH 43215 (614) 449-8282 DEBORAH L MCNINCH JOSEPH & JOSEPH CO LPA THE WATERFORD, SUITE 200 155 W MAIN ST COLUMBUS, OH 43215 (614) 449-8282 S K DODDERER 155 W MAIN STREET #200 COLUMBUS, OH 43215 (614) 449-8282']

Desired Output:

['COURTNEY L HANNA', 'JOSEPH & JOSEPH CO LPA SUITE 200 155 W MAIN ST COLUMBUS, OH 43215 (614) 449-8282']

How can I achieve that?

I'm thinking of using https://www.crummy.com/software/BeautifulSoup/bs4/doc/#a-function to pass a function, loop over the match, and stop the loop once I find an empty <br>.

Otherwise I can take x itself instead of x.get_text(), split on >< to get the first index, and then use https://w3lib.readthedocs.io/en/latest/w3lib.html?highlight=remove#w3lib.html.remove_tags to remove the tags.

I'd be happy to know if there is a direct solution with CSS, or just a simpler one.
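
For reference, the second idea (cutting the markup at the first empty line and then stripping the tags) might look roughly like the sketch below. It swaps w3lib for BeautifulSoup itself to remove the tags, and it assumes the two <br> tags sit directly next to each other with no whitespace between them. The helper name first_block and the shortened cell are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened cell for illustration.
cell_html = (
    "<td>JOSEPH &amp; JOSEPH CO LPA   <br>SUITE 200<br>(614) 449-8282"
    "<br><br>DEBORAH L MCNINCH<br>SUITE 200</td>"
)


def first_block(td):
    """Return the text that precedes the first pair of consecutive <br> tags."""
    # decode_contents() serializes the cell's children; bs4 writes <br> as <br/>.
    markup = td.decode_contents().split("<br/><br/>")[0]
    # Re-parse the fragment so bs4 handles the remaining tags and entities.
    return BeautifulSoup(markup, "html.parser").get_text(separator=" ", strip=True)


td = BeautifulSoup(cell_html, "html.parser").td
result = first_block(td)
print(result)
```

Splitting on serialized markup is brittle: any whitespace or attribute inside the <br> pair would defeat the split, so walking the parsed nodes is probably the safer route.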



Solution 1:[1]

Another version:

import re

for br in soup.select("br"):
    br.replace_with("\n")

out = [
    re.sub(r"\s{2,}|\n", " ", td.text.split("\n\n")[0])
    for td in soup.select("td:last-child")
]
print(out)

Prints:

['COURTNEY L HANNA', 'JOSEPH & JOSEPH CO LPA SUITE 200 155 W MAIN ST COLUMBUS, OH 43215 (614) 449-8282']

With:

html = '''<tbody id="plaintiff-body">
                <tr><td><img id="plaimg0001" src="/CaseInformationOnline/images/minus.png" onclick="showhide('pladetail0001','','plaimg0001')"></td><td>TIMOTHY MOORE</td><td></td><td>TIMOTHY MOORE</td></tr><tr id="pladetail0001" style="" valign="top"><td></td><td>62 KEENE DRIVE<br>WESTERVILLE, OH 43081</td><td></td><td>62 KEENE DRIVE<br>WESTERVILLE, OH 43081</td></tr>
                </tbody>'''

Prints:

['TIMOTHY MOORE', '62 KEENE DRIVE WESTERVILLE, OH 43081']

With:

html = '''<tbody id="plaintiff-body">
                <tr><td><img id="plaimg0001" src="/CaseInformationOnline/images/minus.png" onclick="showhide('pladetail0001','','plaimg0001')"></td><td>CENA PEDRO</td><td></td><td>ELIZABETH R WERNER</td></tr><tr id="pladetail0001" style="" valign="top"><td></td><td>33 W WEISHEIMER RD<br>COLUMBUS, OH 43215</td><td></td><td>THE NIGH LAW GROUP, LLC  <br>300 S. 2ND STREET<br>COLUMBUS, OH 43215<br>(614) 379-6444<br><br>JOSEPH A NIGH<br>THE NIGH LAW GROUP, LLC  <br>300 S. 2ND STREET<br>COLUMBUS, OH 43215<br>(614) 379-6444</td></tr>
                </tbody>'''

Prints:

['ELIZABETH R WERNER', 'THE NIGH LAW GROUP, LLC 300 S. 2ND STREET COLUMBUS, OH 43215 (614) 379-6444']

Solution 2:[2]

Happy to know if there a direct solution with CSS ...

The main issue is that CSS sibling combinators such as br + br ignore every non-element node between elements, including comments, text and whitespace. As far as CSS is concerned, two <br> tags separated by text look exactly like two consecutive ones, so a selector cannot find the empty line for you.

So your idea of using a function to check the tags would also be my approach:
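
A small check can make this concrete. The sketch below uses two made-up cells and the built-in html.parser (an assumption made only to keep the example free of extra dependencies) and counts how many elements the selector br + br matches in each.

```python
from bs4 import BeautifulSoup

# Two hypothetical cells: text between the <br> tags vs. an empty line.
separated = BeautifulSoup("<td>A<br>B<br>C</td>", "html.parser")
adjacent = BeautifulSoup("<td>A<br><br>C</td>", "html.parser")

# The adjacent-sibling combinator only looks at element nodes, so the text
# node "B" does not stop the second <br> from matching.
n_separated = len(separated.select("br + br"))
n_adjacent = len(adjacent.select("br + br"))
print(n_separated, n_adjacent)
```

Both cells yield the same number of matches, which is why the selector alone cannot tell an empty line from an ordinary line break.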

from bs4 import BeautifulSoup
html = '''<tbody id="plaintiff-body">
   <tr>
      <td><img id="plaimg0001" src="/CaseInformationOnline/images/minus.png" onclick="showhide('pladetail0001','','plaimg0001')"></td>
      <td>JENEE BENNETT</td>
      <td></td>
      <td>COURTNEY L HANNA</td>
   </tr>
   <tr id="pladetail0001" style="" valign="top">
      <td></td>
      <td>2348 WOODBROOK CIR N<br>UNIT D<br>COLUMBUS, OH 43223</td>
      <td></td>
      <td>JOSEPH &amp; JOSEPH CO LPA   <br>SUITE 200<br>155 W MAIN ST<br>COLUMBUS, OH 43215<br>(614) 449-8282<br><br>DEBORAH L MCNINCH<br>JOSEPH &amp; JOSEPH CO LPA   <br>THE WATERFORD, SUITE 200 <br>155 W MAIN ST<br>COLUMBUS, OH 43215<br>(614) 449-8282<br><br>S K DODDERER<br>155 W MAIN STREET<br>#200<br>COLUMBUS, OH 43215<br>(614) 449-8282</td>
   </tr>
</tbody>'''

soup = BeautifulSoup(html, 'lxml')

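def check(x):
    # Collect the text nodes until two consecutive <br> tags are found.
    s = []
    prev = None
    for node in x:
        name = getattr(node, 'name', None)
        if name == 'br' and getattr(prev, 'name', None) == 'br':
            break
        if name is None:
            # text node: keep it without surrounding whitespace
            s.append(str(node).strip())
        prev = node
    # Tracking the previous node (instead of zipping pairs) keeps the last text node.
    return ' '.join(s)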

att = [check(x.contents) if len(x.contents) > 1 else x.get_text(strip=True) for x in soup.select('#plaintiff-body tr:first-child > td:nth-child(4), #plaintiff-body tr:nth-child(2) > td:last-child')]
print(att)

Output

['COURTNEY L HANNA', 'JOSEPH & JOSEPH CO LPA SUITE 200 155 W MAIN ST COLUMBUS, OH 43215 (614) 449-8282']

Solution 3:[3]

You could stop if two <br> tags are found:

soup = BeautifulSoup(html, 'lxml')
tds = soup.select('#plaintiff-body tr:first-child > td:nth-child(4), #plaintiff-body tr:nth-child(2) > td:last-child')
output = []

for td in tds:
    entry = []
    last_el = None
    
    for el in td.descendants:
        if el.name == 'br':
            # last_el is still None on the first node, so guard before comparing
            if last_el is not None and last_el.name == 'br':
                break
        else:
            entry.append(el.get_text(strip=True))
        
        last_el = el
        
    output.append(' '.join(entry))
    
print(output)
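
The same idea can be packaged as a small helper and tried against a few edge cases. This is only a sketch: the name text_before_blank_line and the sample cells are invented for illustration, it looks at the cell's direct children only, and it treats whitespace-only text between two <br> tags as an empty line.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag


def text_before_blank_line(td):
    """Join the cell's direct text nodes up to the first two consecutive <br> tags."""
    parts = []
    prev_was_br = False
    for node in td.children:
        if isinstance(node, NavigableString):
            text = node.strip()
            if not text:
                # whitespace-only text does not separate two <br> tags
                continue
            parts.append(text)
            prev_was_br = False
            continue
        is_br = isinstance(node, Tag) and node.name == "br"
        if is_br and prev_was_br:
            break
        prev_was_br = is_br
    return " ".join(parts)


# Hypothetical cells covering the cases that tend to break naive loops.
samples = {
    "two_blocks": "<td>A  <br>B<br><br>C<br>D</td>",
    "one_block": "<td>62 KEENE DRIVE<br>WESTERVILLE, OH 43081</td>",
    "leading_br": "<td><br>A<br>B</td>",
    "empty": "<td></td>",
}
results = {
    name: text_before_blank_line(BeautifulSoup(markup, "html.parser").td)
    for name, markup in samples.items()
}
print(results)
```

Using isinstance checks rather than the name attribute keeps the distinction between tags and text explicit, at the cost of one extra import.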

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Andrej Kesely
Solution 2 HedgeHog
Solution 3