How to clean up pulled data from BeautifulSoup, Pandas, Python

Hello everyone. I have pulled the information I want using BeautifulSoup, but I can't get it printed out cleanly enough to send to pandas and Excel.

html_f ='''
<li class="list-group-item">
<div>
<div class="tyler-toggle-controller open">
<p class="text-primary">
07/01/2022 Date
<span class="caret"> </span>
</p>
</div>
<div class="tyler-toggle-container row-buff" style="display: block; overflow: hidden;">
<p class="col-sm-12 col-md-12">
<span class="text-muted">Comment</span><br>
[1] Comments
</p>
</div>
</div>
</li>'''

My code used to pull the data I want:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_f,'html.parser')
for child in soup.findAll('li',class_='list-group-item')[0]:
    print(child.text)

Here is what it pulls, but it prints with a lot of extra spacing:

        07/01/2022   Date





  Comment
       [1] Comments

Ideally, I only need the top portion (date and File Date) printed out, but at the very least I need help getting it into a list format like:

07/01/2022 Date
Comment
[1] Comments


Solution 1:[1]

Here is my attempt:

doc='''

<li class="list-group-item">
 <div>
  <div class="tyler-toggle-controller open">
   <p class="text-primary">
    07/01/2022 Date
    <span class="caret">
    </span>
   </p>
  </div>
  <div class="tyler-toggle-container row-buff" style="display: block; overflow: hidden;"> 
   <p class="col-sm-12 col-md-12">
    <span class="text-muted">
     Comment
    </span>
    <br/>
    [1] Comments
   </p>
  </div>
 </div>
</li>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(doc, 'html.parser')

text=[' '.join(child.get_text(strip=True).split(' ')).replace(' DateComment[1]',',') for child in soup.find_all('li',class_='list-group-item')]
print(text)

Output:

['07/01/2022, Comments']   
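
A side note on the whitespace handling, as a minimal sketch: splitting on a single space and joining again returns the string unchanged, while `split()` with no argument collapses any run of whitespace. The `raw` string below is an assumed approximation of the output shown in the question, not the exact text.

```python
# Assumed approximation of the messy text printed in the question
raw = "        07/01/2022   Date\n\n\n  Comment\n       [1] Comments\n"

# Splitting on one space and re-joining with one space changes nothing
same = ' '.join(raw.split(' '))
print(same == raw)

# split() with no argument splits on any whitespace run and drops empty pieces
collapsed = ' '.join(raw.split())
print(collapsed)

# To keep one entry per line: normalize each line and skip blank ones
lines = [' '.join(line.split()) for line in raw.splitlines() if line.strip()]
print(lines)
```

This works on any string, so it can be applied to `get_text()` output without relying on a hard-coded `replace()` pattern.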

Alternatively, try one of these:

# Join all matches into a single string
text=' '.join([' '.join(child.get_text(strip=True).split(' ')).replace(' DateComment[1]',',') for child in soup.find_all('li',class_='list-group-item')]).strip()
# Or keep the list and split its single entry into parts
text= [' '.join(child.get_text(strip=True).split(' ')).replace(' DateComment[1]',',') for child in soup.find_all('li',class_='list-group-item')]
final_text= text[0]  # '07/01/2022, Comments'
final_list= [part.strip() for part in text[0].split(',')]  # if you want a list: ['07/01/2022', 'Comments']



  

Solution 2:[2]

To get your information printed as expected in your question, you could use stripped_strings and iterate over its elements:

for e in soup.find_all('li',class_='list-group-item'):
    for t in list(e.stripped_strings):
        print(t)

Note: In new code use find_all() instead of old syntax findAll().

Example

html='''
<li class="list-group-item">
 <div>
  <div class="tyler-toggle-controller open">
   <p class="text-primary">
    07/01/2022 Date
    <span class="caret">
    </span>
   </p>
  </div>
  <div class="tyler-toggle-container row-buff" style="display: block; overflow: hidden;"> 
   <p class="col-sm-12 col-md-12">
    <span class="text-muted">
     Comment
    </span>
    <br/>
    [1] Comments
   </p>
  </div>
 </div>
</li>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')

for e in soup.find_all('li',class_='list-group-item'):
    for t in list(e.stripped_strings):
        print(t)

Output

07/01/2022 Date
Comment
[1] Comments
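
If a list is needed rather than printed lines, the same generator can be collected directly. This is a small sketch using a shortened copy of the markup from the question; the names `rows` and `first_lines` are only illustrative.

```python
from bs4 import BeautifulSoup

# Shortened copy of the markup from the question
html = '''
<li class="list-group-item">
 <div>
  <p class="text-primary">
   07/01/2022 Date
   <span class="caret"> </span>
  </p>
  <p class="col-sm-12 col-md-12">
   <span class="text-muted">Comment</span><br/>
   [1] Comments
  </p>
 </div>
</li>
'''
soup = BeautifulSoup(html, 'html.parser')

# One list of cleaned strings per <li>; whitespace-only strings are skipped
rows = [list(li.stripped_strings)
        for li in soup.find_all('li', class_='list-group-item')]
print(rows)

# Only the first entry (the date line) of each item
first_lines = [row[0] for row in rows]
print(first_lines)
```

Taking `row[0]` assumes the date line always comes first in each item, which holds for the sample but may not hold for every page.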

Since you mention pandas, you could also pick out each piece of information, clean it up, and append it to a list of dicts:

import pandas as pd

data = []
for e in soup.find_all('li',class_='list-group-item'):
    data.append({
        'date': e.p.text.strip().replace(' Date',''),
        'comment': e.select_one('.tyler-toggle-container br').next_sibling.strip()
    })
pd.DataFrame(data)

or

data = [{
    'date':soup.select_one('li.list-group-item .text-primary').text.strip().replace(' Date',''),
    'comment':soup.select_one('li.list-group-item .tyler-toggle-container br').next_sibling.strip()
}]

Output

date        comment
07/01/2022  [1] Comments
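
Since the question also mentions Excel, here is a rough sketch of the export step. It assumes the list of dicts has the shape shown above; the file name `output.xlsx` is hypothetical, and `to_excel` needs an Excel engine such as openpyxl installed, so the runnable part writes CSV to an in-memory buffer instead.

```python
import io

import pandas as pd

# Row copied from the output above; normally this list comes from the scraping loop
data = [{'date': '07/01/2022', 'comment': '[1] Comments'}]
df = pd.DataFrame(data)

# Write CSV into an in-memory buffer (index=False leaves out the row numbers)
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)

# For a spreadsheet file instead (requires an Excel engine such as openpyxl):
# df.to_excel('output.xlsx', index=False)  # 'output.xlsx' is a hypothetical file name
```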

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2