How should I scrape all <em> tag innertexts within a <ul> and make them into a pandas dataframe?
I am trying to scrape some information from a website.
The information I want is nested as ul > li > em. I have scraped tables before, but never lists.
How should I scrape it?
In addition, is there a way to collect the inner text of every <em> and put it in a DataFrame?
The <ul> looks roughly like this:
<ul class="reportData">
<li><em>2015-12-28</em></li>
<li><em>2015-12-28</em></li>
......
<li><em>2015-12-28</em></li>
<li><em>2015-12-28</em></li>
<li><em>2015-12-28</em></li>
</ul>
Solution 1:[1]
Select your <ul> and, in this case, use stripped_strings to get all of its text. It returns a generator, so wrap it in list() if you need an actual list:
data = soup.select_one('ul.reportData').stripped_strings
or, to be more specific, use a list comprehension over all <em> elements:
data = [e.text for e in soup.select('ul.reportData em')]
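The two approaches give the same result for the markup in the question, but they can diverge as soon as a list item holds text outside its <em>. Here is a minimal sketch of that difference; the "Filed:" label is a hypothetical addition and is not part of the original page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the list from the question, plus a label outside each <em>
html = '''
<ul class="reportData">
<li>Filed: <em>2015-12-28</em></li>
<li>Filed: <em>2015-12-29</em></li>
</ul>
'''

soup = BeautifulSoup(html, 'html.parser')

# stripped_strings yields every non-empty text node under the <ul>, labels included
all_text = list(soup.select_one('ul.reportData').stripped_strings)

# The CSS selector restricts the result to the <em> elements only
em_text = [e.get_text(strip=True) for e in soup.select('ul.reportData em')]

print(all_text)
print(em_text)
```

If the list items contain anything besides the <em>, the selector-based version is the safer choice.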
Example
import pandas as pd
from bs4 import BeautifulSoup
html='''
<ul class="reportData">
<li><em>2015-12-28</em></li>
<li><em>2015-12-28</em></li>
<li><em>2015-12-28</em></li>
<li><em>2015-12-28</em></li>
<li><em>2015-12-28</em></li>
</ul>
'''
soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly to avoid bs4's warning
data = soup.select_one('ul.reportData').stripped_strings
pd.DataFrame(data, columns=['date'])
Output
| date |
|---|
| 2015-12-28 |
| 2015-12-28 |
| 2015-12-28 |
| 2015-12-28 |
| 2015-12-28 |
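The scraped values arrive as plain strings. If the column is meant to be used as dates, one possible follow-up is to convert it with pd.to_datetime. This is only a sketch, and it assumes every <em> holds a date in YYYY-MM-DD form, as in the sample above.

```python
import pandas as pd
from bs4 import BeautifulSoup

html = '''
<ul class="reportData">
<li><em>2015-12-28</em></li>
<li><em>2015-12-29</em></li>
</ul>
'''

soup = BeautifulSoup(html, 'html.parser')
dates = [e.get_text(strip=True) for e in soup.select('ul.reportData em')]
df = pd.DataFrame(dates, columns=['date'])

# Assumption: all values are ISO-style dates; a malformed value raises ValueError
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d')

print(df.dtypes)
```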
Solution 2:[2]
find_all returns a list-like ResultSet of tags. Extract the text of each tag with get_text() and you can pass the resulting list directly to pandas:
from bs4 import BeautifulSoup
import pandas as pd
html = '''<ul class="reportData">
<li><em>2015-12-28</em></li>
<li><em>2015-12-28</em></li>
<li><em>2015-12-28</em></li>
<li><em>2015-12-28</em></li>
<li><em>2015-12-28</em></li>
</ul>'''
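soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly to avoid bs4's warning
# Narrow the search to the target <ul>, then collect the text of each <em>
ems = soup.find('ul', class_='reportData').find_all('em')
df = pd.DataFrame([em.get_text() for em in ems], columns=['date'])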
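On a real page, find returns None when no matching <ul> exists, and the chained find_all call then raises AttributeError. The sketch below wraps the same logic in a small helper with a guard; the function name em_texts_to_frame and its parameters are hypothetical, chosen only for illustration.

```python
import pandas as pd
from bs4 import BeautifulSoup


def em_texts_to_frame(html, ul_class='reportData', column='date'):
    """Return a one-column DataFrame of the <em> texts inside <ul class=ul_class>."""
    soup = BeautifulSoup(html, 'html.parser')
    ul = soup.find('ul', class_=ul_class)
    if ul is None:
        # No matching list on the page: return an empty frame with the same column
        return pd.DataFrame(columns=[column])
    texts = [em.get_text(strip=True) for em in ul.find_all('em')]
    return pd.DataFrame(texts, columns=[column])


html = '''<ul class="reportData">
<li><em>2015-12-28</em></li>
<li><em>2015-12-29</em></li>
</ul>'''

df = em_texts_to_frame(html)
missing = em_texts_to_frame('<p>no list here</p>')
print(df)
print(missing.empty)
```

Returning an empty DataFrame instead of raising is a design choice; raising a descriptive error may be preferable if a missing list means the page layout changed.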
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | RJ Adriaansen |
