Extract data from webpage with Python
For training purposes, I am trying to extract the line:
<td>2017/01/15</td>
from the following webpage (inspect element preview):
<div class="bodyy">
<div id="FullPart">
<p class="d_intro">
<table id="ldeface" cellpadding="0" cellspacing="0">
<tbody><tr>
<td class="dtime">Date</td>
<td class="datt">Notifier</td>
<td class="dHMR">H</td>
<td class="dHMR">M</td>
<td class="dHMR">R</td>
<td class="dhMR">L</td>
<td class="dR"><img src="/images/star.gif" border="0"></td>
<td class="dDom">Domain</td>
<td class="dos">OS</td>
<td class="dview">View</td>
</tr>
<tr>
<td>2017/02/10</td>
<td><a href="/testarchive/</a></td>
<td></td>
<td></td>
<td></td>
I am confused about how to get the td elements, and about which attributes (class or id) I should use to fetch the correct information with BeautifulSoup. Thanks in advance.
Solution 1:[1]
For your example, you can use the following:
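from bs4 import BeautifulSoup
# your_html_source: the page's HTML as a string
soup = BeautifulSoup(your_html_source, 'html.parser')
for table in soup.find_all('table'):
    tr = table.find_all('tr')[1]  # second row, i.e. the first row after the header
    td = tr.find_all('td')[0].text  # text of the first cell in that row
    print(td)  # prints 2017/02/10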
If you want the whole element, <td>2017/02/10</td>, rather than just its text, drop .text from the line that assigns td.
BeautifulSoup4 also has good documentation.
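Since the table in the question has an id, the lookup can also be narrowed to that one table rather than looping over every table on the page. The following is a minimal sketch, assuming the page's HTML is available as a string. SAMPLE_HTML is a shortened, cleaned-up stand-in for the markup in the question, and first_date_cell is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the question's markup (assumption: the real
# page has more columns and more rows).
SAMPLE_HTML = """
<table id="ldeface" cellpadding="0" cellspacing="0">
<tbody><tr>
<td class="dtime">Date</td>
<td class="datt">Notifier</td>
</tr>
<tr>
<td>2017/02/10</td>
<td></td>
</tr>
</tbody></table>
"""


def first_date_cell(html):
    """Return the first <td> of the first data row of table#ldeface, or None."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", id="ldeface")
    if table is None:
        return None
    rows = table.find_all("tr")
    if len(rows) < 2:  # only a header row, or no rows at all
        return None
    return rows[1].find("td")


cell = first_date_cell(SAMPLE_HTML)
print(cell)       # the whole tag
print(cell.text)  # only the text inside it
```

Returning None when the table or the row is missing avoids the IndexError that a bare [1] raises on pages without a data row.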
Solution 2:[2]
Gather The Data:
To get the data to process, you can use urllib2 on Python 2; on Python 3 its replacement is urllib.request, used here:
import urllib.request
resource = urllib.request.urlopen("http://www.somewebsite.com/somepage")
html = resource.read()
# assuming html is the example with a few more rows in the table
Process the Data:
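from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "lxml")  # the "lxml" parser requires the lxml package
for table in soup.find_all("table"):
    # .get() avoids a KeyError on tables that have no id attribute
    if table.get("id") == "ldeface":
        rows = table.find_all("tr")
        header = rows[0]
        # index of the column whose header cell reads "Date"
        date_col = [i for i, col in enumerate(header.find_all("td")) if col.text == "Date"][0]
        for row in rows[1:]:
            print(row.find_all("td")[date_col].text)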
Result:
2017/02/10
2017/02/11
2017/03/10
...
You can extract other columns in the same way: locate them by the text in the header cell, by an id attribute (as I did for the table), or by a class attribute.
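As a sketch of that idea, each data row can be turned into a dictionary keyed by the header text, so any column can be read by name. The sample HTML (including the made-up Notifier and OS values) and the helper name rows_as_dicts are assumptions for illustration. html.parser is used here so that nothing beyond bs4 is needed.

```python
from bs4 import BeautifulSoup

# Made-up rows in the shape of the question's table (assumption).
SAMPLE_HTML = """
<table id="ldeface">
<tr><td class="dtime">Date</td><td class="datt">Notifier</td><td class="dos">OS</td></tr>
<tr><td>2017/02/10</td><td>alice</td><td>Linux</td></tr>
<tr><td>2017/02/11</td><td>bob</td><td>Win 2012</td></tr>
</table>
"""


def rows_as_dicts(html, table_id="ldeface"):
    """Map every data row to {header text: cell text}."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", id=table_id)
    if table is None:
        return []
    rows = table.find_all("tr")
    if not rows:
        return []
    headers = [cell.get_text(strip=True) for cell in rows[0].find_all("td")]
    result = []
    for row in rows[1:]:
        values = [cell.get_text(strip=True) for cell in row.find_all("td")]
        result.append(dict(zip(headers, values)))
    return result


records = rows_as_dicts(SAMPLE_HTML)
for record in records:
    print(record["Date"], record["OS"])

# A header cell can also be located by its class attribute instead of its text.
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
date_header = soup.find("td", class_="dtime")
print(date_header.text)
```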
Solution 3:[3]
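from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
for tag in soup.find_all():
    elem = str(tag)
    # "<td>2017/02/10</td>" splits on "/" into 4 parts:
    # two slashes in the date plus one in the closing tag
    if tag.name == 'td' and len(elem.split('/')) == 4:
        print(tag.text)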
Go through all the tags; if a tag is a td and its markup contains the right number of slashes, print its text.
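Counting slashes also matches any other cell whose markup happens to contain the same number of slashes. A stricter variant, offered only as a sketch, is to test the cell text against the date format itself. The %Y/%m/%d format is inferred from the sample value 2017/02/10, and date_cells is a hypothetical helper name.

```python
from datetime import datetime

from bs4 import BeautifulSoup


def date_cells(html, fmt="%Y/%m/%d"):
    """Return the text of every <td> whose content parses as a date in fmt."""
    soup = BeautifulSoup(html, "html.parser")
    found = []
    for cell in soup.find_all("td"):
        text = cell.get_text(strip=True)
        try:
            datetime.strptime(text, fmt)
        except ValueError:
            continue  # not a date in the expected format
        found.append(text)
    return found


# "a/b/c" has the same number of slashes as a date but is not one.
sample = (
    "<table><tr><td>Date</td><td>a/b/c</td></tr>"
    "<tr><td>2017/02/10</td><td></td></tr></table>"
)
print(date_cells(sample))
```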
Solution 4:[4]
You can install urllib3 and bs4 via pip:
python3 -m pip install urllib3
python3 -m pip install beautifulsoup4
After that, the Python code should look something like this:
#!/usr/bin/python3
# -*- coding: UTF-8 -*-
import urllib3
from bs4 import BeautifulSoup
http = urllib3.PoolManager()
page = http.request("GET", "http://HOST.TLD/CONTENT")
if page.status == 200:
    html = page.data
    soup = BeautifulSoup(html, "html.parser")
    h1 = soup.find('h1')  # we are interested only in the first <h1>
    print("H1: %s" % (h1))
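The snippet above prints the first h1 on the page. To pull the date cells from the question's table instead, the same kind of soup object could be queried with a CSS selector. This is a sketch under assumptions: SAMPLE_BYTES stands in for page.data (urllib3 returns the response body as bytes), and the selector relies on the table id ldeface from the question.

```python
from bs4 import BeautifulSoup

# Stand-in for page.data; a shortened version of the question's table (assumption).
SAMPLE_BYTES = b"""
<table id="ldeface">
<tbody><tr><td class="dtime">Date</td><td class="dos">OS</td></tr>
<tr><td>2017/02/10</td><td></td></tr>
</tbody></table>
"""

soup = BeautifulSoup(SAMPLE_BYTES, "html.parser")
# First cell of every row after the header row of table#ldeface.
cells = soup.select("table#ldeface tr:nth-of-type(n+2) > td:first-child")
dates = [cell.get_text(strip=True) for cell in cells]
print(dates)
```

BeautifulSoup accepts bytes directly, so page.data does not need to be decoded first.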
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | Robb |
| Solution 3 | whackamadoodle3000 |
| Solution 4 | flowtron |
