Get HTML using Python requests?
I am trying to teach myself some basic web scraping. Using Python's requests module, I was able to grab html for various websites until I tried this:
>>> r = requests.get('http://www.wrcc.dri.edu/WRCCWrappers.py?sodxtrmts+028815+por+por+pcpn+none+mave+5+01+F')
Instead of the basic html that is the source for this page, I get:
>>> r.text
'\x1f\ufffd\x08\x00\x00\x00\x00\x00\x00\x03\ufffd]o\u06f8\x12\ufffd\ufffd\ufffd+\ufffd]...
>>> r.content
b'\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03\xed\x9d]o\xdb\xb8\x12\x86\xef\xfb+\x88]\x14h...
I have tried many combinations of get/post with every syntax I can guess from the documentation and from SO and other examples. I don't understand what I am seeing above, haven't been able to turn it into anything I can read, and can't figure out how to get what I actually want. My question is, how do I get the html for the above page?
Solution 1:[1]
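The server's HTTP headers for this URL have since been fixed, so requests now decompresses the response automatically. The bytes in the question start with \x1f\x8b, the gzip magic number, which suggests the body was gzip-compressed but was not labeled as such at the time.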
>>> import requests
>>> print requests.__version__
2.5.1
>>> r = requests.get('http://www.wrcc.dri.edu/WRCCWrappers.py?sodxtrmts+028815+por+por+pcpn+none+mave+5+01+F')
>>> r.text[:100]
u'\n<!DOCTYPE html>\n<HTML>\n<HEAD><TITLE>Monthly Average of Precipitation, Station id: 028815</TITLE></H'
>>> r.headers
{'content-length': '3672', 'content-encoding': 'gzip', 'vary': 'Accept-Encoding', 'keep-alive': 'timeout=5, max=100', 'server': 'Apache', 'connection': 'Keep-Alive', 'date': 'Thu, 12 Feb 2015 18:59:37 GMT', 'content-type': 'text/html; charset=utf-8'}
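If a server ever sends a gzip body without a usable content-encoding header again, the raw bytes can be decompressed by hand. The following is only a minimal sketch under that assumption: `decode_body` and `GZIP_MAGIC` are hypothetical names made up for this illustration, and the sample page is compressed locally so the snippet runs without touching the network.

```python
import gzip

# Every gzip stream starts with these two bytes (compare r.content in the question).
GZIP_MAGIC = b"\x1f\x8b"


def decode_body(raw: bytes, encoding: str = "utf-8") -> str:
    """Return text from a response body, gunzipping it first if it looks like gzip."""
    if raw[:2] == GZIP_MAGIC:
        raw = gzip.decompress(raw)
    return raw.decode(encoding, errors="replace")


# Offline demonstration: compress a small page locally, then decode it.
sample = "<!DOCTYPE html>\n<HTML><HEAD><TITLE>demo</TITLE></HEAD></HTML>"
compressed = gzip.compress(sample.encode("utf-8"))

print(decode_body(compressed))              # gzip bytes are decompressed first
print(decode_body(sample.encode("utf-8")))  # plain bytes pass through unchanged
```

With requests, this would be called as `decode_body(r.content)`. Use `r.content` rather than `r.text`, because `r.text` has already replaced undecodable bytes with `\ufffd` and can no longer be decompressed.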
Solution 2:[2]
I'd solve that problem in a simpler way. Just import the html library to decode HTML special characters (entities such as &amp;amp;):
import html
import requests
r = requests.get('http://www.wrcc.dri.edu/WRCCWrappers.py?sodxtrmts+028815+por+por+pcpn+none+mave+5+01+F')
print(html.unescape(r.text))
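To make the scope of `html.unescape` concrete, here is a small offline sketch. The sample strings are invented for illustration. It shows that the function converts character references back to characters and leaves text without `&` untouched, so on its own it would probably not help with the compressed output shown in the question.

```python
import html

# Character references are converted back to the characters they stand for.
escaped = "&lt;p&gt;Fish &amp; chips&lt;/p&gt;"
print(html.unescape(escaped))

# A string with no entities, such as the garbled r.text in the question,
# comes back unchanged.
binary_like = "\x1f\ufffd\x08"
print(html.unescape(binary_like) == binary_like)
```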
Solution 3:[3]
Here is an example using the BeautifulSoup library. It "makes it easy to scrape information from web pages."
from bs4 import BeautifulSoup
import requests
# request web page
resp = requests.get("http://example.com")
# get the response text. in this case it is HTML
html = resp.text
# parse the HTML
soup = BeautifulSoup(html, "html.parser")
# print the visible text of the page body
print(soup.body.get_text().strip())
And the result:
Example Domain
This domain is for use in illustrative examples in documents. You may use this
domain in literature without prior coordination or asking for permission.
More information...
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Grant |
| Solution 2 | Ângelo Polotto |
| Solution 3 | aidanmelen |
