UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
I am trying to build a crawler in Python by following a Udacity course. I have this method get_page() which returns the content of a page.
from urllib.request import urlopen

def get_page(url):
    '''
    Open the given url and return the content of the page.
    '''
    data = urlopen(url)
    html = data.read()
    return html.decode('utf8')
The original method just returned data.read(), but that way I could not use string operations like str.find(). After a quick search I found out that I need to decode the data, but now I get this error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
I have found similar questions on SO, but none of them address this specifically. Please help.
Solution 1:[1]
You are trying to decode invalid bytes.
In UTF-8 the first byte of a character is either in the range 0x00 to 0x7F (a single-byte ASCII character) or in the range 0xC2 to 0xF4 (the lead byte of a multi-byte sequence). Bytes in the range 0x80 to 0xBF are continuation bytes, which may only appear after a lead byte.
So 0x8B can never start a character, which is why the codec reports "invalid start byte".
From RFC3629 Section 3:
In UTF-8, characters from the U+0000..U+10FFFF range (the UTF-16 accessible range) are encoded using sequences of 1 to 4 octets. The only octet of a "sequence" of one has the higher-order bit set to 0, the remaining 7 bits being used to encode the character number.
You should post the string you are trying to decode.
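As a quick illustration, decoding a bare 0x8B byte reproduces the exact error from the question:

```python
# 0x8B is a UTF-8 continuation byte, so it can never start a character.
try:
    b'\x8b'.decode('utf-8')
except UnicodeDecodeError as exc:
    print(exc.reason)  # invalid start byte
```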
Solution 2:[2]
Maybe the page is encoded with a character encoding other than 'utf-8', so the start byte is invalid. You could do this:
import urllib.request

def get_page(url):
    if url is None:
        return None
    response = urllib.request.urlopen(url)
    if response.getcode() != 200:
        print("Http code:", response.getcode())
        return None
    # Read the body once: the response stream cannot be read twice.
    raw = response.read()
    try:
        return raw.decode('utf-8')
    except UnicodeDecodeError:
        return raw
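A common generalisation of this try/except idea, shown here only as a sketch, is to walk a list of candidate encodings and fall back to a lossy decode as a last resort (the particular encoding names below are plausible defaults of mine, not anything mandated by the original answer):

```python
def decode_best_effort(raw, encodings=('utf-8', 'cp1252', 'latin-1')):
    """Try each candidate encoding in order and return the first success."""
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    # Last resort: replace undecodable bytes instead of raising.
    return raw.decode('utf-8', errors='replace')

print(decode_best_effort('café'.encode('utf-8')))   # café
print(decode_best_effort('café'.encode('cp1252')))  # café
```

Since latin-1 accepts every byte value, placing it last in the tuple guarantees the loop itself always returns a string.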
Solution 3:[3]
Web servers often serve HTML pages with a Content-Type header that includes the encoding used to encode the page. The header might look like this:
Content-Type: text/html; charset=UTF-8
We can inspect the content of this header to find the encoding to use to decode the page:
from urllib.request import urlopen

def get_page(url):
    """ Open the given url and return the content of the page."""
    data = urlopen(url)
    content_type = data.headers.get('content-type', '')
    print(f'{content_type=}')
    encoding = 'latin-1'
    if 'charset' in content_type:
        _, _, encoding = content_type.rpartition('=')
    print(f'{encoding=}')
    html = data.read()
    return html.decode(encoding)
Using requests is similar:

import requests

response = requests.get(url)
content_type = response.headers.get('content-type', '')
Latin-1 (or ISO-8859-1) is a safe default: it will always decode any bytes (though the result may not be useful).
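The claim that Latin-1 can decode anything is easy to verify: every one of the 256 possible byte values maps to a code point, and the mapping round-trips losslessly:

```python
raw = bytes(range(256))                # every possible byte value
text = raw.decode('latin-1')           # never raises
assert text.encode('latin-1') == raw   # lossless round-trip
print(len(text))  # 256
```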
If the server doesn't serve a content-type header you can try looking for a <meta> tag that specifies the encoding in the HTML. Or pass the response bytes to Beautiful Soup and let it try to guess the encoding.
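For the <meta> route, here is one stdlib-only sketch (the CharsetFinder class is illustrative, not part of any library): decode the raw bytes leniently with latin-1 just to scan the markup, then re-decode with whatever charset the page declares.

```python
from html.parser import HTMLParser

class CharsetFinder(HTMLParser):
    """Record the first encoding declared in <meta charset=...> or in
    <meta http-equiv="Content-Type" content="...; charset=...">."""
    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != 'meta' or self.charset:
            return
        attrs = dict(attrs)
        if 'charset' in attrs:
            self.charset = attrs['charset']
        elif attrs.get('http-equiv', '').lower() == 'content-type':
            content = attrs.get('content', '')
            if 'charset=' in content:
                self.charset = content.rpartition('charset=')[2].strip()

html_bytes = b'<head><meta charset="iso-8859-2"></head>'
finder = CharsetFinder()
# latin-1 never fails, so it is safe for this first, scan-only pass.
finder.feed(html_bytes.decode('latin-1'))
print(finder.charset)  # iso-8859-2
```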
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Community |
| Solution 2 | Qi Liu |
| Solution 3 | |
