UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 2893: invalid start byte

I'm trying to open a series of HTML files to extract the text from their bodies using BeautifulSoup. I have about 435 files to process, but I keep getting this error.

I've tried converting the HTML files to text and opening the text files but I get the same error...

import os

path = "./Bitcoin"
for file in os.listdir(path):
    with open(os.path.join(path, file), "r") as fname:
        txt = fname.read()

I want to get the source code of each HTML file so I can parse it with Beautiful Soup, but I get this error:

---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-133-f32d00599677> in <module>
      3 for file in os.listdir(path):
      4     with open(os.path.join(path, file), "r") as fname:
----> 5         txt = fname.read()

~/anaconda3/lib/python3.7/codecs.py in decode(self, input, final)
    320         # decode input (taking the buffer into account)
    321         data = self.buffer + input
--> 322         (result, consumed) = self._buffer_decode(data, self.errors, final)
    323         # keep undecoded input until the next call
    324         self.buffer = data[consumed:]

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 2893: invalid start byte


Solution 1:[1]

There are various approaches to dealing with text data in unknown encodings. However, in this case, as you intend to pass the data to Beautiful Soup, the solution is simple: don't bother trying to decode the file yourself, let Beautiful Soup do it. Beautiful Soup will automatically decode bytes to unicode.

In your current code, you read the file in text mode, which means that Python decodes it with the platform's default encoding (UTF-8 on your system, as the traceback shows) unless you pass an encoding argument to the open function. This causes an error if the file's contents are not valid UTF-8.

for file in os.listdir(path):
    with open(os.path.join(path, file), "r") as fname:
        txt = fname.read()
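
To see the failure in isolation, here is a minimal sketch that writes a few cp1252-encoded bytes to a temporary file and reads them back in text mode and in binary mode. The sample text and the temporary file are made up for illustration. The encoding="utf-8" argument is passed explicitly so the result does not depend on the platform default.

```python
import os
import tempfile

# Hypothetical sample: "it’s" encoded as cp1252, so the apostrophe is byte 0x92.
sample = "it\u2019s".encode("cp1252")

with tempfile.NamedTemporaryFile(delete=False, suffix=".html") as tmp:
    tmp.write(sample)
    tmp_path = tmp.name

try:
    # Text mode: Python decodes while reading, and 0x92 is not valid UTF-8 here.
    try:
        with open(tmp_path, "r", encoding="utf-8") as fh:
            fh.read()
        text_mode_error = None
    except UnicodeDecodeError as exc:
        text_mode_error = exc

    # Binary mode: no decoding takes place, so reading cannot raise a decode error.
    with open(tmp_path, "rb") as fh:
        raw = fh.read()
finally:
    os.remove(tmp_path)

print(text_mode_error)
print(raw)
```

The text-mode read fails at the 0x92 byte, while the binary read returns the bytes untouched.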

Instead, read the HTML files in binary mode and pass the resulting bytes instance to Beautiful Soup.

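from bs4 import BeautifulSoup

for file in os.listdir(path):
    with open(os.path.join(path, file), "rb") as fname:
        bytes_ = fname.read()
    # Parse inside the loop so every file is processed, not just the last one.
    soup = BeautifulSoup(bytes_, "html.parser")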

FWIW, the file currently causing your problem is probably encoded with cp1252 or a similar Windows 8-bit encoding. In cp1252, byte 0x92 is the right single quotation mark:

>>> '’'.encode('cp1252')
b'\x92'
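
If you ever need the decoded text yourself rather than handing bytes to Beautiful Soup, one of the "various approaches" mentioned above is to try UTF-8 first and fall back to cp1252. The following is only a sketch under that assumption: the helper name decode_with_fallback and the order of encodings are illustrative choices, not part of Python or Beautiful Soup, and the guess may be wrong for files in other encodings.

```python
def decode_with_fallback(data: bytes, encodings=("utf-8", "cp1252")) -> str:
    """Hypothetical helper: return data decoded with the first encoding that works."""
    for enc in encodings:
        try:
            return data.decode(enc)
        except UnicodeDecodeError:
            continue
    # Last resort: cp1252 leaves a few bytes undefined, so substitute
    # the replacement character for those instead of raising.
    return data.decode(encodings[-1], errors="replace")


# Valid UTF-8 is decoded as UTF-8; the 0x92 byte falls through to cp1252.
utf8_text = decode_with_fallback("caf\u00e9".encode("utf-8"))
cp1252_text = decode_with_fallback(b"it\x92s")
print(utf8_text, cp1252_text)
```

Trying UTF-8 first is deliberate: almost any byte sequence decodes under an 8-bit encoding such as cp1252, so trying it first would silently mis-decode files that really are UTF-8.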

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 snakecharmerb