How to get Python to tolerate UTF-8 encoding errors

I have a set of UTF-8 texts I have scraped from web pages. I am trying to extract keywords from these files like so:

import os
import json
from rake_nltk import Rake
rake_nltk_var = Rake()
directory = 'files'
results = {}
for filename in os.scandir(directory):
    if filename.is_file():
        with open("files/" + filename.name, encoding="utf-8", mode = 'r') as infile:
            text = infile.read()
        rake_nltk_var.extract_keywords_from_text(text)
        keyword_extracted = rake_nltk_var.get_ranked_phrases()
        results[filename.name] = keyword_extracted
with open("extracted-keywords.json", "w") as outfile:
    json.dump(results, outfile)

One of the files I've managed to process so far is throwing the following error on read:

Traceback (most recent call last):
  File "extract-keywords.py", line 11, in <module>
    text = infile.read()
  File "c:\python36\lib\codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 66: invalid start byte

0x92 is a right single quotation mark, but the 66th char of the file is a "u" so IDK where this error is coming from. Regardless, is there some way to make the codec tolerate such encoding errors? For example, Perl simply substitutes a question mark for any character it can't decode. Is there some way to get Python to do the same? I have a lot of files and can't afford to stop and debug every encoding error they might contain.

Solution 1:^[1]

"I have a set of UTF-8 texts I have scraped from web pages"

If they can't be read with the script you've shown, then these are not actually UTF-8 encoded files.
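One way to check this claim on a suspect file is to decode its raw bytes and look at where the decoder gives up. The sketch below is only illustrative: `find_utf8_error` is a hypothetical helper name and the sample bytes are made up. Note that the position reported by `UnicodeDecodeError` is a byte offset, so it will not line up with the character count once any multi-byte character comes before it, which may explain why the 66th character of the file looks innocent.

```python
def find_utf8_error(data: bytes):
    """Return (byte_offset, nearby_bytes) for the first invalid UTF-8
    sequence in `data`, or None if the data decodes cleanly."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        # exc.start / exc.end are byte offsets into `data`, not character indexes.
        lo = max(exc.start - 10, 0)
        return exc.start, data[lo:exc.end + 10]
    return None


# Made-up sample: valid UTF-8 (including a two-byte "é") followed by a
# stray cp1252 right single quote (byte 0x92).
sample = "caf\u00e9 menu: today".encode("utf-8") + b"\x92s special"
print(find_utf8_error(sample))
```

For a real file, read it with `open(path, "rb")` and pass the bytes to the helper; the returned slice shows the raw bytes around the failure.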

To pick the correct decoder, we would need to know how the files were written in the first place. However, the ’ character is byte 0x92 in code page 1252, so try that encoding instead, i.e.:

with open("files/" + filename.name, encoding="cp1252") as infile:
    text = infile.read()
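If the collection is a mix of genuine UTF-8 files and cp1252 files, one option is to try the encodings in order. This is a rough sketch, not a definitive recipe: `decode_with_fallback` is a hypothetical helper, and the order "UTF-8 first, then cp1252" is an assumption based on the guess above. A few bytes (0x81, for example) are undefined in Python's cp1252 codec, so the last step falls back to replacing errors.

```python
def decode_with_fallback(data: bytes, encodings=("utf-8", "cp1252")) -> str:
    """Decode `data` with the first encoding that accepts it.

    UTF-8 is tried first because it is strict: text in another encoding
    rarely happens to be valid UTF-8.
    """
    for encoding in encodings:
        try:
            return data.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Last resort: keep going, marking undecodable bytes with U+FFFD.
    return data.decode(encodings[0], errors="replace")


print(decode_with_fallback(b"it\x92s"))                  # decoded as cp1252
print(decode_with_fallback("na\u00efve".encode("utf-8")))  # decoded as UTF-8
```

To use it in the loop from the question, open each file in binary mode (`open(path, "rb")`) and pass `infile.read()` to the helper.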

Ignoring decoding errors corrupts the data, so it's best to use the correct decoder when possible. Try that first! However, about this part of the question:

"Regardless, is there some way to make the codec tolerate such encoding errors? For example, Perl simply substitutes a question mark for any character it can't decode. Is there some way to get Python to do the same?"

Yes, you can specify errors="replace"

>>> with open("/tmp/f.txt", "w", encoding="cp1252") as f:
...     f.write('this is a right quote: \N{RIGHT SINGLE QUOTATION MARK}')
... 

>>> with open("/tmp/f.txt", encoding="cp1252") as f:
...     print(f.read())  # using correct encoding
... 
this is a right quote: ’

>>> with open("/tmp/f.txt", encoding="utf-8", errors="replace") as f:
...     print(f.read())  # using incorrect encoding and replacing errors
this is a right quote: ?
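Besides "replace", the standard error handlers "ignore" and "backslashreplace" also work when decoding. Here is a small comparison on the same bytes, decoded directly with `bytes.decode` instead of through a file; the variable names are only for illustration.

```python
# The same text as above, encoded as cp1252 (the quote becomes byte 0x92).
raw = "this is a right quote: \N{RIGHT SINGLE QUOTATION MARK}".encode("cp1252")

replaced = raw.decode("utf-8", errors="replace")          # bad byte -> U+FFFD
ignored = raw.decode("utf-8", errors="ignore")            # bad byte dropped
escaped = raw.decode("utf-8", errors="backslashreplace")  # bad byte -> \x92

for label, text in [("replace", replaced), ("ignore", ignored), ("backslashreplace", escaped)]:
    print(f"{label:>16}: {text!r}")
```

The same `errors=` argument can be passed to `open()` in the loop from the question. "backslashreplace" keeps the original byte value visible, which can help if you want to track down the real encoding later.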

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution	Source
Solution 1
