Python. email. Unknown encoding
I'm trying to get the content of the emails in the file 20030228_spam.tar.bz2, which is part of Apache SpamAssassin's public corpus: https://spamassassin.apache.org/old/publiccorpus/
I use tarfile to read the spam files:
    import tarfile
    from email.parser import BytesParser, Parser
    from email.policy import default
    from pathlib import Path

    DATA_DIR = Path('./data')
    FILE_SPAM = Path(DATA_DIR, '20030228_spam.tar.bz2')

    spam_mails = []
    for file_, list_ in [(FILE_SPAM, spam_mails)]:
        with tarfile.open(file_) as tar_file:
            for item in tar_file.getmembers()[:]:
                if item.isfile():
                    mail_bytes = tar_file.extractfile(item).read()
                    mail = BytesParser(policy=default).parsebytes(mail_bytes)
                    list_.append(mail)
I have a problem getting the content of some of the emails:
    list_exeption_index = []
    for i, mail in enumerate(spam_mails):
        if not mail.is_multipart():
            try:
                mail.get_content()
            except Exception as ex:
                list_exeption_index.append(i)
    list_exeption_index
output:
[217, 319, 388, 467]
Four emails raise an exception. Printing each exception:
    for i in list_exeption_index:
        mail = spam_mails[i]
        try:
            mail.get_content()
        except Exception as ex:
            print(ex)
output:
unknown encoding: DEFAULT_CHARSET
unknown encoding: unknown-8bit
unknown encoding: DEFAULT
'multipart/alternative'
What is the problem and how can I overcome it?
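For context, here is a minimal sketch of one possible workaround, not a definitive fix. The first three messages appear to come from `Content-Type` headers that declare a charset Python has no codec for, so the decode inside `get_content()` raises `LookupError`. The fourth looks like a message whose headers declare a multipart type but whose body the parser could not split into parts, so no content handler matches and a `KeyError` (a subclass of `LookupError`) is raised. The helper name `safe_get_text` and the `latin-1` fallback are assumptions made for illustration; they are not part of the `email` package.

```python
from email.parser import BytesParser
from email.policy import default


def safe_get_text(mail, fallback_charset="latin-1"):
    """Return the body of a non-multipart message as str.

    Hypothetical helper: falls back to a lenient manual decode when
    get_content() cannot handle the declared charset or content type.
    """
    try:
        return mail.get_content()
    except LookupError:
        # LookupError covers "unknown encoding: ..." from the codec lookup
        # and the KeyError raised when no content handler matches the type.
        payload = mail.get_payload(decode=True)  # bytes, or None if multipart
        if payload is None:
            return ""
        # Assumption: latin-1 maps every byte to a character, so this decode
        # never fails, although non-Latin text may come out garbled.
        return payload.decode(fallback_charset, errors="replace")


# Small self-made message (not from the dataset) with a bogus charset.
raw = b"Content-Type: text/plain; charset=DEFAULT_CHARSET\r\n\r\nhello\r\n"
mail = BytesParser(policy=default).parsebytes(raw)
print(safe_get_text(mail))
```

Applied to the list above, this would be called as `safe_get_text(spam_mails[i])` for each failing index. Whether `latin-1` is a sensible fallback depends on the actual messages; inspecting `mail.defects` and `mail.get_content_charset()` for each failing mail may show what the sender intended.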
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
