Python. email. Unknown encoding

I'm trying to get the content of the emails in this file: 20030228_spam.tar.bz2. It is one of Apache SpamAssassin’s public datasets: https://spamassassin.apache.org/old/publiccorpus/

I use the tarfile module to read the spam files:

import tarfile
from email.parser import BytesParser
from email.policy import default
from pathlib import Path

DATA_DIR = Path('./data')
FILE_SPAM = Path(DATA_DIR, '20030228_spam.tar.bz2')

spam_mails = []

for file_, list_ in [(FILE_SPAM, spam_mails)]:
    with tarfile.open(file_) as tar_file:
        for item in tar_file.getmembers():
            if item.isfile():
                mail_bytes = tar_file.extractfile(item).read()
                mail = BytesParser(policy=default).parsebytes(mail_bytes)
                list_.append(mail)

I have a problem getting the content of some of the emails:

list_exeption_index = []

for i, mail in enumerate(spam_mails):
    if not mail.is_multipart():
        try:
            mail.get_content()
        except Exception as ex:
            list_exeption_index.append(i)
list_exeption_index

output:
[217, 319, 388, 467]

So 4 emails raise an exception. Printing the exceptions:

for i in list_exeption_index:
    mail = spam_mails[i]
    try:
        mail.get_content()
    except Exception as ex:
        print(ex)

output:
    unknown encoding: DEFAULT_CHARSET
    unknown encoding: unknown-8bit
    unknown encoding: DEFAULT
    'multipart/alternative'
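
To narrow this down, it may help to print the exception type together with the content type and charset that each message declares. This is a minimal sketch; `describe_failure` is a hypothetical helper name, not part of the `email` package. The first three messages look like unknown-codec errors, while the last one looks like a `KeyError` raised because no content handler exists for the declared type.

```python
from email import message_from_bytes
from email.policy import default


def describe_failure(mail):
    """Return (exception name, content type, declared charset), or None if get_content() works.

    Hypothetical helper for diagnosis only.
    """
    try:
        mail.get_content()
    except Exception as ex:
        # get_param('charset') returns the charset exactly as written in the header
        return type(ex).__name__, mail.get_content_type(), mail.get_param('charset')
    return None


# Small hand-made message that declares a charset Python does not know
sample = message_from_bytes(
    b"Subject: test\r\n"
    b"Content-Type: text/plain; charset=DEFAULT_CHARSET\r\n"
    b"\r\n"
    b"hello\r\n",
    policy=default,
)
print(describe_failure(sample))
```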

What is the problem and how can I overcome it?
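
One possible workaround, as a minimal sketch rather than a definitive fix: catch the two exception types and decode the raw payload with a fallback charset. The assumptions here are mine: `get_content_or_fallback` is a hypothetical helper name, and `latin-1` is an arbitrary fallback chosen only because it can decode any byte sequence. The real charset of those messages is unknown, so some characters may come out wrong.

```python
from email import message_from_bytes
from email.policy import default


def get_content_or_fallback(mail, fallback_charset='latin-1'):
    """Return the content of a non-multipart message, decoding manually if needed.

    Hypothetical helper; fallback_charset is an assumption, not the true charset.
    """
    try:
        return mail.get_content()
    except (LookupError, KeyError):
        # LookupError: the declared charset is not a codec Python knows
        # KeyError: no content handler for the declared type,
        #           e.g. a multipart/* header on a message that has no parts
        raw = mail.get_payload(decode=True)  # bytes, transfer encoding already undone
        if raw is None:
            return ''
        return raw.decode(fallback_charset, errors='replace')


# Hand-made examples that mimic two of the failing messages
bad_charset = message_from_bytes(
    b"Content-Type: text/plain; charset=unknown-8bit\r\n\r\nhello\r\n",
    policy=default,
)
bad_multipart = message_from_bytes(
    b"Content-Type: multipart/alternative\r\n\r\nplain body\r\n",
    policy=default,
)
print(get_content_or_fallback(bad_charset))
print(get_content_or_fallback(bad_multipart))
```

With the list from above this would be called as `get_content_or_fallback(spam_mails[i])` for each index in `list_exeption_index`.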



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
