Failing to parse email from Gmail correctly because of `\r\n`

I have a simple email in Gmail that looks like this:

Hi all

@alice - please prepare XXX for tomorrow
@bob - please prepare YYY for tomorrow

best,
Z

and I would like to fetch it, parse it and split by newline, so I would get a list of 5 elements:

['Hi all','@alice ...', '@bob ...', 'best,','Z']

but for some reason I get \r\n inside a sentence, which breaks that line into two lines even though there was no newline at that point in the original email.

I parse it as follows (after getting the proper credentials):

txt = service.users().messages().get(userId=user.email, id=email_msg['id']).execute()
payload = txt["payload"]
headers = payload["headers"]

parts = payload.get("parts")[0]
data = parts["body"]["data"]
data = data.replace("-", "+").replace("_", "/")
decoded_message = str(base64.b64decode(data), "utf-8")
split = decoded_message.splitlines()
final_split = list(filter(None, split))

but then the message I get looks like this:

Hi all\r\n\r\n@alice - please prepare XXX\r\nfor tomorrow\r\n@bob - please prepare YYY for tomorrow\r\n\r\nbest,\r\nZ

so if I split by \r\n or \n I get an invalid result.



Solution 1:[1]

When you decode the data using base64.b64decode() you don't get a string (str); you get a byte string (bytes). Before trying to parse the message you have to convert it into a regular string.

You can do this by calling .decode("utf-8") on the byte string. Then you can just use .splitlines() to split the message.

import base64

txt = service.users().messages().get(userId=user.email, id=email_msg['id']).execute()
payload = txt["payload"]
headers = payload["headers"]

parts = payload.get("parts")[0]
data = parts["body"]["data"]
data = data.replace("-", "+").replace("_", "/")
decoded_data = base64.b64decode(data)

decoded_message = decoded_data.decode("utf-8") # decodes the byte string

split = decoded_message.splitlines() # splits the message into a list

final_split = list(filter(None, split)) # this removes the blank lines
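
As a side note, the two replace() calls above convert the URL-safe base64 alphabet back to the standard one. The standard library's base64.urlsafe_b64decode() accepts "-" and "_" directly, so it can do the same in one step. Below is a minimal sketch; sample_text and the encoded data are made-up stand-ins for parts["body"]["data"], not a real Gmail response:

```python
import base64

# Hypothetical stand-in for the message body; a real payload would come from the Gmail API.
sample_text = "Hi all\r\n\r\n@alice - please prepare XXX for tomorrow\r\nbest,\r\nZ"
data = base64.urlsafe_b64encode(sample_text.encode("utf-8")).decode("ascii")

# urlsafe_b64decode understands "-" and "_", so no manual replace() is needed.
decoded_data = base64.urlsafe_b64decode(data)
print(type(decoded_data))  # a bytes object, not a str

# Convert the bytes into a regular string before parsing.
decoded_message = decoded_data.decode("utf-8")
print(type(decoded_message))

# The manual replace() approach yields the same bytes.
manual = base64.b64decode(data.replace("-", "+").replace("_", "/"))
```

Both approaches produce the same bytes here, so which one to use is mostly a matter of taste.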

Decoding the byte string with .decode() and printing the result changes the output from this:

Hi all\r\n\r\n@alice - please prepare XXX\r\nfor tomorrow\r\n@bob - please prepare YYY for tomorrow\r\n\r\nbest,\r\nZ

To the original message:

Hi all

@alice - please prepare XXX for tomorrow
@bob - please prepare YYY for tomorrow

best,
Z
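
To see why the two views differ, here is a small sketch using a made-up message (the sample text is an assumption, not the real Gmail payload). Printing a bytes object shows its repr, where \r\n appears as escape sequences, while printing the decoded str renders those characters as actual line breaks:

```python
# Hypothetical message text used only for illustration.
decoded_message = "Hi all\r\n\r\n@alice - please prepare XXX for tomorrow\r\nbest,\r\nZ"
raw = decoded_message.encode("utf-8")

# Printing bytes shows the repr: a b'...' literal with \r\n written as escapes.
bytes_view = repr(raw)
print(raw)

# repr() of a str also shows the escapes, which is what you see when inspecting a variable.
escaped_view = repr(decoded_message)
print(escaped_view)

# Printing the str itself writes the real characters, so the line breaks are rendered.
print(decoded_message)
```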

Then after .splitlines() you will get this list:

['Hi all', '', '@alice...', '@bob...', '', 'best,', 'Z']

Note that the list contains empty strings that correspond to the blank lines. To get rid of them, run the last line, final_split = list(filter(None, split)) (there are other ways to do this as well), which gives you what you're looking for:

['Hi all', '@alice...', '@bob...', 'best,', 'Z']
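
For illustration, here is a short sketch of the splitting and filtering steps on a made-up message, including a list comprehension as one alternative to filter(). The sample text and the variable names other than split and final_split are assumptions:

```python
# Hypothetical decoded message with CRLF line endings and blank lines.
decoded_message = "Hi all\r\n\r\n@alice...\r\n@bob...\r\n\r\nbest,\r\nZ"

# splitlines() treats "\r\n" as a single line break, so no stray "\r" is left behind.
split = decoded_message.splitlines()

# split("\n") would instead leave a trailing "\r" on every line but the last.
naive_split = decoded_message.split("\n")

# Option 1: filter(None, ...) drops empty strings.
final_split = list(filter(None, split))

# Option 2: an equivalent list comprehension.
with_comprehension = [line for line in split if line]

# Option 3: strip() also drops lines that contain only whitespace.
without_whitespace_only = [line for line in split + ["   "] if line.strip()]

print(final_split)
```

The strip() variant is only needed if the email may contain lines made up solely of spaces or tabs.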

By the way, I did not install BeautifulSoup for this, but if you want to use it, you would probably apply it after decoding the byte string.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1