Unable to read Outlook (.msg) files in R
I have a large number (1,400) of Outlook emails in .msg format that I want to process further. R meets most of my text-mining needs, but I have not been able to find a way to read these files. I tried readMail from tm.plugin.mail, without success:
library(tm)
library(tm.plugin.mail)
newsgroup <- file.path("D:", "mails")
news <- VCorpus(DirSource(newsgroup), readerControl = list(reader = readMail))
inspect(news)
Any help or suggestions would be greatly appreciated.
Thanks!
Solution 1:[1]
You can now use msgxtractr to do this:
devtools::install_github("hrbrmstr/msgxtractr")
library(msgxtractr)
print(str(read_msg(system.file("extdata/unicode.msg", package="msgxtractr"))))
## List of 7
## $ headers :Classes 'tbl_df', 'tbl' and 'data.frame': 1 obs. of 18 variables:
## ..$ Return-path : chr "<[email protected]>"
## ..$ Received :List of 1
## .. ..$ : chr [1:4] "from st11p00mm-smtpin007.mac.com ([17.172.84.240])\nby ms06561.mac.com (Oracle Communications Messaging Server "| __truncated__ "from mail-vc0-f182.google.com ([209.85.220.182])\nby st11p00mm-smtpin007.mac.com\n(Oracle Communications Messag"| __truncated__ "by mail-vc0-f182.google.com with SMTP id ie18so3484487vcb.13 for\n<[email protected]>; Mon, 18 Nov 2013 00:26:25 -0800 (PST)" "by 10.58.207.196 with HTTP; Mon, 18 Nov 2013 00:26:24 -0800 (PST)"
## ..$ Original-recipient : chr "rfc822;[email protected]"
## ..$ Received-SPF : chr "pass (st11p00mm-smtpin006.mac.com: domain of [email protected]\ndesignates 209.85.220.182 as permitted sender)\"| __truncated__
## ..$ DKIM-Signature : chr "v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com;\ns=20120113; h=mime-version:date:message-id:subject:f"| __truncated__
## ..$ MIME-version : chr "1.0"
## ..$ X-Received : chr "by 10.221.47.193 with SMTP id ut1mr14470624vcb.8.1384763184960;\nMon, 18 Nov 2013 00:26:24 -0800 (PST)"
## ..$ Date : chr "Mon, 18 Nov 2013 10:26:24 +0200"
## ..$ Message-id : chr "<CADtJ4eNjQSkGcBtVteCiTF+YFG89+AcHxK3QZ=-Mt48xygkvdQ@mail.gmail.com>"
## ..$ Subject : chr "Test for TIF files"
## ..$ From : chr "Brian Zhou <[email protected]>"
## ..$ To : chr "[email protected]"
## ..$ Cc : chr "Brian Zhou <[email protected]>"
## ..$ Content-type : chr "multipart/mixed; boundary=001a113392ecbd7a5404eb6f4d6a"
## ..$ Authentication-results : chr "st11p00mm-smtpin007.mac.com; dkim=pass\nreason=\"2048-bit key\" header.d=gmail.com [email protected]\nheader."| __truncated__
## ..$ x-icloud-spam-score : chr "33322\nf=gmail.com;e=gmail.com;pp=ham;spf=pass;dkim=pass;wl=absent;pwl=absent"
## ..$ X-Proofpoint-Virus-Version: chr "vendor=fsecure\nengine=2.50.10432:5.10.8794,1.0.14,0.0.0000\ndefinitions=2013-11-18_02:2013-11-18,2013-11-17,19"| __truncated__
## ..$ X-Proofpoint-Spam-Details : chr "rule=notspam policy=default score=0 spamscore=0\nsuspectscore=0 phishscore=0 bulkscore=0 adultscore=0 classifie"| __truncated__
## $ sender :List of 2
## ..$ sender_email: chr "[email protected]"
## ..$ sender_name : chr "Brian Zhou"
## $ recipients :List of 2
## ..$ :List of 3
## .. ..$ display_name : NULL
## .. ..$ address_type : chr "SMTP"
## .. ..$ email_address: chr "[email protected]"
## ..$ :List of 3
## .. ..$ display_name : NULL
## .. ..$ address_type : chr "SMTP"
## .. ..$ email_address: chr "[email protected]"
## $ subject : chr "Test for TIF files"
## $ body : chr "This is a test email to experiment with the MS Outlook MSG Extractor\r\n\r\n\r\n-- \r\n\r\n\r\nKind regards\r\n"| __truncated__
## $ attachments :List of 2
## ..$ :List of 4
## .. ..$ filename : chr "importOl.tif"
## .. ..$ long_filename: chr "import OleFileIO.tif"
## .. ..$ mime : chr "image/tiff"
## .. ..$ content : raw [1:969674] 49 49 2a 00 ...
## ..$ :List of 4
## .. ..$ filename : chr "raisedva.tif"
## .. ..$ long_filename: chr "raised value error.tif"
## .. ..$ mime : chr "image/tiff"
## .. ..$ content : raw [1:1033142] 49 49 2a 00 ...
## $ display_envelope:List of 2
## ..$ display_cc: chr "Brian Zhou"
## ..$ display_to: chr "[email protected]"
## NULL
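The example above reads a single file. For a whole folder, such as the 1,400 messages in the question, one option is to loop over the files and keep only the fields needed for text mining. The following is a minimal sketch, not part of the original answer: the helper names `field_or_na`, `msgs_to_df` and `read_msg_dir` are hypothetical, and it assumes each parsed message is a list with `subject` and `body` elements, as in the output shown above. Files that fail to parse are silently skipped, which may or may not be what you want.

```r
# Collapse one field of a parsed message to a single string (NA if absent).
# Assumption: fields such as "subject" and "body" are character vectors or NULL.
field_or_na <- function(msg, name) {
  value <- msg[[name]]
  if (is.null(value) || length(value) == 0) {
    NA_character_
  } else {
    paste(as.character(value), collapse = "\n")
  }
}

# Turn a list of parsed messages into a data frame, one row per message.
msgs_to_df <- function(msgs) {
  data.frame(
    subject = vapply(msgs, field_or_na, character(1), name = "subject"),
    body    = vapply(msgs, field_or_na, character(1), name = "body"),
    stringsAsFactors = FALSE
  )
}

# Read every .msg file in a folder with msgxtractr::read_msg().
# Files that raise an error while parsing are dropped from the result.
read_msg_dir <- function(dir) {
  paths <- list.files(dir, pattern = "\\.msg$", full.names = TRUE,
                      ignore.case = TRUE)
  parsed <- lapply(paths, function(p) {
    tryCatch(msgxtractr::read_msg(p), error = function(e) NULL)
  })
  keep <- !vapply(parsed, is.null, logical(1))
  df <- msgs_to_df(parsed[keep])
  df$file <- basename(paths[keep])
  df
}
```

With the folder from the question, this could be called as `mails <- read_msg_dir(file.path("D:", "mails"))`, and the bodies could then be handed to tm with `VCorpus(VectorSource(mails$body))`.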
Solution 2:[2]
The easiest way would be to use the excellent Python msg extractor, which is available on GitHub. If you feel like being creative, you can use the rPython package to wrap that code in R.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | hrbrmstr |
| Solution 2 | Konrad |
