Convert Outlook htmlrtf to HTML in Perl

I managed to extract the RTF part from an Outlook .msg using Email::Outlook::Message. Here's how it looks:

{\*\htmltag84 <b>}\htmlrtf {\b \htmlrtf0
{\*\htmltag148 <span lang="EN-US" style="font-size:12.0pt;color:#002060;mso-fareast-language:EN-IN">}\htmlrtf {\lang1033 \htmlrtf0 FooBar
{\*\htmltag156 </span>}\htmlrtf }\htmlrtf0 
{\*\htmltag92 </b>}\htmlrtf }\htmlrtf0

When Outlook sends the message as Internet mail, it converts the RTF to text/html:

<b><span style="font-size:12.0pt;color:#002060;mso-fareast-language:EN-IN">FooBar</span></b>

I'm trying to do the same using RTF::HTML::Converter, but it strips all styling:

<b>FooBar</b>

Here's the script:

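use strict;
use warnings;
use RTF::HTML::Converter;

# Write the converted HTML to STDOUT.
my $object = RTF::HTML::Converter->new(
        output => \*STDOUT
);

# Lexical filehandle and three-argument open instead of a bareword glob.
open my $rtf_fh, '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
$object->parse_stream($rtf_fh);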

I also tried the unrtf tool, but it strips the styles as well:

<font face="Arial"><font size="3"><b>FooBar</b></font></font>


Solution 1:[1]

In your example, the RTF control words look redundant: each run of them is wrapped in \htmlrtf ... \htmlrtf0. For your use case it might be sufficient to strip those runs completely and keep only the HTML tags. (This naive approach will probably break on more advanced formatting, embedded images, and the like.)

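use strict;
use warnings;

while (my $line = <>) {
    # Drop the RTF-only runs enclosed by \htmlrtf ... \htmlrtf0.
    $line =~ s|\\htmlrtf.*?\\htmlrtf0||g;
    # Unwrap {\*\htmltagN ...} groups, keeping the HTML inside.
    $line =~ s|\{\\\*\\htmltag\d+([^}]*)\}|$1|g;
    print $line;
}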

Running perl test.pl test.rtf on the sample above prints:

 <b>
 <span lang="EN-US" style="font-size:12.0pt;color:#002060;mso-fareast-language:EN-IN"> FooBar
 </span> 
 </b>
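
The line-by-line approach above leaves the delimiter spaces and line breaks in the output and cannot handle an \htmlrtf run that spans several lines. A slightly more careful sketch, still naive, slurps the whole text and treats the single space after a control word as a delimiter. The helper name deencapsulate_html is hypothetical (it is not part of any CPAN module). The sketch assumes that raw line breaks in the RTF are insignificant, that htmltag groups contain no nested groups, and that \'hh escapes can be mapped byte for byte. It ignores \uN Unicode escapes and code pages entirely.

```perl
use strict;
use warnings;

# Hypothetical helper (not from any CPAN module): naive extraction of the
# HTML encapsulated in Outlook RTF. A sketch, not a full RTF parser.
sub deencapsulate_html {
    my ($rtf) = @_;

    # 1. Drop RTF-only runs between \htmlrtf and \htmlrtf0. /s lets a run
    #    span lines; one space after \htmlrtf0 is treated as a delimiter.
    $rtf =~ s/\\htmlrtf(?!0).*?\\htmlrtf0 ?//gs;

    # 2. Assumption: raw CR/LF carry no meaning; real breaks are \par.
    $rtf =~ tr/\r\n//d;

    # 3. Unwrap {\*\htmltagN ...} groups, keeping only the HTML fragment.
    #    Escaped characters such as \} are allowed inside the group.
    $rtf =~ s/\{\\\*\\htmltag\d+ ?((?:[^{}\\]|\\.)*)\}/$1/g;

    # 4. Decode a few common escapes in one pass: \par, \tab, \'hh
    #    and the escaped characters \{ \} \\.
    my %plain = (par => "\n", tab => "\t");
    $rtf =~ s/\\(?:(par|tab)\b ?|'([0-9a-fA-F]{2})|([{}\\]))/
        defined $1 ? $plain{$1} : defined $2 ? chr(hex($2)) : $3/ge;

    return $rtf;
}

# The sample from the question, embedded so the sketch runs on its own.
my $sample = <<'RTF';
{\*\htmltag84 <b>}\htmlrtf {\b \htmlrtf0
{\*\htmltag148 <span lang="EN-US" style="font-size:12.0pt;color:#002060;mso-fareast-language:EN-IN">}\htmlrtf {\lang1033 \htmlrtf0 FooBar
{\*\htmltag156 </span>}\htmlrtf }\htmlrtf0
{\*\htmltag92 </b>}\htmlrtf }\htmlrtf0
RTF

# Prints the <b><span ...>FooBar</span></b> fragment on a single line.
print deencapsulate_html($sample), "\n";
```

For a real file, the same function could be fed the slurped contents (for example via local $/ and <>) instead of the embedded sample. Anything beyond simple inline markup should still be checked against the original message.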

Solution 2:[2]

You would need to parse the RTF to extract the HTML, and I am not aware of any libraries that do that.

If using Redemption (I am its author) is an option, it exposes RDOSession.GetMessageFromMsgFile, which returns an RDOMail object. Reading its HTMLBody property extracts the HTML from the RTF for you.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 clamp
Solution 2