'making sure that my handling of utf8 is correct
I am using Perl for a module that involves processing a lot of Unicode documents. I started getting nervous because I'm not opening and closing files with the utf8 layers like open (OUT, '>:utf8', $textfile). However, I have been thoroughly testing and the output was still as expected. So I want to better understand why.
In a nutshell, my Perl module passes a document to an external service and gets a response. The response will be in Utf8. It uses LWP::UserAgent for this. When it gets the response it just writes it to a file:
my $fh;
open($fh, '>', $outputpath) or die "Could not open file '$outputpath' $!";
print $fh $response->content;
close $fh;
I have diffed these files against Unicode files representing the "expected" output and it is fine. And yet, you can see in my open command that I was not using the utf8 layer. So why is that?
What if I just returned $response->content to some other process, instead of printing it? Would it still be proper Unicode then?
I also have a separate process that I would like to ask about, very similar question. In this case I am trying to build a new service which replaces an old one. The old one read from a file like open(my $fh, '<:utf8', $inputfile) and wrote to a new file like open(my $fh, '>:utf8', $outputfile). The new service will still read the same way, but will not write to the output file anymore. It will send the string to another server using HTTP, and on that server it will be printed to a file using open(my $fh, '>', $outputfile) so no utf8 layer. I can't change that code immediately.
I want the file contents to be the exact same as they would otherwise have been (none of the other processing rules are changing). Should I be nervous about losing the layer?
I think maybe it would help if I understood better what these layers are doing.
Solution 1:[1]
There is no "handling of utf8" in the main question and that in itself isn't right.
The whole thing works, as the server is sending utf8 as you say, in the following way.
The content method used on $response is from HTTP::Message
The content() method sets the raw content if an argument is given. If no argument is given the content is not touched. In either case the original raw content is returned.
Since you don't specify layers† in open the default is used, likely :unix:perlio for Unix, with no encoding (see PerlIO). So you are dumping the original bytes to the disk, unchanged.
Looking further down the page, at decoded_content( %options ), we see the default
default_charsetThis override the default charset guessed by content_charset() or if that fails "ISO-8859-1".
and can establish what you are getting by printing it
say 'Content type: ', $response->content_charset;
where you should get Content type: UTF-8. But when you receive a different encoding from the server then that will wind up in the file and any code that expects utf8 will break.
One should always decode all input and encode all output. Then we know exactly what is going on. As input is decoded the program carries on with character strings (not bytes in whatever encoding was sent). In the end encode suitably for output. This Effective Perler article should be useful. Here you'd use decoded_content and write files opened with :encoding(UTF-8).
With use open ":std", ":encoding(UTF-8)"; all I/O via standard streams in the lexical scope of this pragma will be handled as utf8. (This can be overriden for other specific uses, say by specifying layers in the three argument open.)
See open pragma.
As for the other question, you need to properly encode what you intend to "send to another server." How to do that depends on how you are "sending" it.
† With PerlIO the I/O "layers" can be set so that encoding of input and output is done as needed behind the scenes, as data is read or written. The work is done by Encode. For a nice explanation of the process see Encode::PerlIO. Also see perlunitut, perlunifaq, and perluniitro.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 |
