Can't get Ruby to accept UTF-8 input
I have had this very problem for several versions of Ruby, and I have even changed both computer and OS in the meantime. Still, I can't get past it. I now use Ruby to produce graphical overlays for my professional streaming services, so I really need to solve this once and for all.
Consider this thread a large update to this old question I posted 1 year and 8 months ago, which concerned what was then the current version of Ruby. I'm now working on Windows 10 with Ruby 3.1.1.
Here's a MWE:
puts "Write something with accents such as àòèùì, or €"
asd = gets
puts asd
Here's what happens if I type any of the accented letters:
Here's what happens if I type "€":
In the old thread I mentioned above I used two commands that shouldn't be needed anymore. But let's try them for the sake of argument:
`chcp 65001`
puts "Write something with accents such as àòèùì, or €"
asd = gets
puts asd
chcp 65001 should switch the terminal's encoding to UTF-8, which should be the default as of 2022. However, if I use that line, something does change... for the worse.
If I type any accented letter, I have to press return twice after typing the characters, and I get two broken glyphs instead of one.
If I instead type the "€" symbol, the program crashes instantly, even before I hit return.
Adding # encode: utf-8 has no effect at all on the MWE, with or without the chcp 65001 command.
The issue is that this small problem has deep consequences for every other program I write that has to handle user input containing accented letters.
For instance, here's what happens if I try to get the user input via tty-prompt.
require "tty-prompt"
prompt = TTY::Prompt.new
asd = prompt.ask("Write something with accents such as àòèùì, or €")
puts asd
Accented letters appear as that broken glyph while being typed, then disappear instead of being shown after I hit return:
The "€" symbol is instead shown as a question mark, as usual:
The issue also affects characters that I don't type myself. For instance, Ruby can't properly show the characters used by the tty-spinner gem. Here:
require "tty-spinner"
spinner = TTY::Spinner.new("[:spinner] Loading ...", format: :pulse_2)
spinner.auto_spin
sleep(2)
spinner.stop("Done!")
As you can see, it doesn't show the characters while running:
Finally, Ruby actually IS able to read accented letters written in UTF-8 encoded text files, and it should be able to produce a UTF-8 encoded HTML file. However, I'm using OBS to access that file and OBS can't read it properly, which makes me wonder whether the file really is encoded in UTF-8, since OBS should be able to read it in that case.
This program...
# Prefix the string with one tab per indentation level
def indent(indentazione, stringa)
  ("\t" * indentazione) + stringa
end
testo = File.open('C:\Users\rapto\OneDrive\Documenti\Macro streaming\MietTV\riquadro\riquadro_updater.txt', "r").readlines[0].chomp
pagina = File.open('C:\Users\rapto\OneDrive\Documenti\Macro streaming\MietTV\riquadro\riquadro.html', "w:UTF-8")
pagina.puts(indent(0, "<html>"))
pagina.puts(indent(0, ""))
pagina.puts(indent(0, "<head>"))
pagina.puts(indent(1, "<link rel=\"stylesheet\" href=\"../stile.css\">"))
pagina.puts(indent(0, "</head>"))
pagina.puts(indent(0, ""))
pagina.puts(indent(0, "<body>"))
pagina.puts(indent(1, "<div id=\"riquadro\">"))
pagina.puts(indent(2, "<p id=\"riquadro_testo\">" + testo + "</p>"))
pagina.puts(indent(1, "</div>"))
pagina.puts(indent(0, "</body>"))
pagina.puts(indent(0, ""))
pagina.puts(indent(0, "</html>"))
pagina.close # flush and close the file so other programs see the complete page
puts "Operazione completata"
...will read this text file...
...created by this batch script...
@ECHO OFF
chcp 65001
SET /P data1= "Inserisci il testo del riquadro: "
ECHO %data1%> "C:\Users\rapto\OneDrive\Documenti\Macro streaming\MietTV\riquadro\riquadro_updater.txt"
"C:\Users\rapto\OneDrive\Documenti\Macro streaming\MietTV\riquadro\riquadro_updater.rb"
...and produce this HTML page...
<html>
<head>
<link rel="stylesheet" href="../stile.css">
</head>
<body>
<div id="riquadro">
<p id="riquadro_testo">La magia nera della narrazione: età dei personaggi</p>
</div>
</body>
</html>
...which will be correctly rendered by Opera...
...but not by OBS, which should be able to read UTF-8 encoded pages.
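One way to check whether the generated file really contains valid UTF-8 is to read its raw bytes and ask Ruby to validate them. The following is only a minimal sketch: the helper name utf8_file? and the path in the usage comment are hypothetical and not part of the program above.

```ruby
# Returns true if the file's raw bytes form valid UTF-8.
# `utf8_file?` is a hypothetical helper name used for illustration.
def utf8_file?(path)
  bytes = File.binread(path)            # read the bytes without any transcoding
  bytes.force_encoding(Encoding::UTF_8) # re-tag the bytes, do not convert them
  bytes.valid_encoding?
end

# Example usage (the path is a placeholder):
# puts utf8_file?("riquadro.html")
```

If this returns true, the file itself is probably fine and the problem more likely lies in how the program reading it interprets the bytes.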
Luckily I can work around this latter problem by converting all accented letters to their respective HTML entities. Still, it'd be nice if everything just worked.
To me it clearly looks like Ruby has some issue handling UTF-8 encoded files. It may well be that I'm missing something in how to deal with them, or that I set something incorrectly. All suggestions are welcome.
UPDATE
As indicated by @Holger Just, the issue seems to be mostly caused by the default Windows 10 terminal. I solved the problem by downloading its sort-of-updated version from the Microsoft Store, "Windows Terminal".
If I run the first MWE I provided in that terminal, I can actually type the accented letters without hassle and correctly receive them back as output:
It still doesn't work with the "€" symbol though:
The program shows similar issues as before if I include the chcp 65001 part. If I type an accented letter I need to press return twice, and then I receive these two symbols as output:
It crashes if I type the "€" symbol.
Solution 1:[1]
This is probably related to most Windows shells NOT using UTF-8 encoding on their own. Thus, if an external program (such as your Ruby program) reads data from the shell it is likely not encoded in UTF-8 (as expected by Ruby) but in some other encoding depending on your system.
However, Ruby has no way to actually know the encoding of the data, so you may have to tell it. Since Ruby 3.0, Ruby defaults to assuming UTF-8 as the external encoding on Windows (see Feature #16604 for details). Previous versions used the "native" encoding of your Windows version, which could cause all kinds of issues when writing data to e.g. files.
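As a rough illustration of "telling Ruby", IO#set_encoding accepts an external encoding (what the bytes really are) and an internal encoding (what your program wants to work with). The sketch below assumes the console uses code page 850; that code page and the helper name read_line_as_utf8 are assumptions for illustration, not something confirmed for your system.

```ruby
# Reads one line from `io`, treating the incoming bytes as `console_encoding`
# and transcoding them to UTF-8. `read_line_as_utf8` is a hypothetical helper.
def read_line_as_utf8(io, console_encoding)
  io.set_encoding(console_encoding, Encoding::UTF_8) # external, internal
  io.gets
end

# Example usage, ASSUMING the console really uses code page 850:
# line = read_line_as_utf8($stdin, Encoding::CP850)
# puts line
```

Whether this helps depends on the external encoding matching what the shell actually sends. If the guess is wrong, the result will be different garbage rather than correct text.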
Now, what happens in your example is that Ruby reads from the shell with gets. The shell provides some data which Ruby assumes to be in UTF-8 because of its Encoding.default_external setting, but which is not.
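The mismatch can be reproduced without a console at all. The following sketch assumes a shell using code page 850, which sends the single byte 0x8A for "è"; both the code page and the helper name describe are assumptions made for illustration.

```ruby
# Describes how Ruby sees a string: its encoding label, whether the bytes are
# valid for that label, and the raw bytes. `describe` is a hypothetical helper.
def describe(str)
  { encoding: str.encoding.name, valid: str.valid_encoding?, bytes: str.bytes }
end

# ASSUMPTION: a console on code page 850 sends the byte 0x8A for "è".
raw = "\x8A".b

# What Ruby does by default: label the bytes as UTF-8 without checking them.
mislabelled = raw.dup.force_encoding(Encoding::UTF_8)
p describe(mislabelled)

# Labelling the bytes with their real encoding first, then converting.
relabelled = raw.dup.force_encoding(Encoding::CP850).encode(Encoding::UTF_8)
p describe(relabelled)
```

The first string is not valid UTF-8, which is the kind of data that tends to show up as broken glyphs. The second one is a proper UTF-8 string.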
Depending on how the shell interprets the data sent by Ruby, things could be unexpected...
The only actual solution would be to make sure that your shell agrees with Ruby about the encoding of the data they exchange. For that, you likely need to adjust the settings of your shell.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Holger Just |














