How to use Unicode chars with Nokogiri::XML::DocumentFragment
I want to use Unicode characters with Nokogiri::XML::DocumentFragment.
frag = Nokogiri::XML::DocumentFragment.parse("<foo>ü</foo>")
=> <foo>&#xFC;</foo>
The Unicode character is escaped. I have to pass encoding: 'UTF-8' when serializing to get a readable character.
frag.to_html(encoding: 'UTF-8')
=> "<foo>ü</foo>"
Is there an option to set the encoding when parsing the string?
Nokogiri::HTML::DocumentFragment.parse treats the string as I expected, but I need to use XML.
frag = Nokogiri::HTML::DocumentFragment.parse("<foo>ü</foo>")
=> <foo>ü</foo>
Solution 1:[1]
According to the Nokogiri documentation, the text is already stored internally as UTF-8:
Strings are always stored as UTF-8 internally. Methods that return text values will always return UTF-8 encoded strings. Methods that return XML (like to_xml, to_html and inner_html) will return a string encoded like the source document.
So if you call, for example, #text on your frag instead of printing the entire frag object, you'll see the ü printed correctly.
puts frag.text
# => ü
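Otherwise, you can parse with Nokogiri.XML instead of DocumentFragment.parse and pass the encoding explicitly as the third argument. Note that this yields a full document, including the XML declaration.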
doc = Nokogiri.XML('<foo>ü</foo>', nil, 'UTF-8')
puts doc
# => <?xml version="1.0" encoding="UTF-8"?>
# => <foo>ü</foo>
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Erik Brüggemann |
