How to convert Cyrillic into UTF-16
tl;dr: Is there a way to convert Cyrillic text stored in a hashtable into UTF-16 escape sequences?
Like кириллица into \u043a\u0438\u0440\u0438\u043b\u043b\u0438\u0446\u0430
I need to import a file, parse it into id and value, then convert it to .json, and now I'm struggling to find a way to convert the value into UTF escape codes.
And yes, it is needed that way.
cyrillic.txt:
1 кириллица
PowerShell:
clear-host
$temp = @() # initialize as an array so += can collect multiple objects
foreach ($line in (Get-Content C:\Users\users\Downloads\cyrillic.txt)) {
    $nline = $line.Split(' ', 2)
    $properties = @{
        'id'    = $nline[0] # stores "1" from the file
        'value' = $nline[1] # stores "кириллица" from the file
    }
    $temp += New-Object PSObject -Property $properties
}
$temp | ConvertTo-Json | Out-File "C:\Users\user\Downloads\data.json"
Output:
[
  {
    "id": "1",
    "value": "кириллица"
  }
]
Needed:
[
  {
    "id": "1",
    "value": "\u043a\u0438\u0440\u0438\u043b\u043b\u0438\u0446\u0430"
  }
]
At this point, as a newcomer to PowerShell, I have no idea even how to search for this properly.
Solution 1:
Building on Jeroen Mostert's helpful comment, the following works robustly, assuming that the input file contains no NUL characters (which is usually a safe assumption for text files):
# Sample value pair; loop over file lines omitted for brevity.
$nline = '1 кириллица'.Split(' ', 2)
$properties = [ordered] @{
    id = $nline[0]
    # Insert aux. NUL characters before the 4-digit hex representations of each
    # code unit, to be removed later.
    value = -join ([uint16[]] [char[]] $nline[1]).ForEach({ "`0{0:x4}" -f $_ })
}
# Convert to JSON, then remove the escaped representations of the aux. NUL chars.,
# resulting in proper JSON escape sequences.
# Note: ... | Out-File ... omitted.
(ConvertTo-Json @($properties)) -replace '\\u0000', '\u'
Output (pipe to ConvertFrom-Json to verify that it works):
[
  {
    "id": "1",
    "value": "\u043a\u0438\u0440\u0438\u043b\u043b\u0438\u0446\u0430"
  }
]
Explanation:
- [uint16[]] [char[]] $nline[1] converts the [char] instances of the string stored in $nline[1] into the underlying UTF-16 code units (a .NET [char] is an unsigned 16-bit integer encoding a Unicode code point).
  - Note that this works even with Unicode characters whose code points are above 0xFFFF, i.e. too large to fit into a [uint16]. Such characters outside the so-called BMP (Basic Multilingual Plane), e.g. emoji, are simply represented as pairs of UTF-16 code units, so-called surrogate pairs, which a JSON processor should recognize (ConvertFrom-Json does).
  - However, on Windows such characters may not render correctly, depending on your console window's font. The safest option is to use Windows Terminal, available in the Microsoft Store.
- The call to the .ForEach() array method processes each resulting code unit:
  - "`0{0:x4}" -f $_ uses an expandable string to create a string that starts with a NUL character ("`0"), followed by a 4-digit hex representation (x4) of the code unit at hand, created via -f, the format operator.
  - This trick of temporarily replacing what should ultimately be a verbatim \u prefix with a NUL character is needed because a verbatim \ embedded in a string value would invariably be doubled in its JSON representation, given that \ acts as the escape character in JSON.
- The result is something like "<NUL>043a", which ConvertTo-Json transforms as follows, given that it must escape each NUL character as \u0000: "\u0000043a"
- The result from ConvertTo-Json can then be transformed into the desired escape sequences simply by replacing \u0000 (escaped as \\u0000 for use with the regex-based -replace operator) with \u, e.g.:
  "\u0000043a" -replace '\\u0000', '\u'   # -> "\u043a", i.e. к
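Conceptually, the whole pipeline boils down to "emit each UTF-16 code unit of the string as a 4-digit hex escape". As a language-neutral sanity check, here is a small Python sketch of that same mapping (the helper name to_utf16_escapes is made up for illustration):

```python
# Sketch: map each UTF-16 code unit of a string to a \uXXXX escape.
# Non-BMP characters become surrogate pairs, just as the PowerShell
# [uint16[]] [char[]] cast produces them.
def to_utf16_escapes(s: str) -> str:
    # "utf-16-be" yields exactly 2 bytes per code unit, big-endian, no BOM.
    units = s.encode("utf-16-be")
    return "".join(
        "\\u{:04x}".format(int.from_bytes(units[i:i + 2], "big"))
        for i in range(0, len(units), 2)
    )

print(to_utf16_escapes("кириллица"))
# -> \u043a\u0438\u0440\u0438\u043b\u043b\u0438\u0446\u0430
```

Feeding a non-BMP character such as U+1F600 through the same helper yields its surrogate pair, \ud83d\ude00, which a JSON parser reassembles into the original character.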
Solution 2:
Here's a way that simply saves the string to a UTF-16BE file and then reads the bytes back out and formats them, skipping the first 2 bytes, which are the BOM (\ufeff). ($_ didn't work by itself.) Note that there are two UTF-16 encodings with different byte orders, big endian and little endian. The range of Cyrillic is U+0400..U+04FF. -NoNewline is added so no line terminator gets encoded.
'кириллица' | Set-Content utf16be.txt -Encoding BigEndianUnicode -NoNewline
$list = Get-Content utf16be.txt -Encoding Byte -ReadCount 2 |
    ForEach-Object { '\u{0:x2}{1:x2}' -f $_[0], $_[1] } |
    Select-Object -Skip 1
-join $list
\u043a\u0438\u0440\u0438\u043b\u043b\u0438\u0446\u0430
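What this solution does with a temporary file can be sketched in memory in Python (variable names are illustrative): UTF-16BE with a BOM starts with the bytes FE FF, and pairing the remaining bytes yields one escape per code unit.

```python
# Illustrative only: model the BigEndianUnicode file contents in memory.
data = "кириллица".encode("utf-16-be")  # raw big-endian code units, no BOM
with_bom = b"\xfe\xff" + data           # BigEndianUnicode output starts with a BOM
# Take two bytes at a time (like -ReadCount 2) and skip the BOM pair (like -Skip 1).
pairs = [with_bom[i:i + 2] for i in range(0, len(with_bom), 2)]
escapes = "".join("\\u{:02x}{:02x}".format(p[0], p[1]) for p in pairs[1:])
print(escapes)  # -> \u043a\u0438\u0440\u0438\u043b\u043b\u0438\u0446\u0430
```

This also makes the byte-order caveat concrete: with little-endian UTF-16 the two bytes of each pair would be swapped, producing the wrong escapes.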
Solution 3:
There must be a simpler way of doing this, but this could work for you:
$temp = foreach ($line in (Get-Content -Path 'C:\Users\users\Downloads\cyrillic.txt')) {
    $nline = $line.Split(' ', 2)
    # output an object straight away so it gets collected in variable $temp
    [PsCustomObject]@{
        id    = $nline[0] # stores "1" from the file
        value = (([System.Text.Encoding]::BigEndianUnicode.GetBytes($nline[1]) |
                 ForEach-Object { '{0:x2}' -f $_ }) -join '' -split '(.{4})' -ne '' |
                 ForEach-Object { '\u{0}' -f $_ }) -join ''
    }
}
($temp | ConvertTo-Json) -replace '\\\\u', '\u' | Out-File 'C:\Users\user\Downloads\data.json'
Simpler using .ToCharArray():
$temp = foreach ($line in (Get-Content -Path 'C:\Users\users\Downloads\cyrillic.txt')) {
    $nline = $line.Split(' ', 2)
    # output an object straight away so it gets collected in variable $temp
    [PsCustomObject]@{
        id    = $nline[0] # stores "1" from the file
        value = ($nline[1].ToCharArray() | ForEach-Object { '\u{0:x4}' -f [uint16]$_ }) -join ''
    }
}
($temp | ConvertTo-Json) -replace '\\\\u', '\u' | Out-File 'C:\Users\user\Downloads\data.json'
Value "кириллица" will be converted to \u043a\u0438\u0440\u0438\u043b\u043b\u0438\u0446\u0430
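The trailing -replace '\\\\u', '\u' matters because the computed value contains literal backslashes, which the JSON serializer doubles on output. A minimal Python illustration of the same doubling and fix, using the standard json module:

```python
import json

# The computed value holds a literal backslash-u sequence ...
value = "\\u043a"                      # the 7-character string \u043a
dumped = json.dumps({"value": value})  # the serializer doubles the backslash
print(dumped)                          # {"value": "\\u043a"}

# ... so, like the PowerShell -replace '\\\\u', '\u', we un-double it:
fixed = dumped.replace("\\\\u", "\\u")
print(fixed)                           # {"value": "\u043a"}
```

The same un-doubling is what Solution 1 avoids up front by using NUL placeholders instead of literal backslashes.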
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
