How to convert Cyrillic into UTF-16

tl;dr Is there a way to convert Cyrillic text stored in a hashtable into UTF-16 escape sequences? Like кириллица into \u043a\u0438\u0440\u0438\u043b\u043b\u0438\u0446\u0430

I need to import a file, parse it into id and value, then convert it to .json, and now I'm struggling to find a way to convert the value into UTF escape codes.

And yes, it is needed that way.

cyrillic.txt:

1 кириллица

PowerShell:

clear-host
$temp = @()   # initialize as an array so += below appends objects
foreach ($line in (Get-Content C:\Users\users\Downloads\cyrillic.txt)){
    $nline = $line.Split(' ', 2)
    $properties = @{
        'id'= $nline[0] #stores "1" from file
        'value'=$nline[1] #stores "кириллица" from file
    }
    $temp+=New-Object PSObject -Property $properties
}
$temp | ConvertTo-Json | Out-File "C:\Users\user\Downloads\data.json"

Output:

[
    {
        "id":  "1",
        "value":  "кириллица"
    },
]

Needed:

[
    {
        "id":  "1",
        "value":  "\u043a\u0438\u0440\u0438\u043b\u043b\u0438\u0446\u0430"
    },
]

At this point, as a newcomer to PowerShell, I have no idea even how to search for this properly.



Solution 1:[1]

Building on Jeroen Mostert's helpful comment, the following works robustly, assuming that the input file contains no NUL characters (which is usually a safe assumption for text files):

# Sample value pair; loop over file lines omitted for brevity.
$nline = '1 кириллица'.Split(' ', 2)

$properties = [ordered] @{
  id = $nline[0]
  # Insert aux. NUL characters before the 4-digit hex representations of each
  # code unit, to be removed later.
  value = -join ([uint16[]] [char[]] $nline[1]).ForEach({ "`0{0:x4}" -f $_ })
}

# Convert to JSON, then remove the escaped representations of the aux. NUL chars.,
# resulting in proper JSON escape sequences.
# Note: ... | Out-File ... omitted.
(ConvertTo-Json @($properties)) -replace '\\u0000', '\u'

Output (pipe to ConvertFrom-Json to verify that it works):

[
  {
    "id": "1",
    "value": "\u043a\u0438\u0440\u0438\u043b\u043b\u0438\u0446\u0430"
  }
]
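To confirm the escaped output is valid JSON, a quick roundtrip sketch (the literal below mirrors the output above):

```powershell
# Roundtrip sketch: the \uXXXX escapes parse back to the original Cyrillic text.
$json = '[{"id":"1","value":"\u043a\u0438\u0440\u0438\u043b\u043b\u0438\u0446\u0430"}]'
($json | ConvertFrom-Json).value   # -> кириллица
```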

Explanation:

  • [uint16[]] [char[]] $nline[1] converts the [char] instances of the string stored in $nline[1] into the underlying UTF-16 code units (a .NET [char] is an unsigned 16-bit integer encoding a UTF-16 code unit).

    • Note that this works even with Unicode characters that have code points above 0xFFFF, i.e. that are too large to fit into a [uint16]. Such characters outside the so-called BMP (Basic Multilingual Plane), e.g. emoji, are simply represented as pairs of UTF-16 code units, so-called surrogate pairs, which a JSON processor should recognize (ConvertFrom-Json does).
    • However, on Windows such chars. may not render correctly, depending on your console window's font. The safest option is to use Windows Terminal, available in the Microsoft Store.
  • The call to the .ForEach() array method processes each resulting code unit:

    • "`0{0:x4}" -f $_ uses an expandable string to create a string that starts with a NUL character ("`0"), followed by a 4-digit hex. representation (x4) of the code unit at hand, created via -f, the format operator.

      • This trick of temporarily replacing what should ultimately be a verbatim \u prefix with a NUL character is needed, because a verbatim \ embedded in a string value would invariably be doubled in its JSON representation, given that \ acts as the escape character in JSON.
    • The result is something like "<NUL>043a", which ConvertTo-Json transforms as follows, given that it must escape each NUL character as \u0000:

      "\u0000043a"
      
  • The result from ConvertTo-Json can then be transformed into the desired escape sequences simply by replacing \u0000 (escaped as \\u0000 for use with the regex-based -replace operator) with \u, e.g.:

      "\u0000043a" -replace '\\u0000', '\u' # -> "\u043a", i.e. к
    
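If you don't need the JSON step, the core conversion from Solution 1 can be used on its own; a minimal sketch:

```powershell
# Convert each UTF-16 code unit of the string directly to a \uXXXX escape.
# No NUL placeholder is needed here, because no JSON serializer is involved.
# Works for non-BMP characters too: they simply yield two escapes (a surrogate pair).
$escaped = -join ([uint16[]] [char[]] 'кириллица').ForEach({ '\u{0:x4}' -f $_ })
$escaped   # -> \u043a\u0438\u0440\u0438\u043b\u043b\u0438\u0446\u0430
```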

Solution 2:[2]

Here's a way that simply saves the string to a UTF-16BE file and then reads the bytes back out, formatting each 2-byte pair as an escape and skipping the first pair, which is the BOM (\ufeff). ($_ didn't work by itself.) Note that there are two UTF-16 encodings with different byte orders, big-endian and little-endian. The Cyrillic range is U+0400..U+04FF. Added -NoNewline.

'кириллица' | set-content utf16be.txt -encoding BigEndianUnicode -nonewline
$list = get-content utf16be.txt -Encoding Byte -readcount 2 | 
  % { '\u{0:x2}{1:x2}' -f $_[0],$_[1] } | select -skip 1
-join $list

\u043a\u0438\u0440\u0438\u043b\u043b\u0438\u0446\u0430
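The same big-endian byte pairs can be obtained without a temporary file via [System.Text.Encoding]::BigEndianUnicode, which also sidesteps the BOM handling (GetBytes emits no BOM); a sketch — note that Get-Content -Encoding Byte only exists in Windows PowerShell 5.1, whereas this variant works in newer PowerShell versions too:

```powershell
# Get the big-endian UTF-16 bytes of the string directly (no file, no BOM).
$bytes = [System.Text.Encoding]::BigEndianUnicode.GetBytes('кириллица')
# Format each 2-byte pair as a \uXXXX escape and join the results.
$escaped = -join (0..($bytes.Count / 2 - 1) | ForEach-Object {
    '\u{0:x2}{1:x2}' -f $bytes[2 * $_], $bytes[2 * $_ + 1]
})
$escaped   # -> \u043a\u0438\u0440\u0438\u043b\u043b\u0438\u0446\u0430
```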

Solution 3:[3]

There must be a simpler way of doing this, but this could work for you:

$temp = foreach ($line in (Get-Content -Path 'C:\Users\users\Downloads\cyrillic.txt')){
    $nline = $line.Split(' ', 2)
    # output an object straight away so it gets collected in variable $temp
    [PsCustomObject]@{
        id    = $nline[0]   #stores "1" from file
        value = (([system.Text.Encoding]::BigEndianUnicode.GetBytes($nline[1]) | 
                ForEach-Object {'{0:x2}' -f $_ }) -join '' -split '(.{4})' -ne '' | 
                ForEach-Object { '\u{0}' -f $_ }) -join ''
    }
}
($temp | ConvertTo-Json) -replace '\\\\u', '\u' | Out-File 'C:\Users\user\Downloads\data.json'

Simpler using .ToCharArray():

$temp = foreach ($line in (Get-Content -Path 'C:\Users\users\Downloads\cyrillic.txt')){
    $nline = $line.Split(' ', 2)
    # output an object straight away so it gets collected in variable $temp
    [PsCustomObject]@{
        id    = $nline[0]   #stores "1" from file
        value = ($nline[1].ToCharArray() | ForEach-Object {'\u{0:x4}' -f [uint16]$_ }) -join ''
    }
}
($temp | ConvertTo-Json) -replace '\\\\u', '\u' | Out-File 'C:\Users\user\Downloads\data.json'

Value "кириллица" will be converted to \u043a\u0438\u0440\u0438\u043b\u043b\u0438\u0446\u0430

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2
Solution 3