How to make a char array that doesn't separate diacritics?

I'm trying to separate a Hebrew word into letters in C#, but ToCharArray() separates the diacritics as if they're separate letters (which they're not). I'm fine with either keeping the letters whole with their diacritics, or worst case getting rid of the diacritics altogether.

Example: כֶּלֶב is coming out as 6 different letters.



Solution 1:[1]

The StringInfo class knows about base characters and accents and can handle this:

string s = "כֶּלֶב";
System.Globalization.TextElementEnumerator charEnum = System.Globalization.StringInfo.GetTextElementEnumerator(s);
while (charEnum.MoveNext())
{
    Console.WriteLine(charEnum.GetTextElement());
}

will print 3 lines:

כֶּ
לֶ
ב
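
If you want an array back rather than console output, the same enumerator can fill a list. This is only a sketch: SplitTextElements is a hypothetical helper name, not part of the framework, and the snippet is written as top-level statements, which assumes C# 9 or later.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical helper (not a framework method): returns each base character
// together with its combining marks as one string per text element.
static string[] SplitTextElements(string s)
{
    var result = new List<string>();
    TextElementEnumerator e = StringInfo.GetTextElementEnumerator(s);
    while (e.MoveNext())
    {
        result.Add(e.GetTextElement());
    }
    return result.ToArray();
}

string[] letters = SplitTextElements("כֶּלֶב");
Console.WriteLine(letters.Length); // 3
```

The elements are strings rather than chars, because a letter with its diacritics does not fit in a single char.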

Solution 2:[2]

Strings in C# are stored as arrays of char. That is to say: they are arrays of UTF-16 code units. ToCharArray() just returns a copy of those code units, and it sometimes takes several of them to form a single "symbol". Here each diacritic is its own code unit, sitting right after the letter it belongs to.

Would char.GetUnicodeCategory(char) be of any help? Maybe you could split that array on OtherLetter or something (not familiar with Hebrew)?

// Needs: using System; using System.Linq;
const string word = "כֶּלֶב";
Console.WriteLine(word.Length);
Console.WriteLine(string.Join(" ", word.ToCharArray().Select(x => (int)x)));
Console.WriteLine(string.Join(" ", word.ToCharArray().Select(char.GetUnicodeCategory)));

Output:

6
1499 1468 1462 1500 1462 1489
OtherLetter NonSpacingMark NonSpacingMark OtherLetter NonSpacingMark OtherLetter
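
Building on those categories, here is a rough sketch of the fallback from the question (dropping the diacritics altogether). StripMarks is a hypothetical helper name, and it assumes every mark you want gone is reported as NonSpacingMark, which holds for the word above but not necessarily for other scripts, where some marks are SpacingCombiningMark. It is written as top-level statements, so it assumes C# 9 or later.

```csharp
using System;
using System.Globalization;
using System.Linq;

// Hypothetical helper: keeps every char except those whose Unicode
// category is NonSpacingMark (assumption: that covers the diacritics).
static string StripMarks(string s)
{
    char[] kept = s
        .Where(c => char.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
        .ToArray();
    return new string(kept);
}

const string word = "כֶּלֶב";
string bare = StripMarks(word);
Console.WriteLine(bare.Length);                                 // 3
Console.WriteLine(string.Join(" ", bare.Select(x => (int)x)));  // 1499 1500 1489
```

After that, bare.ToCharArray() gives one char per letter.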

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution    Source
Solution 1  Hans Kesting
Solution 2  Matt Thomas