g_utf8_collate() returns 0 for two Japanese strings which are not equal
The environment on Linux SLES 15SP2:
$ egrep 'LANG|LC_' catserver.log_SRP-30932.20220209
export LC_ALL=de_DE.UTF-8
export DB_LANG=de_DE.UTF-8
export LANG=de_DE.UTF-8
Result of comparing two Japanese strings p1 and p2 with g_utf8_collate() and with strcmp(), followed by the hex representation of the two strings:
p1: [ゲルハルト・A・リッター] p2: [ゲアハルト・A・リッター] g_utf8_collate(): 0 strcmp(): 1
p1: e382b2e383abe3838fe383abe38388e383bb41e383bbe383aae38383e382bfe383bc
p2: e382b2e382a2e3838fe383abe38388e383bb41e383bbe383aae38383e382bfe383bc
...
p1: [チャールズ・A・ビアード] p2: [ゲルハルト・A・リッター] g_utf8_collate(): 0 strcmp(): 1
p1: e38381e383a3e383bce383abe382bae383bb41e383bbe38393e382a2e383bce38389
p2: e382b2e383abe3838fe383abe38388e383bb41e383bbe383aae38383e382bfe383bc
I don't know what these Japanese strings mean. They come from a bibliographic database where our Library Management System fails because of the above problem with g_utf8_collate(). I inserted the strcmp() call and the hex dump into the large C code base to understand the failure.
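The diagnostic described above can be sketched in plain C as follows. This is only an illustration: `hex_dump` and `report_pair` are hypothetical helper names, and `strcoll()` stands in for g_utf8_collate() on the assumption that both follow the collation rules of the current locale.

```c
#include <locale.h>
#include <stdio.h>
#include <string.h>

/* Write the bytes of s as lowercase hex into out (capacity out_len).
 * Returns the number of hex characters written, or -1 if out is too small. */
int hex_dump(const char *s, char *out, size_t out_len)
{
    size_t n = strlen(s);
    if (out_len < 2 * n + 1)
        return -1;
    out[0] = '\0';
    for (size_t i = 0; i < n; i++)
        snprintf(out + 2 * i, 3, "%02x", (unsigned char)s[i]);
    return (int)(2 * n);
}

/* Print one diagnostic record in roughly the shape of the log above.
 * strcoll() is a stand-in for g_utf8_collate() (assumption, see lead-in);
 * it only honours the environment after setlocale(LC_ALL, "") was called. */
void report_pair(const char *p1, const char *p2)
{
    char h1[512], h2[512];
    printf("p1: [%s] p2: [%s] strcoll(): %d strcmp(): %d\n",
           p1, p2, strcoll(p1, p2), strcmp(p1, p2));
    if (hex_dump(p1, h1, sizeof h1) >= 0 && hex_dump(p2, h2, sizeof h2) >= 0)
        printf("p1: %s\np2: %s\n", h1, h2);
}
```

Calling `setlocale(LC_ALL, "")` once at program start and then `report_pair(p1, p2)` for each pair would produce one record per comparison; the byte-wise strcmp() result and the hex dump make it obvious when the collation result of 0 cannot be a genuine equality.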
Solution 1:[1]
Background
Look at the first pair: only one character is different:
- ゲルハルト・A・リッター = Geruharuto A rittā
- ゲアハルト・A・リッター = Geaharuto A rittā
These are two ways to write the name Gerhard A. Ritter in Katakana. Japanese speakers do that for the same reason English speakers write the name Maria Sharapova although the real name is Мария Шарапова: it is the closest way to write that name in a different writing system. And sometimes there is more than one way to do so.
In this case both ways to write "Gerhard" can be considered the same. However, I cannot imagine that a collation exists which treats two entirely different Kana as the same; this cannot be correct.
Now the other pair, which has nothing in common besides the Latin A:
- チャールズ・A・ビアード = chāruzu A biādo = Charles A. Beard
- ゲルハルト・A・リッター = geruharuto A rittā (same as first p1)
Charles also has a Japanese article on Wikipedia, where you can see his name written exactly as it is here. These are two different persons, and everything besides ・A・ mismatches. In no way can any collation see these as the same; there must be a bigger mistake at work.
Library-wise, you simply have to expect different writing systems for the same name, especially when looking things up or trying to find them again, since there is no rule about which name must be written in which writing system. Examples:
| Latin letters | Katakana (Japanese) | Hebrew | Cyrillic (Russian) | Arabic | Greek |
|---|---|---|---|---|---|
| Ken Ishii | ケン・イシイ | קן אישיי | Кен Исии | كين إيشي | Κεν Ισίι |
| Michael Schumacher | ミハエル・シューマッハ | מיכאל שומאכר | Михаэль Шумахер | مايكل شوماخر | Μίχαελ Σουμάχερ |
Solution
GLib-wise, there is
- a Red Hat ticket describing similar impossible results, and
- the question "Glib::ustring and Japanese characters", along with its answers.
Both point out: you cannot have a locale of de_DE.UTF-8 and then compare non-German input collation-wise. If you want to use g_utf8_collate() on Katakana, you have to set your locale to e.g. ja_JP.UTF-8.
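A minimal sketch of switching the collation locale around a single comparison, using only standard C. `collate_in_locale` is a hypothetical helper, and `strcoll()` again stands in for g_utf8_collate(), assuming the latter follows the process's LC_COLLATE setting. The requested locale must be installed on the system (check with `locale -a`), otherwise `setlocale()` fails and the helper reports that instead of comparing.

```c
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Compare a and b under the collation rules of locale_name.
 * On success returns 1 and stores the strcoll() result in *result.
 * Returns 0 if the locale is unavailable or memory runs out.
 * The previous LC_COLLATE setting is restored before returning. */
int collate_in_locale(const char *locale_name,
                      const char *a, const char *b, int *result)
{
    /* setlocale(category, NULL) only queries the current setting. Copy it,
     * because a later setlocale() call may overwrite the returned buffer. */
    const char *current = setlocale(LC_COLLATE, NULL);
    char *saved = NULL;
    if (current != NULL) {
        saved = malloc(strlen(current) + 1);
        if (saved == NULL)
            return 0;
        strcpy(saved, current);
    }

    if (setlocale(LC_COLLATE, locale_name) == NULL) {
        free(saved);            /* locale not installed: nothing was changed */
        return 0;
    }

    *result = strcoll(a, b);

    if (saved != NULL)
        setlocale(LC_COLLATE, saved);
    free(saved);
    return 1;
}
```

A call such as `collate_in_locale("ja_JP.UTF-8", p1, p2, &r)` would then compare the two Katakana strings under Japanese rules. Note that `setlocale()` changes process-wide state and is not safe to call while other threads use locale-dependent functions, so in a server it may be preferable to set the locale once at startup.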
Your new problem may then become recognizing the writing system from the characters and setting the locale accordingly. However, this easily clashes with Latin letters being used in many alphabets (English, German, Turkish...).
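Recognizing the writing system could start with something like the following sketch, which decodes UTF-8 and checks for code points in the Unicode Katakana block (U+30A0 to U+30FF). The names `utf8_decode`, `script_hint` and `guess_script` are made up for this illustration, and the rule "any Katakana wins over Latin" is an assumption, not an established heuristic.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one UTF-8 sequence at s into *cp. Returns the number of bytes
 * consumed, or 0 if the sequence is malformed. Sketch only: overlong
 * encodings and surrogates are not rejected. Safe on NUL-terminated
 * input because a NUL byte never passes the continuation-byte check. */
size_t utf8_decode(const unsigned char *s, uint32_t *cp)
{
    if (s[0] < 0x80) { *cp = s[0]; return 1; }
    if ((s[0] & 0xE0) == 0xC0 && (s[1] & 0xC0) == 0x80) {
        *cp = ((uint32_t)(s[0] & 0x1F) << 6) | (s[1] & 0x3F);
        return 2;
    }
    if ((s[0] & 0xF0) == 0xE0 && (s[1] & 0xC0) == 0x80 &&
        (s[2] & 0xC0) == 0x80) {
        *cp = ((uint32_t)(s[0] & 0x0F) << 12) |
              ((uint32_t)(s[1] & 0x3F) << 6) | (s[2] & 0x3F);
        return 3;
    }
    if ((s[0] & 0xF8) == 0xF0 && (s[1] & 0xC0) == 0x80 &&
        (s[2] & 0xC0) == 0x80 && (s[3] & 0xC0) == 0x80) {
        *cp = ((uint32_t)(s[0] & 0x07) << 18) |
              ((uint32_t)(s[1] & 0x3F) << 12) |
              ((uint32_t)(s[2] & 0x3F) << 6) | (s[3] & 0x3F);
        return 4;
    }
    return 0;
}

typedef enum { SCRIPT_OTHER, SCRIPT_LATIN, SCRIPT_KATAKANA } script_hint;

/* Return SCRIPT_KATAKANA if any code point lies in U+30A0..U+30FF,
 * otherwise SCRIPT_LATIN if an ASCII letter occurs, else SCRIPT_OTHER. */
script_hint guess_script(const char *str)
{
    const unsigned char *s = (const unsigned char *)str;
    script_hint hint = SCRIPT_OTHER;
    while (*s != 0) {
        uint32_t cp;
        size_t len = utf8_decode(s, &cp);
        if (len == 0) { s++; continue; }      /* skip a malformed byte */
        if (cp >= 0x30A0 && cp <= 0x30FF)
            return SCRIPT_KATAKANA;
        if ((cp >= 'A' && cp <= 'Z') || (cp >= 'a' && cp <= 'z'))
            hint = SCRIPT_LATIN;
        s += len;
    }
    return hint;
}
```

For a string like ゲルハルト・A・リッター this reports Katakana despite the embedded Latin A, which would suggest ja_JP.UTF-8. A result of `SCRIPT_LATIN`, on the other hand, says nothing about whether German, English or Turkish collation is wanted, which is exactly the clash mentioned above.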
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | AmigoJack |
