Proper Charset to work with Vietnamese Characters (that isn't Unicode) in PHP [duplicate]
I've searched around for a while and haven't yet found something that'll work for me. I am using a PHP form to submit data into SAP using the SAP DI API. I need to figure out which character set will actually allow me to store and work with Vietnamese characters.
UTF-8 seems to work for a lot of the characters, but ô becomes Ã´. More importantly, there are character limits, and UTF-8 breaks them. If I have a string of 30 characters, the API is told that it's more than 50. The same is true for storing in MySQL: if there's a varchar character limit, UTF-8 causes the string to go above it.
Unfortunately, when I search, UTF-8 seems to be the only thing people suggest for Vietnamese characters. If I don't encode the characters at all, they get stored as their HTML character codes. I've also tried ISO-8859-1 and converting into UCS-2 or UCS-4... I'm really at a loss. If anyone has experience working with Vietnamese characters, your help would be greatly appreciated.
UPDATE
It appears the issue may be with my WampServer on Windows. Here's a bit of code that is confusing me:
$str = 'VậTCôNG';
$str1 = utf8_encode($str);
if (mb_detect_encoding($str,"UTF-8",true) == true) {
    print_r('yes');
    if ($str1 == $str) {
        print_r('yes2');
    }
}
echo $str . $str1;
This prints "yes" but not "yes2", and $str . $str1 = "VậTCôNGVáºTCÃ´NG" in the browser.
I have my php.ini file with:
default_charset = "utf-8"
and my httpd.conf file with:
AddDefaultCharset UTF-8
and my php file I'm running has:
header("Content-type: text/html; charset=utf-8");
So I'm now wondering: if the original string was UTF-8, why wouldn't it equal a UTF-8 encoding of itself? And why is the UTF-8 encoding returning the wrong characters? Is something wrong in the WampServer configuration?
Solution 1:[1]
Ã´ is the "Mojibake" for ô. That is, you do have UTF-8, but something in the code mangled it.
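In the question's snippet the mangling most likely comes from utf8_encode(), which assumes its input is ISO-8859-1 and converts it to UTF-8. Applied to a string that is already UTF-8, it re-encodes every non-ASCII byte, which is why the result no longer equals the original. A minimal sketch of that effect, assuming the mbstring extension is available and using mb_convert_encoding() as a stand-in for utf8_encode() (deprecated as of PHP 8.2):

```php
<?php
// 'VậTCôNG', written with escapes so the result does not depend on how this file is saved.
$str = "V\u{1EAD}TC\u{F4}NG";

// Same conversion utf8_encode() performs: treat the input as ISO-8859-1, emit UTF-8.
$str1 = mb_convert_encoding($str, 'UTF-8', 'ISO-8859-1');

var_dump(mb_check_encoding($str, 'UTF-8')); // bool(true): the original is valid UTF-8
var_dump($str1 === $str);                   // bool(false): the bytes were encoded a second time
echo bin2hex($str), "\n";                   // 10 bytes
echo bin2hex($str1), "\n";                  // 15 bytes: each non-ASCII byte became two
```

If the string is already UTF-8, the simplest fix is usually to drop the extra encoding step rather than to compensate for it later.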
See "Trouble with UTF-8 characters; what I see is not what I stored" and search for Mojibake. It says to check these:
- The bytes to be stored need to be UTF-8-encoded. Fix this.
- The connection when INSERTing and SELECTing text needs to specify utf8 or utf8mb4. Fix this.
- The column needs to be declared CHARACTER SET utf8 (or utf8mb4). Fix this.
- HTML should start with <meta charset=UTF-8>.
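For the connection item in the checklist above, one way to pin the charset from PHP is to put it in the PDO DSN. This is only a sketch: the helper name utf8mb4_dsn() and the host, database, and credentials are hypothetical placeholders, not anything taken from the question.

```php
<?php
// Hypothetical helper: builds a PDO MySQL DSN that requests utf8mb4 for the connection.
function utf8mb4_dsn(string $host, string $dbname): string
{
    return sprintf('mysql:host=%s;dbname=%s;charset=utf8mb4', $host, $dbname);
}

// Usage with placeholder values (needs a reachable MySQL server):
// $pdo = new PDO(utf8mb4_dsn('localhost', 'mydb'), 'user', 'secret');
//
// With mysqli, the corresponding step is:
// $mysqli->set_charset('utf8mb4');

echo utf8mb4_dsn('localhost', 'mydb'), "\n";
```

The point is that the client declares its encoding once, at connection time, so INSERTs and SELECTs agree with what the column stores.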
It is possible to recover the data in the database, but it depends on details not yet provided.
http://mysql.rjweb.org/doc.php/charcoll#fixes_for_various_cases
Each accented Vietnamese character takes 2-3 bytes when encoded in UTF-8. It is unclear whether the "hard 50" is really a character limit or a byte limit.
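The difference between the two kinds of limit can be checked directly in PHP: strlen() counts bytes, while mb_strlen() counts characters. A small sketch, assuming the mbstring extension is loaded; fits_char_limit() is a hypothetical helper introduced only for illustration:

```php
<?php
// 'VậTCôNG', written with escapes so the result does not depend on the file encoding.
$str = "V\u{1EAD}TC\u{F4}NG";

echo mb_strlen($str, 'UTF-8'), "\n"; // 7  (characters)
echo strlen($str), "\n";             // 10 (bytes: ậ takes 3, ô takes 2)

// Hypothetical helper: validate against a limit expressed in characters, not bytes.
function fits_char_limit(string $s, int $limit): bool
{
    return mb_strlen($s, 'UTF-8') <= $limit;
}

var_dump(fits_char_limit($str, 7)); // bool(true)
```

If the receiving system counts bytes instead, the same comparison would have to use strlen().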
If you happen to have Mojibake's sibling "double encoding", then a Vietnamese character will take 4-6 bytes and feel like 2-3 characters. See "Test the data" in the first link.
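Looking at the hex of a stored value is a quick way to tell the two cases apart. The sketch below simulates double encoding in PHP (an assumption about how the data got damaged, made only for illustration) and prints the byte counts:

```php
<?php
// 'ô' (U+00F4) and 'ậ' (U+1EAD), written as escapes so the file encoding does not matter.
$samples = ["\u{F4}", "\u{1EAD}"];

foreach ($samples as $correct) {
    // Simulate double encoding: reinterpret the UTF-8 bytes as ISO-8859-1 and encode again.
    $double = mb_convert_encoding($correct, 'UTF-8', 'ISO-8859-1');

    printf(
        "%s: %d bytes (%s) -> double-encoded: %d bytes (%s), %d characters\n",
        $correct,
        strlen($correct),
        bin2hex($correct),
        strlen($double),
        bin2hex($double),
        mb_strlen($double, 'UTF-8')
    );
}
// ô: 2 bytes (c3b4) -> double-encoded: 4 bytes (c383c2b4), 2 characters
// ậ: 3 bytes (e1baad) -> double-encoded: 6 bytes (c3a1c2bac2ad), 3 characters
```

On the MySQL side, the same inspection can be done with HEX(col).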
An example of how to 'undo' Mojibake in MySQL:
CONVERT(BINARY(CONVERT('Váº­TCÃ´NG' USING latin1)) USING utf8mb4) --> 'VậTCôNG'
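The same round trip can be sketched in PHP for strings that have not reached the database yet. This assumes the damage is a single extra ISO-8859-1 to UTF-8 pass and that the mbstring extension is available. MySQL's latin1 is closer to cp1252 than to ISO-8859-1, so characters in the 0x80-0x9F range may need different handling.

```php
<?php
// The garbled form of 'VậTCôNG': each original UTF-8 byte was re-encoded as its own character.
$garbled = "V\u{E1}\u{BA}\u{AD}TC\u{C3}\u{B4}NG";

// Undo the extra pass: map each character back to the single byte it came from.
$fixed = mb_convert_encoding($garbled, 'ISO-8859-1', 'UTF-8');

// Only accept the result if the recovered bytes are valid UTF-8.
if (mb_check_encoding($fixed, 'UTF-8')) {
    echo $fixed, "\n";          // VậTCôNG
    echo bin2hex($fixed), "\n"; // 56e1baad5443c3b44e47
}
```

Running such a repair on text that was never garbled would damage it, so checking the hex of a few rows first is worthwhile.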
"Double encoding" is sort of like Mojibake twice. That is one side treats it as latin1, the other as UTF-8, but twice.
VậTCôNG, as UTF-8, is hex 56e1baad5443c3b44e47. If that hex is treated as character set cp850 or keybcs2, the string is Vß║¡TC├┤NG.
Solution 2:[2]
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Community |
| Solution 2 | r0xette |
