Proper Charset to work with Vietnamese Characters (that isn't Unicode) in PHP [duplicate]

I've searched around for a while and haven't yet found something that'll work for me. I am using a PHP form to submit data into SAP using the SAP DI API. I need to figure out which character set will actually allow me to store and work with Vietnamese characters.

UTF-8 seems to work for a lot of the characters, but ô becomes Ã´. More importantly, there are character limits, and UTF-8 breaks them. If I have a string of 30 characters, it tells the API that it's more than 50. The same is true for storing in MySQL: if there's a varchar character limit, UTF-8 causes the string to go above it.

Unfortunately, when I search, UTF-8 seems to be the only thing people suggest for Vietnamese characters. If I don't encode the characters at all, they get stored as their HTML character codes. I've also tried ISO-8859-1 and converting into UCS-2 or UCS-4... I'm really at a loss. If anyone has experience working with Vietnamese characters, your help would be greatly appreciated.

UPDATE

It appears the issue may be with my WampServer on Windows. Here's a bit of code that is confusing me:

$str = 'VậTCôNG';
$str1 = utf8_encode($str);
if (mb_detect_encoding($str,"UTF-8",true) == true) {
    print_r('yes');
    if ($str1 == $str) {
        print_r('yes2');
    }
}
echo $str . $str1;

This prints "yes" but not "yes2", and $str . $str1 shows as "VậTCôNGVáº­TCÃ´NG" in the browser.

I have my php.ini file with:

default_charset = "utf-8"

and my httpd.conf file with:

AddDefaultCharset UTF-8

and my php file I'm running has:

header("Content-type: text/html; charset=utf-8");

So I'm now wondering: if the original string was UTF-8, why wouldn't it equal a UTF-8 encoding of itself? And why is the UTF-8 encoding returning wrong characters? Is something wrong in the WampServer configuration?



Solution 1:[1]

Ã´ is the "Mojibake" for ô. That is, you do have UTF-8, but something in the code mangled it.
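The utf8_encode() call in the question is one way this kind of mangling can happen: it assumes its input is ISO-8859-1 and re-encodes every byte, so feeding it text that is already UTF-8 produces Mojibake. A minimal sketch, using mb_convert_encoding() with an explicit source encoding as a stand-in for utf8_encode() (which is deprecated as of PHP 8.2); the variable names are illustrative only:

```php
<?php
// "VậTCôNG" as raw UTF-8 bytes, built from hex so the example does not
// depend on the encoding of this source file.
$utf8 = hex2bin('56e1baad5443c3b44e47');

// Same effect as utf8_encode($utf8): treat every byte as an ISO-8859-1
// character and encode it as UTF-8 a second time.
$double = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

var_dump(mb_check_encoding($utf8, 'UTF-8')); // the original is valid UTF-8
var_dump($double === $utf8);                 // the re-encoded copy is a different string
echo bin2hex($double), PHP_EOL;              // byte sequence of the Mojibake form
```

This would explain why the question's code prints "yes" but not "yes2": the original string is fine, and only the re-encoded copy is broken.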

See "Trouble with utf8 characters; what I see is not what I stored" and search it for Mojibake. It says to check these:

  • The bytes to be stored need to be UTF-8-encoded. Fix this.
  • The connection when INSERTing and SELECTing text needs to specify utf8 or utf8mb4. Fix this.
  • The column needs to be declared CHARACTER SET utf8 (or utf8mb4). Fix this.
  • HTML should start with <meta charset=UTF-8>.
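For the connection item in this checklist, a PHP client can request the character set when it connects. A minimal sketch assuming PDO with the MySQL driver; the helper name utf8mb4Dsn, the host, the database name, and the credentials are all placeholders, not values from the question:

```php
<?php
// Hypothetical helper: build a PDO MySQL DSN that requests utf8mb4
// for the connection.
function utf8mb4Dsn(string $host, string $dbname): string
{
    return sprintf('mysql:host=%s;dbname=%s;charset=utf8mb4', $host, $dbname);
}

$dsn = utf8mb4Dsn('localhost', 'example_db');
echo $dsn, PHP_EOL;

// Usage against a real server (placeholder credentials):
// $pdo = new PDO($dsn, 'example_user', 'example_password');
// With mysqli, the equivalent call is $mysqli->set_charset('utf8mb4');
```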

It is possible to recover the data in the database, but it depends on details not yet provided.

http://mysql.rjweb.org/doc.php/charcoll#fixes_for_various_cases

Each accented Vietnamese character takes 2-3 bytes when encoded in UTF-8. It is unclear whether the "hard 50" is really a character limit or a byte limit.
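The difference between the two kinds of limit can be checked directly in PHP: strlen() counts bytes, while mb_strlen() counts characters. A small sketch using the string from the question, built from its hex bytes so the result does not depend on how the source file is saved:

```php
<?php
$str = hex2bin('56e1baad5443c3b44e47'); // "VậTCôNG" in UTF-8

$byteCount = strlen($str);             // counts bytes
$charCount = mb_strlen($str, 'UTF-8'); // counts characters

echo $byteCount, PHP_EOL; // 10
echo $charCount, PHP_EOL; // 7
```

If the API limit is measured in bytes, the first number is the one to compare against it.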

If you happen to have Mojibake's sibling "double encoding", then a Vietnamese character will take 4-6 bytes and feel like 2-3 characters. See "Test the data" in the first link.

An example of how to 'undo' Mojibake in MySQL: CONVERT(BINARY(CONVERT('Váº­TCÃ´NG' USING latin1)) USING utf8mb4) --> 'VậTCôNG'
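The same reversal can be sketched in PHP, for cases where the text is still in application code rather than in the database. This assumes the wrong character set was ISO-8859-1. MySQL's latin1 is closer to Windows-1252, so text whose UTF-8 bytes fall in the range 0x80-0x9F may need 'Windows-1252' as the target instead. Variable names are illustrative:

```php
<?php
// Mojibake form of "VậTCôNG": UTF-8 bytes that were misread as latin1
// and then encoded as UTF-8 a second time.
$mojibake = hex2bin('56c3a1c2bac2ad5443c383c2b44e47');

// Map each character back to a single ISO-8859-1 byte; those bytes
// are the original UTF-8 sequence.
$fixed = mb_convert_encoding($mojibake, 'ISO-8859-1', 'UTF-8');

echo bin2hex($fixed), PHP_EOL;                // 56e1baad5443c3b44e47
var_dump(mb_check_encoding($fixed, 'UTF-8')); // the result is valid UTF-8 again
```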

"Double encoding" is sort of like Mojibake applied twice. That is, one side treats the text as latin1 and the other as UTF-8, and the mismatch happens two times over.

VậTCôNG, as UTF-8, is hex 56e1baad5443c3b44e47. If that hex is treated as character set cp850 or keybcs2, the string is Vß║¡TC├┤NG.
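The cp850 reading can be reproduced in PHP by decoding the same bytes with a different source encoding. A sketch assuming the mbstring extension is built with CP850 support (it is in the list of supported encodings in current PHP versions); keybcs2 is not covered here:

```php
<?php
$bytes = hex2bin('56e1baad5443c3b44e47');

// Read as UTF-8, the 10 bytes form 7 characters.
$utf8Chars = mb_strlen($bytes, 'UTF-8');

// Read as CP850, every byte is its own character, 10 in total.
$asCp850 = mb_convert_encoding($bytes, 'UTF-8', 'CP850');
$cp850Chars = mb_strlen($asCp850, 'UTF-8');

echo $utf8Chars, ' vs ', $cp850Chars, PHP_EOL;
echo $asCp850, PHP_EOL;
```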

Solution 2:[2]

Change it to VISCII.

Input: ô 
Output: ô

You can test it at Charset converter.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Community
Solution 2 r0xette