'mysqldump exporting data in a bad character set

Yesterday for the first time I exported my Mysql database and I found some very strange characters in the dump such as:

INSERT INTO `piwik_archive_blob_2013_01` VALUES (15,'Actions_actions_url_6',1,'2013-01-17','2013-01-17',1,'2013-01-20 07:36:53','xuNM0ý/œ#&ÝÕ³\ZõNYpÊÀì#!üw7Hж}°ÀAáZoN*šgµ\'GWª[Yûðe¯57 ÃÁÆ7|Ÿ\'Ü%µDh©-EÛ^ËL±ÕÞtªk@(,b±ßZ.ÒÃ6b²aiÓÍ)87[­ïΜ,æya¥uÒ<|+íª7MNuïÝ¿8ñ%1Ʊ>Ú­X');

The version of my server MySQL is: 5.1.66-0+squeeze1 (Debian). This database was created automatically by the Piwik setup script.

Here is what I tried to fix this problem:

#1 First I checked the database charset.

> show table status;

The 26 tables has the collation utf8_general_ci which sounded quite normal. I guessed mysqldump exported in a different charset (latin1?) So I tried:

mysqldump -u user -p**** --all-databases --default-character-set=utf8 | gzip -9 > dump.sql.gz

Result = I still had the same strange characters.

Note) later I learned that the default charset for mysqldump is utf8, regardless of server default charset. So --default-character-set=utf8 was useless.

#2 Then I thought I could solve the problem by updating the mysql conf. The original conf was:

mysql> show variables like "%character%";show variables like "%collation%";

+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | latin1                     |
| character_set_connection | latin1                     |
| character_set_database   | latin1                     |
| character_set_filesystem | binary                     |
| character_set_results    | latin1                     |
| character_set_server     | latin1                     |
| character_set_system     | utf8                       |
| character_sets_dir       | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
+----------------------+-------------------+
| Variable_name        | Value             |
+----------------------+-------------------+
| collation_connection | latin1_swedish_ci |
| collation_database   | latin1_swedish_ci |
| collation_server     | latin1_swedish_ci |
+----------------------+-------------------+

So I updated /var/lib/mysql/my.cnf and I added:

[mysqld]
init_connect='SET collation_connection = utf8_general_ci'
init_connect='SET NAMES utf8'
character-set-server=utf8
collation-server=utf8_general_ci
default-character-set=utf8
default-collation=utf8_general_ci

[mysqldump]
default-character-set=utf8

Then

/etc/init.d/mysql restart
mysql> show variables like "%character%";show variables like "%collation%";

+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | utf8                       |
| character_set_connection | utf8                       |
| character_set_database   | latin1                     |
| character_set_filesystem | binary                     |
| character_set_results    | utf8                       |
| character_set_server     | utf8                       |
| character_set_system     | utf8                       |
| character_sets_dir       | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
+----------------------+-------------------+
| Variable_name        | Value             |
+----------------------+-------------------+
| collation_connection | utf8_general_ci   |
| collation_database   | latin1_swedish_ci |
| collation_server     | utf8_general_ci   |
+----------------------+-------------------+

Result = same strange characters.

#3 I changed character_set_database and collation_database:

mysql> ALTER DATABASE piwik default character SET utf8 collate utf8_general_ci;

mysql> show variables like "%character%";show variables like "%collation%";

+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | utf8                       |
| character_set_connection | utf8                       |
| character_set_database   | utf8                       |
| character_set_filesystem | binary                     |
| character_set_results    | utf8                       |
| character_set_server     | utf8                       |
| character_set_system     | utf8                       |
| character_sets_dir       | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
+----------------------+-----------------+
| Variable_name        | Value           |
+----------------------+-----------------+
| collation_connection | utf8_general_ci |
| collation_database   | utf8_general_ci |
| collation_server     | utf8_general_ci |
+----------------------+-----------------+

Result = same strange characters.

#4 I understand now I should have changed the default character set in MySQL (latin1) to utf8 before creating the database.

The collation utf8_general_ci (#1) means data is stored in utf8. However, would it be possible that mysqldump thinks data is stored in latin1 and encodes data in utf8? It would mean at the end data is double utf8 encoded (sigh). In this case how could I fix the problem?

Thanks for your help.

ps) I wonder why Piwik doesn't require to change the database default charset to utf8.



Solution 1:[1]

According to the table name "piwik_archive_blob_2013_01", I guess the column containing the strange characters is of type BLOB.

BLOB columns contains binary data. That's why it contains these strange characters. This is expected.

Don't worry about it, I'm pretty sure MysqlDump knows how to dump this data.

Cheers, Eric.

Solution 2:[2]

It could be the operating system is changing the charset during the export and ignoring default-character-set parameter.

To make sure that the export is not using operating system charset use the parameter result-file

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Eric Citaire
Solution 2 jdrake