16 07 | 2012

MySQL save/restore: an example of what not to do

Written by Tanguy

Classified in : Homepage, Debian, Command line, Grumble

Database icon

I had to play a bit with MySQL, saving and restoring databases with mysqldump and mysql. These two tools have one severe discrepancy, which I would like to comment here because I think it is an excellent example of what not to do when designing tools to be used together.

Save

mysqldump(1) is a database backup tool, that takes a MySQL database and prints the necessary SQL statements to restore it to a new server. It has one interesting option:

--default-character-set=charset_name

Use charset_name as the default character set. See Section 10.5, “Character Set Configuration”. If no character set is specified, mysqldump uses utf8.

As you can see, it is quite modern in its default choice: if you do not specify which encoding you want to encode your database content into, it uses UTF-8. This way, it can save any database content, whatever characters it may contain.

A more sensible choice may have been to use the LC_CTYPE environment variable.

Restore

mysql(1) is the command-line MySQL client, which can is used to restore such dumps. It has a similar option:

--default-character-set=charset_name

Use charset_name as the default character set for the client and connection.

A common issue that can occur when the operating system uses utf8 or another multi-byte character set is that output from the mysql client is formatted incorrectly, due to the fact that the MySQL client uses the latin1 character set by default. You can usually fix such issues by using this option to force the client to use the system character set instead.

As you can see, this utility is still has an antique design: if you do not specify a character set, it assumes you are giving it data encoded in latin-1.

Discrepancy

You can see the obvious discrepancy here: mysqldump encodes your database backups in UTF-8 by default, and mysql restores them as if they were encoded in latin-1. This way, when you have to restore your databases, if you thought you could do it without wondering, congratulations: voilà, you have just fucked your databases up! And as an additional present, chances are high that this mistake is discreet enoughto be detected only after a while, when modifications have been done to the restored database and the original backups have been deleted.

So, if you are dealing with MySQL databases, remember that you must take care of the encoding of your backups, because MySQL utilities will not by itself, or rather it will do just what should not be done. And if you are designing similar tools, please think, and if you want an example of what not to do, here is one!

6 comments

tuesday 17 july 2012 à 07:20 Marcus said : #1

Unless of course you are seriously dealing with mysql, then you just take binary backups and not have this problem.

tuesday 17 july 2012 à 10:27 Nicolas said : #2

« As you can see, it is quite modern in its default choice: if you do not specify which encoding you want to encode your database content into, it uses UTF-8. This way, it can save any database content, whatever characters it may contain. »

afaik, there's a BIG trap here. According to the mysql documentation the "mysql utf8" is a only a 3bytes subset of the 4bytes "ietf utf8" set : it can only encode BMP values (thoses encodable on 3 bytes), if you want to save valid unicode values from SMP (thoses who need 4 bytes ; like "Mathematical Alphanumeric Symbols" set or the newly added "Emoticons" set) you should use the "mysql utf8mb4".

* https://dev.mysql.com/doc/refman/5.6/en/charset-unicode-utf8mb4.html
* https://tools.ietf.org/html/rfc3629#section-3
* https://en.wikipedia.org/wiki/Plane_(Unicode)

Talk about obvious discrepancy :p

tuesday 17 july 2012 à 13:36 mirabilos said : #3

LOL @ article

ROTFLMAO @ comment from Nicolas

Anyway, please do not write “dealing with MySQL databases”, it’s not a database.

tuesday 17 july 2012 à 13:54 Tanguy said : #4

@Nicolas : I was still thinking MySQL did not support the astral planes above the BMP at all, thanks you for the information.

@mirabilos : You would not have anything constructive to say at all, would you?

wednesday 18 july 2012 à 14:21 mirabilos said : #5

@Tanguy: funnily, you tend to choose topics I can’t help but “troll” on (although this is not strictly trolling; my utterances usually have some amount of truth behind them).

Sorry about that.

I really respect that you write a lot about technical stuff, but most of the time I find something (usually minor, which is why I rarely comment on them) wrong with the posts, technically, that bugs me. I have made the experience that half-knowledge can hurt worse than being unknowledgeable.

Really, sorry. Nothing personal, and even nothing against your posts. This says more about me than about you.

wednesday 18 july 2012 à 14:58 Tanguy said : #6

@mirabilos : No problem, just expect that I make fun of your comments from time to time then. :-)

(and do not hesitate to correct me when I say anything wrong, of course)

Write a comment

What is the last letter of the word eeiwc? : 

Archives