MySQL utf8 vs. utf8mb4 – What's the difference?

jakobegger · on July 12, 2017

MySQL has caused so much pain with their decision to make latin1 the default encoding.

I've seen so many PHP apps that store UTF-8 encoded text in Latin1 fields, and it kind of works, until you correctly configure your client library, or try sorting the data, or check for equality...

The problem is so common, that the popular MySQL client app Sequel Pro even has an option to set the encoding to "UTF8 via Latin1", which tells the server to output Latin1 but then treats the result as UTF-8.

So once you learned this the hard way, you'll change the db encoding to utf8, and you'll be glad to have finally fixed the encoding mess, only to stumble across the fact from the original article that utf8 is not really utf8 in MySQL --- oh the horror!

VintageCool · on July 11, 2017

Use utf8mb4. Use utf8mb4_unicode_520_ci. Alter tables using

ALTER TABLE database_name.table_name CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_520_ci.

Verify that all of your columns are utf8mb4 with:

SELECT table_schema, table_name, character_set_name, collation_name FROM information_schema.columns WHERE table_schema NOT IN ('mysql','information_schema','sys','performance_schema') AND character_set_name is not NULL AND character_set_name != 'utf8mb4';

You may have to update your client libraries (like Perl-DBD) if you're still using something like CentOS 6.

mixologic · on July 11, 2017

If only they had named them utf8mb3 and utf8. Would have been a lot clearer that one was optimized, but not fully utf8.

blaisio · on July 11, 2017

The next problem you'll likely encounter after switching to utf8mb4 is the relatively tiny maximum index key length in InnoDB - even though the strings are variable length, for the utf8 4 encoding MySQL will multiply the length of the string by 4 when checking against this limit.

If your data model layer supports it, it's often better to just use raw binary instead of any particular encoding. The main features utf8mb4 provides are validation and sorting. Most people are just using an ORM and use Unicode internally in an application, so validation of the encoding is nice, but not essential. And for most applications it doesn't make sense to sort arbitrary utf8 strings in the database layer. (Sorting Unicode strings in general is very complicated, and 99% of the time it's not something you really need.)

thephyber · on July 11, 2017

It seems insane that the top Google SERP result for "mysql utf8"[1] mentions this, but not in the top answer, rather in the top comment to the top answer -- which describes why the top answer is wrong.

Because of the way MySQL rolled out `utf8` (a strict subset of UTF8) then `utf8mb4` (which is a full UTF8 implementation), the other top result[2] is similarly poisoned where the directions describe using `utf8` and have an addendum describing `utf8mb4` (which isn't hard to miss).

[1] https://stackoverflow.com/questions/202205/how-to-make-mysql...

[2] https://www.toptal.com/php/a-utf-8-primer-for-php-and-mysql

petraeus · on July 12, 2017

SELECT CONCAT ( 'ALTER TABLE ', t1.TABLE_SCHEMA, '.', t1.table_name, ' MODIFY ', t1.column_name, ' ', t1.data_type, '(' , CHARACTER_MAXIMUM_LENGTH, ')', ' CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;' ) FROM information_schema.columns t1

WHERE t1.TABLE_SCHEMA like 'omtest' AND t1.COLLATION_NAME IS NOT NULL AND t1.COLLATION_NAME NOT IN ('utf8mb4_unicode_ci')

AND table_name = 'emails';

cntlzw · on July 12, 2017

Never heard of this. Thanks for the article. So many pitfalls working with MySQL.

mdani · on July 11, 2017

utf8 to utf8mb4 also breaks replication so be extra careful if you have slaves replicating. You'd need to switch off row or mixed replication and switch to statement based till changes are done on all nodes.