MySQL has caused so much pain with their decision to make latin1 the default encoding.
I've seen so many PHP apps that store UTF-8 encoded text in Latin1 fields, and it kind of works, until you correctly configure your client library, or try sorting the data, or check for equality...
The problem is so common, that the popular MySQL client app Sequel Pro even has an option to set the encoding to "UTF8 via Latin1", which tells the server to output Latin1 but then treats the result as UTF-8.
So once you learned this the hard way, you'll change the db encoding to utf8, and you'll be glad to have finally fixed the encoding mess, only to stumble across the fact from the original article that utf8 is not really utf8 in MySQL --- oh the horror!
Use utf8mb4. Use utf8mb4_unicode_520_ci. Alter tables using
ALTER TABLE database_name.table_name CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_520_ci.
Verify that all of your columns are utf8mb4 with:
SELECT table_schema, table_name, character_set_name, collation_name
FROM information_schema.columns
WHERE table_schema NOT IN ('mysql','information_schema','sys','performance_schema')
AND character_set_name is not NULL
AND character_set_name != 'utf8mb4';
You may have to update your client libraries (like Perl-DBD) if you're still using something like CentOS 6.
The next problem you'll likely encounter after switching to utf8mb4 is the relatively tiny maximum index key length in InnoDB - even though the strings are variable length, for the utf8 4 encoding MySQL will multiply the length of the string by 4 when checking against this limit.
If your data model layer supports it, it's often better to just use raw binary instead of any particular encoding. The main features utf8mb4 provides are validation and sorting. Most people are just using an ORM and use Unicode internally in an application, so validation of the encoding is nice, but not essential. And for most applications it doesn't make sense to sort arbitrary utf8 strings in the database layer. (Sorting Unicode strings in general is very complicated, and 99% of the time it's not something you really need.)
It seems insane that the top Google SERP result for "mysql utf8"[1] mentions this, but not in the top answer, rather in the top comment to the top answer -- which describes why the top answer is wrong.
Because of the way MySQL rolled out `utf8` (a strict subset of UTF8) then `utf8mb4` (which is a full UTF8 implementation), the other top result[2] is similarly poisoned where the directions describe using `utf8` and have an addendum describing `utf8mb4` (which isn't hard to miss).
utf8 to utf8mb4 also breaks replication so be extra careful if you have slaves replicating. You'd need to switch off row or mixed replication and switch to statement based till changes are done on all nodes.
I've seen so many PHP apps that store UTF-8 encoded text in Latin1 fields, and it kind of works, until you correctly configure your client library, or try sorting the data, or check for equality...
The problem is so common, that the popular MySQL client app Sequel Pro even has an option to set the encoding to "UTF8 via Latin1", which tells the server to output Latin1 but then treats the result as UTF-8.
So once you learned this the hard way, you'll change the db encoding to utf8, and you'll be glad to have finally fixed the encoding mess, only to stumble across the fact from the original article that utf8 is not really utf8 in MySQL --- oh the horror!