Character Sets

Character sets are a complex and often confusing topic, and this page tries to shed a little light on it. First and foremost: in PHP we use UTF-8, a multibyte encoding, exclusively. To learn about the internals of UTF-8, see the Wikipedia article on UTF-8.


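MySQL's utf8mb4 character set is fully compatible with UTF-8 as used in PHP. MySQL's older utf8 character set is compatible as well, as long as no character needs more than three bytes.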

But do not worry: the whole topic of character sets, collations and so on is only half as complicated as it looks at first glance.
Knowledge of the character sets in the database is only really needed if you want to design new tables yourself.
Much more important for programmers is the handling of the multibyte string functions in PHP. Using the wrong functions can cause serious bugs if you have not understood the background of the character sets.

UTF-8 and PHP

PHP has no concept of a 'collation' or of 'utf8mb4'. So forget these as far as PHP is concerned! UTF-8 in PHP corresponds exactly to the definition in RFC 3629 / ISO/IEC 10646-1:2000 Annex D, and a character can consist of one to four bytes. UTF-8 (like the Latin character sets) encodes the 128 characters of the original ASCII table exactly as ASCII does.

Since a character encoded in UTF-8 can be 1, 2, 3, or even 4 bytes in size, many of the 'old' string functions, which work on bytes rather than characters, no longer give correct results!
Please use the corresponding multibyte functions instead, e.g. mb_strlen() instead of strlen().

Examples:

Char  Codepoint  UTF-8 (binary)                        Name
a     U+0061     01100001                              LATIN SMALL LETTER A
á     U+00E1     11000011 10100001                     LATIN SMALL LETTER A WITH ACUTE
ä     U+00E4     11000011 10100100                     LATIN SMALL LETTER A WITH DIAERESIS
🤩    U+1F929    11110000 10011111 10100100 10101001   GRINNING FACE WITH STAR EYES

<PHP>
// strlen() counts bytes, mb_strlen() counts characters
echo strlen('a');     // ⇒ 1
echo mb_strlen('a');  // ⇒ 1
echo strlen('á');     // ⇒ 2
echo mb_strlen('á');  // ⇒ 1
echo strlen('ä');     // ⇒ 2
echo mb_strlen('ä');  // ⇒ 1
echo strlen('🤩');    // ⇒ 4
echo mb_strlen('🤩'); // ⇒ 1
</PHP>
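The same byte-versus-character difference affects cutting and case conversion, not only length. The following is a minimal sketch, assuming the mbstring extension is loaded and the file is saved as UTF-8; the sample string and variable names are made up for illustration.

```php
<?php
// Assumption: mbstring is available and this source file is UTF-8 encoded.
$name = 'Käse 🤩';

// Byte-oriented functions count and cut bytes, not characters.
$byteLength = strlen($name);              // K(1) ä(2) s(1) e(1) space(1) 🤩(4)
$charLength = mb_strlen($name, 'UTF-8');  // six characters

// substr() may cut a multibyte character in half and leave invalid UTF-8.
$brokenCut = substr($name, 0, 2);              // "K" plus the first byte of "ä"
$cleanCut  = mb_substr($name, 0, 2, 'UTF-8');  // "Kä"

// mb_check_encoding() reports whether a string is valid in the given encoding.
$brokenIsValid = mb_check_encoding($brokenCut, 'UTF-8');
$cleanIsValid  = mb_check_encoding($cleanCut, 'UTF-8');

// Case conversion that also handles non-ASCII letters such as "ä".
$upper = mb_strtoupper('käse', 'UTF-8');
```

Passing the encoding explicitly, as done here, avoids depending on the configured default encoding.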

UTF-8 and MySQL

Unlike PHP, MySQL's character set named utf8 does not fully implement UTF-8 according to RFC 3629. In MySQL, a utf8 character can be at most three bytes long. As a result, a utf8 column cannot store characters that need four bytes in UTF-8. To solve this problem, the implementation of the utf8 character set was not changed; instead, the additional character set utf8mb4 was introduced later. It can store every UTF-8 character, up to the full four bytes. Now please do not think that these two character sets are otherwise the same!
There is a difference that can become very important, especially for larger databases.
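Whether a given text would fit into a three-byte character set can be checked in PHP before it reaches the database. This is only a sketch: the function name `containsFourByteChars` is hypothetical, and it relies on the fact that exactly the code points from U+10000 upward need four bytes in UTF-8.

```php
<?php
/**
 * Hypothetical helper: returns true if $text contains a character that needs
 * four bytes in UTF-8 (code points U+10000 to U+10FFFF) and therefore does
 * not fit into a character set limited to three bytes per character.
 */
function containsFourByteChars(string $text): bool
{
    // The "u" modifier makes PCRE treat pattern and subject as UTF-8.
    return preg_match('/[\x{10000}-\x{10FFFF}]/u', $text) === 1;
}
```

For example, `containsFourByteChars('Hello 🤩')` reports a four-byte character, while a text consisting of letters such as ä or the € sign does not.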

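The small but important difference is the amount of space MySQL has to plan for. In variable-length columns (VARCHAR, TEXT) both character sets store each character with only as many bytes as it actually needs, so plain English text takes up the same space in utf8 and in utf8mb4. However, wherever MySQL has to reserve room for the worst case, it calculates with three bytes per character for utf8 and with four bytes per character for utf8mb4. This applies, for example, to the maximum length of index keys and to tables of the MEMORY engine. The larger reservation influences how many characters fit into an index, the size of the index tables and with that also the access time.

Example of memory consumption

character set    text bytes stored (VARCHAR/TEXT)    bytes reserved in the worst case
“This is a little text in english, containing 54 signs.”
utf8:            54 bytes                            162 bytes
utf8mb4:         54 bytes                            216 bytes
“This is a big text in english, containing 16.000 signs.”
utf8:            16,000 bytes                        48,000 bytes
utf8mb4:         16,000 bytes                        64,000 bytes

Now imagine many indexed text columns in many large tables: the reserved space adds up quickly.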
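The numbers for the short example text can be reproduced in PHP. In this sketch, strlen() gives the bytes the UTF-8 encoded text really occupies, and multiplying the character count by 3 or 4 gives the upper bound per character set. The function name `utf8Footprint` and the array keys are hypothetical.

```php
<?php
/**
 * Hypothetical helper: compares the bytes a UTF-8 string really occupies
 * with the upper bound of 3 or 4 bytes per character.
 */
function utf8Footprint(string $text): array
{
    $chars = mb_strlen($text, 'UTF-8');

    return [
        'characters'        => $chars,
        'actual_bytes'      => strlen($text), // bytes of the encoded string
        'max_bytes_utf8'    => $chars * 3,    // upper bound, 3 bytes per character
        'max_bytes_utf8mb4' => $chars * 4,    // upper bound, 4 bytes per character
    ];
}

$result = utf8Footprint('This is a little text in english, containing 54 signs.');
```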

UTF-8 and HTML

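Today, all modern browsers are able to render UTF-8 encoded characters. Therefore, there is usually no longer any need to output special characters as HTML entities. Only the characters that have a special meaning in HTML itself (<, >, & and quotation marks) still have to be escaped.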

In any case, it helps the browser if at least one of the following meta tags is present in the HEAD section of the HTML document.

<PHP>
<meta http-equiv="content-type" content="text/html; charset=utf-8">
<meta charset="UTF-8">
</PHP>
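On the PHP side, this could look like the following sketch. It is an illustration, not part of the original page: the helper name `e` is hypothetical, and sending the charset in the HTTP Content-Type header in addition to the meta tag is an assumption about how the page is delivered.

```php
<?php
// Hypothetical helper: escapes only the characters that are special in HTML
// and leaves all other UTF-8 characters (umlauts, emoji, ...) untouched.
function e(string $text): string
{
    return htmlspecialchars($text, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');
}

// Announce the charset in the HTTP header too; this only works before output.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=utf-8');
}

$escaped = e('Käse & <b>Brot</b> 🤩');
```

Here `htmlspecialchars()` is preferred over `htmlentities()` because the latter would also convert characters like ä into entities, which is unnecessary in a UTF-8 page.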

Collations


Collations do not exist in PHP. They are purely properties of text fields in database tables.

A collation has nothing to do with how text is stored. It only determines the rules by which texts are compared and sorted.

Now, if we look at the selection list of possible collations, we find a long list of options for each character set. On the one hand, there are collations that are optimized for one particular language; on the other hand, there are collations like _unicode_ci and _general_ci, which work across languages.
The difference between _unicode_ci and _general_ci is mainly one of speed versus accuracy. Sorting with _general_ci is faster, but somewhat less accurate than with _unicode_ci. Since our databases are relatively small (10,000 entries in a table are very few in the world of databases) and because of multilingualism, we usually use the _unicode_ci collation of the utf8mb4 character set. But there is no rule without an exception: text fields that contain only 7-bit characters (ASCII), such as `passwordhash`, `rememberkey` and the like, are defined with the character set 'ascii' and the collation 'ascii_general_ci'.
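Applied to a table definition, the rule above might look like the following sketch. The table name `example_users`, the column `display_name` and all column sizes are made up for illustration; only `passwordhash` and `rememberkey` are taken from the text. The statement is held in a PHP string so that it can be passed to whatever database layer is in use.

```php
<?php
// Hypothetical table: general text uses utf8mb4 with a _unicode_ci collation,
// columns that only ever hold 7-bit ASCII use the ascii character set.
$sql = <<<'SQL'
CREATE TABLE `example_users` (
    `id`           INT UNSIGNED NOT NULL AUTO_INCREMENT,
    `display_name` VARCHAR(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci NOT NULL,
    `passwordhash` VARCHAR(255) CHARACTER SET ascii COLLATE ascii_general_ci NOT NULL,
    `rememberkey`  VARCHAR(64)  CHARACTER SET ascii COLLATE ascii_general_ci NOT NULL,
    PRIMARY KEY (`id`)
) DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;
SQL;
```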

en/dev/284/charsets.txt · Last modified: 15.11.2018 12:22 by Manuela v.d.Decken