====== Character Sets ======
Character sets are a complex and often confusing topic; this page tries to shed some light on it.
First and foremost: in **PHP** we exclusively use the multibyte encoding **UTF-8**.
To learn about the internals of UTF-8, see **[[https://en.wikipedia.org/wiki/UTF-8|Wikipedia]]**.
\\ Text stored in either of MySQL's character sets UTF-8 and UTF-8MB4 is 100% compatible with UTF-8 in PHP.
Do not worry: the whole topic of character sets, collations and so on is only half as complicated as it looks at first glance.\\
Knowledge of the character sets in the database is only really needed if you want to design new tables yourself.\\
Much more important for programmers is the correct use of the multibyte string functions in PHP. Without an understanding of how the character sets work, serious mistakes are easy to make.
===== UTF-8 and PHP =====
PHP's string functions have no concept of a 'collation' or of 'UTF-8MB4'. Both belong to the database, so forget about them as far as PHP is concerned!
UTF-8 in PHP exactly matches the definition in **[[https://tools.ietf.org/html/rfc3629|RFC 3629]]** / **ISO/IEC 10646-1:2000 Annex D** and uses one to four bytes per character. UTF-8 (like the ISO 8859 "Latin" character sets) is bit-compatible with the 128 characters of the original ASCII character table.
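The one-to-four-byte rule can be read directly from the first byte of a character. The following is only a minimal sketch; **utf8SequenceLength()** is a hypothetical helper written for this page, not a PHP function, and it inspects the lead byte only without validating the continuation bytes.

<code php>
<?php
// Hypothetical helper (not part of PHP): number of bytes in a UTF-8
// sequence, derived from the bit pattern of its first byte (RFC 3629).
function utf8SequenceLength(string $char): int
{
    if ($char === '') {
        throw new InvalidArgumentException('Empty string');
    }
    $lead = ord($char[0]);
    if ($lead < 0x80) {
        return 1;                      // 0xxxxxxx: identical to ASCII
    }
    if (($lead & 0xE0) === 0xC0) {
        return 2;                      // 110xxxxx
    }
    if (($lead & 0xF0) === 0xE0) {
        return 3;                      // 1110xxxx
    }
    if (($lead & 0xF8) === 0xF0) {
        return 4;                      // 11110xxx
    }
    throw new InvalidArgumentException('Not a valid UTF-8 lead byte');
}

echo utf8SequenceLength('a'), "\n";   // 1
echo utf8SequenceLength('á'), "\n";   // 2
echo utf8SequenceLength('🤩'), "\n";  // 4
</code>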
Since a character encoded in UTF-8 can be 1, 2, 3, or 4 bytes long, many of the 'old' byte-oriented string functions no longer work correctly on such text!\\
Please use the corresponding multibyte functions instead, e.g. **mb_strlen()** instead of **strlen()**.
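How the byte-oriented functions go wrong can be seen with a short word that contains one umlaut. This is just an illustrative sketch; it assumes that the mbstring extension is loaded and that the source file itself is saved as UTF-8.

<code php>
<?php
$word = 'Häuser';

// Byte-based functions count and cut bytes, not characters.
$bytes  = strlen($word);          // 7, because 'ä' occupies two bytes
$broken = substr($word, 0, 2);    // 'H' plus only the first byte of 'ä'

// The multibyte functions work on whole characters.
$chars  = mb_strlen($word, 'UTF-8');        // 6
$intact = mb_substr($word, 0, 2, 'UTF-8');  // 'Hä'

var_dump(mb_check_encoding($broken, 'UTF-8')); // bool(false): invalid UTF-8
var_dump(mb_check_encoding($intact, 'UTF-8')); // bool(true)
echo mb_strtoupper($word, 'UTF-8'), "\n";      // HÄUSER
</code>

A string cut in the middle of a character, like ''$broken'' here, is no longer valid UTF-8 and typically shows up as a replacement character in the browser.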
**Examples:**
^ Char ^ Codepoint ^ UTF-8 (bin) ^ Naming ^
| a | U+0061 | 01100001 | LATIN SMALL LETTER A |
| á | U+00E1 | 11000011 10100001 | LATIN SMALL LETTER A WITH ACUTE |
| ä | U+00E4 | 11000011 10100100 | LATIN SMALL LETTER A WITH DIAERESIS |
| 🤩 | U+1F929 | 11110000 10011111 10100100 10101001 | GRINNING FACE WITH STAR EYES |
<code php>
echo strlen('a');      // => 1
echo mb_strlen('a');   // => 1
echo strlen('á');      // => 2
echo mb_strlen('á');   // => 1
echo strlen('ä');      // => 2
echo mb_strlen('ä');   // => 1
echo strlen('🤩');     // => 4
echo mb_strlen('🤩');  // => 1
</code>
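The bit patterns in the table can be reproduced in PHP itself. A small sketch follows; **utf8Bits()** is a hypothetical helper name, and **mb_ord()** requires PHP 7.2 or later with the mbstring extension.

<code php>
<?php
// Hypothetical helper: render each byte of a string as 8 binary digits.
function utf8Bits(string $text): string
{
    $bits = [];
    foreach (str_split($text) as $byte) {   // str_split() works byte-wise
        $bits[] = sprintf('%08b', ord($byte));
    }
    return implode(' ', $bits);
}

echo utf8Bits('a'), "\n";   // 01100001
echo utf8Bits('á'), "\n";   // 11000011 10100001
echo utf8Bits('🤩'), "\n";  // 11110000 10011111 10100100 10101001

// The code point behind the bytes:
echo sprintf('U+%04X', mb_ord('á', 'UTF-8')), "\n";  // U+00E1
</code>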
===== UTF-8 and MySQL =====
Unlike PHP, MySQL's original UTF-8 character set does not fully implement UTF-8 according to RFC 3629.
In MySQL, a character in UTF-8 (''utf8'', also called ''utf8mb3'') occupies at most three bytes. As a result, this character set cannot store 4-byte UTF-8 characters such as emoji.
To solve this problem, the existing UTF-8 character set was left unchanged and a second character set, UTF-8MB4 (''utf8mb4''), was introduced later. It can store every character, including those that need 4 bytes.
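Whether a given text would fit into the 3-byte character set can be checked in PHP before it is written to the database. The following is a sketch under assumptions: **needsUtf8mb4()** is a hypothetical helper, and the host and database name in the DSN are placeholders. The ''charset'' option of the PDO MySQL DSN selects the character set of the connection.

<code php>
<?php
// Hypothetical helper: true if $text contains a character above U+FFFF,
// i.e. one that is encoded with 4 bytes in UTF-8.
// Note: for input that is not valid UTF-8, preg_match() returns false,
// so this function then returns false as well.
function needsUtf8mb4(string $text): bool
{
    return preg_match('/[\x{10000}-\x{10FFFF}]/u', $text) === 1;
}

var_dump(needsUtf8mb4('Häuser kosten 5 €'));  // bool(false): at most 3 bytes each
var_dump(needsUtf8mb4('Great 🤩'));           // bool(true): the emoji needs 4 bytes

// Placeholder DSN: the connection itself should also use utf8mb4.
$dsn = 'mysql:host=localhost;dbname=example;charset=utf8mb4';
</code>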
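Now please do not think that these two character sets behave identically!\\
There is a difference that can become important, especially for larger databases.
The small but subtle difference is the amount of space that MySQL has to reserve for them.
Both character sets are variable-length: a character stored in a ''VARCHAR'' or ''TEXT'' column occupies only as many bytes as its encoding needs, so plain English text takes one byte per character in either. Wherever MySQL has to plan for the worst case, however, it reserves 3 bytes per character for UTF-8 and 4 bytes per character for UTF-8MB4. This applies, for example, to ''MEMORY'' tables, which store rows with a fixed length, and to the byte limits on index keys: an index prefix limited to 767 bytes covers 255 characters in UTF-8 but only 191 characters in UTF-8MB4. The larger reservation can therefore affect memory consumption and how much of a column an index can cover.
**Example of memory consumption**
^ character set ^ bytes stored (''VARCHAR'') ^ bytes reserved (worst case) ^
|"This is a little text in english, containing 54 signs."|||
| UTF-8: | 54 bytes | 162 bytes |
| UTF-8MB4: | 54 bytes | 216 bytes |
|"This is a big text in english, containing 16,000 signs."|||
| UTF-8: | 16,000 bytes | 48,000 bytes |
| UTF-8MB4: | 16,000 bytes | 64,000 bytes |
In such worst-case contexts the difference adds up quickly if your texts are around 200 kB each and you have many of them.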
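The figures for a concrete text can be checked with a few lines of PHP. This is only a rough sketch: **storageEstimate()** is a hypothetical helper that measures the PHP string and multiplies the character count by 3 and 4 as upper bounds. It does not query MySQL and says nothing about what a particular storage engine actually allocates.

<code php>
<?php
// Hypothetical helper: actual encoded size of a text versus the upper
// bounds of 3 and 4 bytes per character.
function storageEstimate(string $text): array
{
    $chars = mb_strlen($text, 'UTF-8');
    return [
        'characters'    => $chars,
        'encoded_bytes' => strlen($text),  // bytes of the UTF-8 string itself
        'max_3_byte'    => $chars * 3,     // upper bound with 3 bytes per character
        'max_4_byte'    => $chars * 4,     // upper bound with 4 bytes per character
    ];
}

print_r(storageEstimate('This is a little text in english, containing 54 signs.'));
// characters => 54, encoded_bytes => 54, max_3_byte => 162, max_4_byte => 216
</code>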
===== UTF-8 and HTML =====
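Today, all modern browsers are able to render UTF-8 encoded characters. Therefore, there is usually no need any more to output special characters such as umlauts in HTML entity format. Only the characters that have a syntactic meaning in HTML (the angle brackets, the ampersand and the quotation marks) still have to be escaped.
In any case, it helps the browser if the encoding is declared by a meta tag in the HEAD section of the HTML document, for example ''%%<meta charset="utf-8">%%'' or ''%%<meta http-equiv="Content-Type" content="text/html; charset=utf-8">%%''.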
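From PHP, the encoding can be declared in the HTTP header and in the HEAD section at the same time. The following is a minimal sketch, not a template of this project; **renderPage()** is a hypothetical helper, and the page content is made up for illustration.

<code php>
<?php
// Hypothetical helper: build a minimal HTML page that declares UTF-8.
function renderPage(string $title, string $body): string
{
    // Escape only the characters with a special meaning in HTML;
    // umlauts and emoji stay as plain UTF-8.
    $t = htmlspecialchars($title, ENT_QUOTES, 'UTF-8');
    $b = htmlspecialchars($body, ENT_QUOTES, 'UTF-8');

    return "<!DOCTYPE html>\n"
         . "<html>\n<head>\n"
         . "  <meta charset=\"utf-8\">\n"
         . "  <title>{$t}</title>\n"
         . "</head>\n<body>\n  <p>{$b}</p>\n</body>\n</html>\n";
}

// The HTTP header should declare the same encoding as the meta tag
// and has to be sent before any output.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=UTF-8');
}
echo renderPage('Grüße & 🤩', 'Ä, ö, ß and 🤩 need no entities.');
</code>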
===== Collations =====
\\ PHP's string functions have no concept of collations. Collations are purely properties of text fields in database tables.
A collation has nothing to do with storage in the first place. It only determines the rules by which texts are compared and sorted.
If we look at the selection list of possible collations, we find a long list of options for each character set. On the one hand, there are collations that are optimized for one particular language; on the other hand, there are collations like **_unicode_ci** and **_general_ci**, which work across languages (the suffix **_ci** stands for case-insensitive).\\
The main difference between **_unicode_ci** and **_general_ci** is processing speed:
sorting with **_general_ci** is faster, but somewhat less accurate than with **_unicode_ci**.
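Why sorting is better left to the collation of the database becomes visible when the same words are sorted with PHP's plain **sort()**, which compares byte values. The sketch below is only an illustration; the lower-case comparison at the end is a rough approximation of a case-insensitive comparison, not a reimplementation of any MySQL collation.

<code php>
<?php
$words = ['apple', 'Zebra', 'Ärger', 'zebra'];

// sort() with SORT_STRING compares byte by byte: upper-case ASCII comes
// first, then lower-case ASCII, and every non-ASCII character comes last.
sort($words, SORT_STRING);
print_r($words);  // Zebra, apple, zebra, Ärger

// A case-insensitive collation treats these two as equal;
// in PHP that has to be done by hand.
$equal = mb_strtolower('ÄRGER', 'UTF-8') === mb_strtolower('ärger', 'UTF-8');
var_dump($equal);  // bool(true)
</code>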
Since our databases are relatively small (10,000 entries in a table are very few in the world of databases) and have to support multiple languages, we usually use the **_unicode_ci** collation of the **UTF-8MB4** character set.
But there is no rule without an exception: text fields that contain only 7-bit (ASCII) characters, such as `passwordhash`, `rememberkey` and the like, are defined with the character set '**ascii**' and the collation '**_general_ci**'.
===== Links =====
* [[https://en.wikipedia.org/wiki/UTF-8|Wikipedia UTF-8]]
* [[https://tools.ietf.org/html/rfc3629|RFC 3629]] / STD 63 (2003), which establishes UTF-8 as a standard Internet protocol element
* [[https://www.unicode.org/versions/Unicode11.0.0/|The Unicode Standard, Version 11.0]], §3.9 D92, §3.10 D95 (2018 June 5)
* [[http://www.iso.org/iso/home/store/catalogue_ics/catalogue_detail_ics.htm?csnumber=63182|ISO/IEC 10646:2014 §9.1]]
* [[http://doc.cat-v.org/plan_9/4th_edition/papers/utf|Original UTF-8 paper]] [[https://web.archive.org/web/20000917055036/http://plan9.bell-labs.com/sys/doc/utf.pdf|(or pdf)]] for [[https://en.wikipedia.org/wiki/Plan_9_from_Bell_Labs|Plan 9 from Bell Labs]]