Character sets

Character sets are a complex and often confusing topic, and this page tries to shed a little light on it. First and foremost: in PHP we use the multibyte encoding UTF-8 exclusively. The internals of UTF-8 are described in the Wikipedia article listed under Links.

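The mySQL character set UTF-8MB4 is fully compatible with UTF-8 as used in PHP. The mySQL character set UTF-8 is compatible only for characters of up to three bytes; the difference is explained in the section "UTF-8 and mySQL".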

But do not worry, the whole topic of character sets, collations and so on is only half as complicated as it looks at first glance.
Knowledge of the character sets in the database is only really needed if you want to design new tables yourself.
Much more important to programmers is the handling of multibyte strings in PHP. If you have not understood the background of character sets, this is where serious bugs come from.

UTF-8 and PHP

PHP has no notion of a 'collation' or of 'UTF-8MB4'. Forget these terms where PHP is concerned! UTF-8 in PHP matches the definition in RFC 3629 / ISO/IEC 10646-1:2000 Annex D and uses one to four bytes per character. UTF-8 (like the Latin character sets of the ISO 8859 family) is bit-compatible with the 128 characters of the original ASCII character table.

Since a character encoded in UTF-8 can occupy 1, 2, 3, or even 4 bytes, many of the 'old' byte-oriented string functions no longer work correctly on such text!
Please use the corresponding multibyte functions instead, e.g. mb_strlen() instead of strlen().

Examples:

Char	Codepoint	UTF-8 (bin)	Naming
a	U+0061	01100001	LATIN SMALL LETTER A
á	U+00E1	11000011 10100001	LATIN SMALL LETTER A WITH ACUTE
ä	U+00E4	11000011 10100100	LATIN SMALL LETTER A WITH DIAERESIS
🤩	U+1F929	11110000 10011111 10100100 10101001	GRINNING FACE WITH STAR EYES
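
The binary column of the table can be reproduced with a few lines of PHP. This is only a minimal sketch: the helper name utf8_bits() is made up for illustration, and it assumes that the source file is saved as UTF-8 and that the mbstring extension (PHP 7.2 or newer, for mb_ord()) is available.

```php
<?php
// Hypothetical helper: return the UTF-8 bytes of a string
// as space-separated groups of 8 bits.
function utf8_bits(string $char): string
{
    $groups = [];
    // str_split() works on bytes, which is exactly what we want here
    foreach (str_split($char) as $byte) {
        $groups[] = sprintf('%08b', ord($byte));
    }
    return implode(' ', $groups);
}

// Print character, code point and bit pattern for the table rows above
foreach (['a', 'á', 'ä', '🤩'] as $char) {
    printf("%s  U+%04X  %s\n", $char, mb_ord($char, 'UTF-8'), utf8_bits($char));
}
```

The number of 8-bit groups printed per row is the number of bytes that strlen() would report for that character.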

<PHP>
echo strlen('a');     // ⇒ 1
echo mb_strlen('a');  // ⇒ 1
echo strlen('á');     // ⇒ 2
echo mb_strlen('á');  // ⇒ 1
echo strlen('ä');     // ⇒ 2
echo mb_strlen('ä');  // ⇒ 1
echo strlen('🤩');    // ⇒ 4
echo mb_strlen('🤩'); // ⇒ 1
</PHP>
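
The same byte-versus-character difference shows up in other string functions as well. The following sketch uses the sample word 'Käse' (chosen only for illustration) and assumes a UTF-8 encoded source file and a loaded mbstring extension.

```php
<?php
$word = 'Käse';   // 4 characters, but 5 bytes in UTF-8

// Length: bytes versus characters
echo strlen($word), ' / ', mb_strlen($word, 'UTF-8'), "\n";          // 5 / 4

// Position: strpos() counts bytes, mb_strpos() counts characters
echo strpos($word, 's'), ' / ', mb_strpos($word, 's', 0, 'UTF-8'), "\n"; // 3 / 2

// Substring: substr() cuts the two-byte "ä" in half and leaves invalid UTF-8
echo bin2hex(substr($word, 0, 2)), "\n";                             // 4bc3
echo mb_substr($word, 0, 2, 'UTF-8'), "\n";                          // Kä

// Case conversion that also handles non-ASCII letters
echo mb_strtoupper($word, 'UTF-8'), "\n";                            // KÄSE
```

Passing the encoding explicitly is a design choice made here so that the result does not depend on the configured internal encoding.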

UTF-8 and mySQL

Unlike PHP, the mySQL character set UTF-8 (written utf8 in SQL) does not fully implement UTF-8 according to RFC 3629. In mySQL, a UTF-8 character can be at most three bytes long. As a result, this character set cannot store 4-byte UTF-8 characters such as 🤩. To solve this problem, the existing UTF-8 character set was left unchanged and the additional character set UTF-8MB4 (written utf8mb4 in SQL) was introduced, which can store all characters of up to 4 bytes. Please do not assume that the two character sets are otherwise the same!
There is a difference that can matter, especially for larger databases.
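
Whether a given text contains such 4-byte characters can be checked in PHP before it is written to the database. The helper below is a hypothetical sketch, not part of any existing API; it assumes valid UTF-8 input and returns false for invalid input, because preg_match() fails in that case.

```php
<?php
// Hypothetical helper: true if $text contains a code point above U+FFFF,
// i.e. a character that needs four bytes in UTF-8 and therefore
// does not fit into the three-byte mySQL character set UTF-8.
function needs_utf8mb4(string $text): bool
{
    // The /u modifier makes PCRE treat pattern and subject as UTF-8
    return preg_match('/[\x{10000}-\x{10FFFF}]/u', $text) === 1;
}

var_dump(needs_utf8mb4('Käse'));      // bool(false)
var_dump(needs_utf8mb4('Great 🤩'));  // bool(true)
```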

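The small but subtle difference is the memory consumption of these two character sets. Both are variable-length encodings: in variable-length columns (VARCHAR, TEXT) a character occupies only as many bytes as it needs, so the same text takes the same space in UTF-8 and in UTF-8MB4. The difference appears wherever mySQL has to plan for the widest possible character. There it counts 3 bytes per character for UTF-8 and 4 bytes per character for UTF-8MB4. This typically affects fixed-length CHAR columns, MEMORY tables (for example internal temporary tables) and the maximum length of index keys. With an index key limit of 767 bytes, for example, 255 characters of a UTF-8 column can be indexed, but only 191 characters of a UTF-8MB4 column. The larger reservation can therefore influence the computation time, the selectivity of prefix indices, the size of the indices and the access time.

Example of memory consumption

character set	bytes stored (VARCHAR/TEXT)	upper bound (characters × maximum width)
“This is a little text in english, containing 54 signs.”
UTF-8:	54 bytes	162 bytes
UTF-8MB4:	54 bytes	216 bytes
“This is a big text in english, containing 16.000 signs.”
UTF-8:	16,000 bytes	48,000 bytes
UTF-8MB4:	16,000 bytes	64,000 bytes

The upper bound grows by a third with UTF-8MB4. That adds up quickly if you have many large fixed-length or indexed text fields.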
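
The byte counts for a concrete text can be checked from PHP. The following is a rough sketch with a made-up helper name. It reports the bytes of the UTF-8 encoding itself and, as a separate figure, the upper bound that results if every character is assumed to take the maximum width of 3 or 4 bytes. How much of that upper bound mySQL actually reserves depends on the column type and storage engine, which this sketch does not model.

```php
<?php
// Hypothetical helper: character count, encoded size and
// per-character upper bounds for a UTF-8 string.
function utf8_size_report(string $text): array
{
    $chars = mb_strlen($text, 'UTF-8');
    return [
        'characters'        => $chars,
        'bytes_encoded'     => strlen($text), // bytes of the UTF-8 encoding
        'max_bytes_utf8'    => $chars * 3,    // at most 3 bytes per character
        'max_bytes_utf8mb4' => $chars * 4,    // at most 4 bytes per character
    ];
}

print_r(utf8_size_report('This is a little text in english, containing 54 signs.'));
```

For the pure ASCII sample text this reports 54 characters, 54 encoded bytes and upper bounds of 162 and 216 bytes.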

UTF-8 and HTML

Today, all modern browsers are able to render UTF-8 encoded characters. Therefore, there is usually no longer any need to output special characters such as umlauts in HTML entity format.

In any case, it helps the browser if at least one of the following meta tags is present in the HEAD section of the HTML document. The second form is the HTML5 shorthand.
<PHP>
<meta http-equiv="content-type" content="text/html;charset=utf-8">
<meta charset="utf-8">
</PHP>
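
On the PHP side, the same declaration can be sent as an HTTP header, and text can be escaped without turning non-ASCII characters into entities. This is a minimal sketch: html_text() is a hypothetical helper name, and the headers_sent() guard is only there so that the snippet also runs when output has already started.

```php
<?php
// Hypothetical helper: escape text for HTML output. Only the characters
// with a special meaning in HTML (&, <, >, quotes) are converted;
// umlauts and emoji stay as plain UTF-8.
function html_text(string $text): string
{
    return htmlspecialchars($text, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');
}

// Declare the encoding in the HTTP response header as well
if (!headers_sent()) {
    header('Content-Type: text/html; charset=utf-8');
}

echo html_text('Käse & <b>🤩</b>'), "\n";  // Käse &amp; &lt;b&gt;🤩&lt;/b&gt;
```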

Collations

Collations do not exist in PHP. They are purely properties of text fields in database tables.

At first, a collation has nothing to do with storage. It only determines the rules by which texts are compared and sorted.

Now, if we look at the selection list of possible collations, we find a long list of options for each character set. On the one hand, there are collations optimized for one particular language; on the other hand, there are values like _unicode_ci and _general_ci, which work across languages.
The difference between _unicode_ci and _general_ci is mainly one of processing speed versus accuracy. Sorting with _general_ci is faster, but somewhat less accurate than with _unicode_ci. Since our databases are relatively small (10,000 entries in a table are very few in the world of databases) and have to support multiple languages, we usually use the _unicode_ci collation of the UTF-8MB4 character set, i.e. utf8mb4_unicode_ci. But there is no rule without exception: text fields that contain only 7-bit characters (ASCII), such as `passwordhash`, `rememberkey` and the like, are defined with the character set 'ascii' and the collation 'ascii_general_ci'.
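
As an illustration of these rules, a table definition could look like the sketch below. The table name `example_users` and the columns `user_id` and `display_name` are invented for this example; only `passwordhash` and `rememberkey` are field names taken from the text above. The column lengths are assumptions as well.

```php
<?php
// Hypothetical table layout: multilingual text uses utf8mb4_unicode_ci,
// pure ASCII fields use the character set ascii with ascii_general_ci.
function example_users_ddl(): string
{
    return <<<'SQL'
CREATE TABLE `example_users` (
    `user_id`      INT UNSIGNED NOT NULL AUTO_INCREMENT,
    `display_name` VARCHAR(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci NOT NULL,
    `passwordhash` VARCHAR(255) CHARACTER SET ascii COLLATE ascii_general_ci NOT NULL,
    `rememberkey`  VARCHAR(255) CHARACTER SET ascii COLLATE ascii_general_ci NOT NULL,
    PRIMARY KEY (`user_id`)
) DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci
SQL;
}

echo example_users_ddl(), "\n";

// Possible usage with PDO (host, database name and credentials are placeholders):
// $pdo = new PDO('mysql:host=localhost;dbname=example;charset=utf8mb4', $user, $password);
// $pdo->exec(example_users_ddl());
```

The charset=utf8mb4 part of the DSN sets the connection character set, so that 4-byte characters also survive the transfer between PHP and the database.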

Links

Wikipedia UTF-8
RFC 3629 / STD 63 (2003), which establishes UTF-8 as a standard Internet protocol element
The Unicode Standard, Version 11.0, §3.9 D92, §3.10 D95 (2018 June 5)
ISO/IEC 10646:2014 §9.1
Original UTF-8 paper for Plan 9 from Bell Labs

WebsiteBaker Documentation
