
UTF8

UTF8 is a way of encoding characters.

Unicode and ISO/IEC 10646 specify a range of "code points". Think of a code point as a position on a number line running from 0 to 1,114,111 (hex 10FFFF). Each integer on the line can represent a character, e.g. 42 (hex 2A) is * (an asterisk). Committees spend many happy conferences discussing which character should map to which number, whether two characters that look the same really are the same, and so on. What matters is that, for most users of Unicode, code point == character.
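As a small illustration of "a code point is just a number", here is a sketch in Python (chosen only for convenience; the page itself is language-neutral). It uses the built-ins ord() and chr() and the standard sys.maxunicode constant. The variable name asterisk is made up for the example.

```python
import sys

# ord() gives the code point (an integer) of a one-character string;
# chr() goes the other way.
asterisk = ord("*")
print(asterisk, hex(asterisk))  # 42 0x2a
print(chr(0x2A))                # *

# The highest code point Python will accept is the top of the number line.
print(sys.maxunicode, hex(sys.maxunicode))  # 1114111 0x10ffff
```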

Unicode has not been around since the first days of computing. Before it came encodings such as ASCII, which worked on the idea that a single character would be held in a single byte. Lots and lots of software is written on the assumption that 1 byte is 1 character. The problem is that a single byte can hold only 256 distinct values, so the bigger numbers on the Unicode number line won't fit: written as plain integers they need up to three bytes (hint: the highest code point is 10FFFF in hex).
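The arithmetic behind that hint can be checked in a few lines. This is only a sketch in Python; the names highest, bits_needed and bytes_needed are invented for the example.

```python
highest = 0x10FFFF  # the highest Unicode code point (1,114,111)

# How many bits, and therefore whole bytes, does the plain integer need?
bits_needed = highest.bit_length()
bytes_needed = (bits_needed + 7) // 8
print(bits_needed, bytes_needed)  # 21 3

# Written out as a raw big-endian integer it occupies three bytes.
print(highest.to_bytes(bytes_needed, "big").hex())  # 10ffff
```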

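UTF8 takes an integer from the Unicode number line and converts it into a sequence of one to four bytes (more formally, octets); some bits in each byte are spent marking where a character starts and how long it is, which is why the largest code points take four bytes rather than three. One of the great things about UTF8, though, is that for the ASCII characters (the first 128 code points, 0 to 127) the encoded UTF8 form is a single byte, exactly the same as normal, plain ASCII.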
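To make the conversion concrete, here is a minimal sketch of an encoder in Python, following the bit layout given in the UTF8 RFC. The function name utf8_encode is hypothetical and exists only for this example; in real code you would call the built-in str.encode("utf-8"), which the sketch checks itself against.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one code point as UTF8 (illustrative sketch only)."""
    # Reject values off the number line and the surrogate range,
    # which UTF8 does not encode.
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not an encodable Unicode code point")
    if code_point < 0x80:        # up to 7 bits:  0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:       # up to 11 bits: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:     # up to 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # up to 21 bits: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


# An asterisk, e-acute, the euro sign and an emoji: 1, 2, 3 and 4 bytes.
for ch in "*\u00e9\u20ac\U0001F600":
    encoded = utf8_encode(ord(ch))
    assert encoded == ch.encode("utf-8")  # agrees with Python's own encoder
    print(f"U+{ord(ch):04X} -> {encoded.hex()}")

# Plain ASCII text is byte-for-byte identical in UTF8.
assert "hello".encode("ascii") == "hello".encode("utf-8")
```

The asterisk comes out as the single byte 2a, the same as in ASCII, while the other three characters take two, three and four bytes respectively.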

For greater detail, see the RFCs and Wikipedia.

UTF8 RFCs

