Unicode and UTF-8
Andrey Frolov
Posted on January 6, 2021
Long story short
In the 1960s, there were teleprinters and simple devices where you press a key, it sends a collection of numbers, and the same letter comes out on the other side. But there was no standard, so in the mid-1960s America settled on the American Standard Code for Information Interchange (ASCII).
It's a 7-bit binary system. Every character you type gets converted into 7 binary digits and sent.
In a nutshell, it means you can have numbers from 0 to 127.
(64) (32) (16) (8) (4) (2) (1)
0 0 0 0 0 0 0 = 0
1 1 1 1 1 1 1 = 127
There is a clever detail here. A in this system is 65, which in binary is 1000001:
1000000 = 64
0000001 = 1
A = 64 + 1 = 1000001
Let's look at B and C:
B = 1000010
C = 1000011
And here's the hack: you can just knock off the first two digits and know the letter's position in the alphabet. Lowercase letters sit 32 positions later, which means for a:
a = 97 = 1100001
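The bit trick above can be checked with a few lines of Python. This is only a minimal sketch: the helper names alphabet_position and toggle_case are made up for illustration, and both assume the input is a single ASCII letter.

```python
def alphabet_position(letter: str) -> int:
    """Position of an ASCII letter in the alphabet (A/a -> 1, B/b -> 2, ...)."""
    # Keep only the low 5 bits, i.e. knock off the first two of the 7 digits.
    return ord(letter) & 0b0011111


def toggle_case(letter: str) -> str:
    """Switch between upper and lower case by flipping the bit worth 32."""
    return chr(ord(letter) ^ 0b0100000)


for ch in "ABCa":
    # Prints the character, its number, its 7-bit form and its alphabet position.
    print(ch, ord(ch), format(ord(ch), "07b"), alphabet_position(ch))

print(toggle_case("A"))  # a
```

Because upper and lower case differ only in the bit worth 32, switching case is a single bit flip.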
And it became the standard for the English-speaking world.
New day new problems
What about languages that don't have an alphabet at all? They all came up with their own encodings. But with a new day come new computers. We move to 8-bit computers, so now there is a whole extra bit at the start of every character: 128 spare values on top of 7-bit ASCII!
But no one settled on the same standard at this time. Japan went and created its own multibyte encodings, with more letters and more bits for each individual character. From this point on, everything became massively incompatible!
But most of the time this wasn't a problem: you just printed a document and faxed it. And then the World Wide Web hit, and suddenly documents were being sent all over the world. This is where the Unicode Consortium comes in.
Unicode to the rescue
Unicode now has a list of more than a hundred thousand characters that covers everything you could possibly want to write in any language (even if it's emoji language 😃). In effect, the Unicode Consortium assigns 100,000+ characters to 100,000+ numbers. They don't define any binary representation; they just say: hey, that Japanese character is number 5700 and this Cyrillic character is 1000-something.
About Unicode standard
Unicode works with the following terms:
Abstract character - a unit of information used for the organization, control, or representation of textual data.
Unicode deals with characters as abstract terms. Every abstract character has an associated name, e.g. LATIN SMALL LETTER A. The rendered form (glyph) of this character is a.
Code point - a number assigned to a single character.
Code points are numbers in the range from U+0000 to U+10FFFF.
U+<hex> is the format of code points, where U+ is a prefix meaning Unicode and <hex> is a number in hexadecimal. For example, U+0041 and U+2603 are code points.
Remember that a code point is a simple number. And that’s how you should think about it. The code point is a kind of index of an element in an array.
The magic happens because Unicode associates a code point with a character. For example, U+0041 corresponds to the character named LATIN CAPITAL LETTER A (rendered as A), and U+2603 corresponds to the character named SNOWMAN (rendered as ☃).
Not all code points have associated characters. 1,114,112 code points are available (the range U+0000 to U+10FFFF), but only 137,929 (as of May 2019) have assigned characters.
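To make the link between characters, code points and names concrete, here is a small Python sketch. It relies on the built-in ord and chr functions and the standard unicodedata module; the describe helper and the TOTAL_CODE_POINTS constant are names invented for this example.

```python
import unicodedata


def describe(ch: str) -> str:
    """Return the code point in U+<hex> form followed by the character name."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


# The size of the range U+0000..U+10FFFF.
TOTAL_CODE_POINTS = 0x10FFFF + 1

print(describe("A"))       # U+0041 LATIN CAPITAL LETTER A
print(describe("☃"))       # U+2603 SNOWMAN
print(chr(0x2603))         # ☃ (from code point back to character)
print(TOTAL_CODE_POINTS)   # 1114112
```

Note that ord and chr only convert between a character and its number; nothing here says anything yet about how that number is stored as bytes.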
Code unit - a bit sequence used to encode each character within a given encoding form.
The character encoding is what transforms abstract code points into physical bits: code units. In other words, the character encoding translates the Unicode code points to unique code unit sequences.
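As a rough illustration, the same code point turns into different code unit sequences depending on the encoding. The sketch below uses Python's built-in codecs; the code_units helper is a hypothetical name, and the unit sizes (1, 2 and 4 bytes) are passed in by hand rather than looked up anywhere.

```python
def code_units(text: str, encoding: str, unit_bytes: int) -> list[str]:
    """Encode text and split the resulting bytes into code units, shown as hex."""
    data = text.encode(encoding)
    return [data[i:i + unit_bytes].hex() for i in range(0, len(data), unit_bytes)]


snowman = "☃"  # code point U+2603

print(code_units(snowman, "utf-8", 1))      # ['e2', '98', '83']
print(code_units(snowman, "utf-16-be", 2))  # ['2603']
print(code_units(snowman, "utf-32-be", 4))  # ['00002603']
```

One code point, three different byte layouts: that difference is exactly what a character encoding defines.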
What is UTF
As we have seen, Unicode first and foremost defines a table of code points for characters. That's a fancy way of saying "65 stands for A, 66 stands for B and 9,731 stands for ☃" (seriously, it does). How these code points are actually encoded into bits is a different topic, and that is the job of the UTF encodings.
What problems UTF solves
To encode 100,000 characters we need at least 17 binary digits (2 ^ 17 = 131,072, the first power of two above 100,000), and the English alphabet should stay exactly the same (for backward compatibility): A should still be 65. In practice you would round that up to a fixed 32 bits per character. So if you have just a string of English text, every character carries 25 or more leading zeros and only a few bits of actual information. This is incredibly wasteful: every English text file would take four times the space on disk.
To summarise:
- Problem 1. You have to get rid of all those wasted zeros in English text.
- Problem 2. There are a lot of old computers that interpret 8 zeros in a row as NULL, meaning "this is the end of the string of characters". So if you send 8 zeros in a row, they just stop listening. So you can't have 8 zeros in a row anywhere.
- Problem 3. It has to be backward compatible. If you send a UTF-encoded string of English text to a system that only supports ASCII, you should still get valid English text.
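The first two problems are easy to see by comparing a fixed 32-bit encoding with UTF-8. The sketch below uses Python's built-in "utf-32-be" codec as a stand-in for "32 bits per character"; the variable names are only illustrative.

```python
text = "Hello"

utf32 = text.encode("utf-32-be")  # fixed 4 bytes per character, no byte order mark
utf8 = text.encode("utf-8")       # 1 byte per character for plain English text

print(len(utf32), len(utf8))  # 20 5
print(utf32.hex(" ", 4))      # 00000048 00000065 0000006c 0000006c 0000006f

# Problem 1: most of the fixed-width bytes are pure padding.
zero_bytes_utf32 = utf32.count(0)
print(zero_bytes_utf32)       # 15

# Problem 2: a zero byte (8 zeros in a row) would end the string on old systems.
print(b"\x00" in utf32, b"\x00" in utf8)  # True False
```

Here 15 of the 20 bytes in the fixed-width version are zero bytes, while the UTF-8 version contains none.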
How UTF solves such problems
To get started, it just uses ASCII: if you have something under 128, it can be expressed in 7 digits, with a leading 0 to fill the byte. So in UTF-8, A is encoded the same way:
A = 01000001 = 65
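A quick way to convince yourself of this is to check every value below 128. The sketch below is only a sanity check against Python's built-in UTF-8 codec; the helper name ascii_range_is_unchanged is made up for the example.

```python
def ascii_range_is_unchanged() -> bool:
    """True if every code point below 128 encodes to one byte with the same value."""
    return all(chr(cp).encode("utf-8") == bytes([cp]) for cp in range(128))


encoded_a = "A".encode("utf-8")

print(format(encoded_a[0], "08b"))  # 01000001
print(encoded_a.decode("ascii"))    # A (the same byte is also valid ASCII)
print(ascii_range_is_unchanged())   # True
```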
So it's valid both as UTF-8 and as ASCII. Now let's go above that, and as you remember, it still has to coexist with ASCII. For this we use the following headers:
110 - the header that starts a new character; two ones mean two bytes (a byte being 8 bits)
10 - means a continuation
So let's take a look at an example:
 ___________________________   ________________________________
|                           | |                                |
 110            x x x x x      10                    x x x x x x
 (start header) (5 bits)       (continuation header) (6 bits)
So now you can just take all the bits, excluding the headers, and you get:
x x x x x = 5 bits
x x x x x x = 6 bits
0 0 1 1 0 <> 1 1 0 0 1 0 = 00110110010 = 434
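The two-byte layout can be reproduced by hand with a few bit operations. This is a minimal sketch, not a full encoder: encode_two_bytes and decode_two_bytes are hypothetical helper names, and they only handle values that fit in the 5 + 6 = 11 payload bits of a two-byte sequence.

```python
def encode_two_bytes(code_point: int) -> bytes:
    """Pack a code point in the range 128..2047 into the 110xxxxx 10xxxxxx layout."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("code point does not belong in a two-byte sequence")
    first = 0b11000000 | (code_point >> 6)           # start header + top 5 bits
    second = 0b10000000 | (code_point & 0b00111111)  # continuation header + low 6 bits
    return bytes([first, second])


def decode_two_bytes(data: bytes) -> int:
    """Strip both headers and glue the 5 + 6 payload bits back together."""
    first, second = data
    return ((first & 0b00011111) << 6) | (second & 0b00111111)


encoded = encode_two_bytes(434)
print([format(b, "08b") for b in encoded])  # ['11000110', '10110010']
print(decode_two_bytes(encoded))            # 434
print(encoded == chr(434).encode("utf-8"))  # True
```

The last line compares the hand-made bytes with what Python's own UTF-8 codec produces for code point 434.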
But what about above that?
You use a header starting with 1110, which means that you have 3 bytes: one start header and 2 continuation headers:
 ________________   __________________   __________________
|                | |                  | |                  |
 1110 x x x x       10 x x x x x x       10 x x x x x x
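You can go even higher: the original specification went up to a 1111110x header (six bytes), although the current standard (RFC 3629) stops at 11110xxx (four bytes), which is enough to reach U+10FFFF. So this hack avoids waste, it's backward compatible, and at no point does it ever send 8 zeros in a row (unless you really mean the NULL character).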
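Putting the headers together, a toy encoder might look like the sketch below. It assumes the modern four-byte limit of UTF-8 (code points up to U+10FFFF), the function name utf8_encode is invented for this example, and unlike a real codec it does not reject surrogate code points. In practice you would simply call str.encode("utf-8").

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one code point with the header scheme described above."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point out of range")
    if code_point < 0x80:
        return bytes([code_point])       # plain ASCII: 0xxxxxxx
    if code_point < 0x800:
        length, header = 2, 0b11000000   # 110xxxxx
    elif code_point < 0x10000:
        length, header = 3, 0b11100000   # 1110xxxx
    else:
        length, header = 4, 0b11110000   # 11110xxx
    continuation = []
    for _ in range(length - 1):
        # Take the lowest 6 bits and put the 10 continuation header in front.
        continuation.append(0b10000000 | (code_point & 0b00111111))
        code_point >>= 6
    # Whatever is left fits next to the start header in the first byte.
    return bytes([header | code_point] + continuation[::-1])


for ch in ["A", "☃", "😃"]:
    encoded = utf8_encode(ord(ch))
    print(ch, encoded.hex(" "), encoded == ch.encode("utf-8"))
```

Every byte of a multi-byte sequence starts with a 1, so no zero byte can appear unless the code point itself is 0.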
The bottom line
Thanks for reading the post and for your time. If you have any questions, feel free to write a comment below. I know that I made a lot of simplifications, but I'm ready to fix them.
Feel free to ask questions, express any opinion, and discuss this from your point of view. Make code, not war. ❤️