Unicode and UTF-8

frolovdev

Andrey Frolov

Posted on January 6, 2021

Long story short

In the 1960s, there were teleprinters and similar simple devices: you pressed a key, it sent a sequence of numbers, and the same letter came out on the other side. But there was no common standard, so in the mid-1960s America settled on the American Standard Code for Information Interchange (ASCII).

It's a 7-bit binary system. Any character you type gets converted into 7 binary digits and sent.

In a nutshell, it means you can have numbers from 0 to 127.

(64) (32) (16) (8) (4) (2) (1)
  0    0    0   0   0   0   0 = 0
  1    1    1   1   1   1   1 = 127

Here's the interesting part: they did a clever thing. A in this system is 65, which in binary is 1000001:

1000000 = 64
0000001 = 1

A = 64 + 1 = 1000001

Let's look at B and C:

B = 1000010
C = 1000011

And here's the hack: you can just knock off the first two digits and read off the letter's position in the alphabet. Lowercase letters start 32 numbers later, which means for a:

a = 97 = 1100001
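
The bit trick above is easy to check with Python's built-in `ord`. This is only a small sketch: the helper name `alphabet_position` is made up for illustration, and it assumes its input is a plain ASCII letter.

```python
def alphabet_position(letter: str) -> int:
    """Return 1 for A/a, 2 for B/b, ... by keeping only the low 5 bits.

    Hypothetical helper; only meaningful for ASCII letters.
    """
    return ord(letter) & 0b11111  # "knock off" the first two of the 7 digits


for ch in "ABCa":
    # Prints the character, its number, its 7-bit form and its position.
    print(ch, ord(ch), format(ord(ch), "07b"), alphabet_position(ch))

# Lowercase letters sit exactly 32 positions after their uppercase twins.
assert ord("a") - ord("A") == 32
```

Masking with `0b11111` keeps the last five digits, so both A (1000001) and a (1100001) come out as position 1.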

And it became the standard for the English-speaking world.

New day new problems

What about languages that don't have an alphabet at all? They all came up with their own encodings. But with a new day come new computers: we moved to 8-bit computers. So now every character has a whole extra bit at the start, which leaves 128 more values (128 to 255) on top of 7-bit ASCII for everyone to fill in their own way!

But no one settled on the same standard at this time. Japan went and created its own multibyte encodings, with more letters and more bits for each individual character. From this point on, everything became massively incompatible!

Most of the time, though, this wasn't a problem: you just printed a document and faxed it. Then the World Wide Web hit, and suddenly documents were being sent all over the world. This is where the Unicode Consortium comes in.

Unicode to the rescue

Unicode now has a list of more than a hundred thousand characters that covers everything you could possibly want to write in any language (even if it's emoji language 😃). In other words, the Unicode Consortium assigns each of those 100,000+ characters its own number. They don't define any binary representation; they just say: hey, that Japanese character あ is number 12,354, and this Cyrillic character is 1,000-something.

About Unicode standard

So in Unicode, we work with the following terms:

Abstract character - a unit of information used for the organization, control, or representation of textual data.

Unicode deals with characters as abstract terms. Every abstract character has an associated name, e.g. LATIN SMALL LETTER A. The rendered form (glyph) of this character is a.


Code point - a number assigned to a single character.
Code points are numbers in the range from U+0000 to U+10FFFF.

U+<hex> is the format of code points, where U+ is a prefix meaning Unicode and <hex> is a number in hexadecimal. For example, U+0041 and U+2603 are code points.

Remember that a code point is a simple number. And that’s how you should think about it. The code point is a kind of index of an element in an array.

The magic happens because Unicode associates a code point with a character. For example U+0041 corresponds to the character named LATIN CAPITAL LETTER A (rendered as A), or U+2603 corresponds to the character named SNOWMAN (rendered as ☃).

Not all code points have associated characters. 1,114,112 code points are available (the range U+0000 to U+10FFFF), but only 137,929 (as of May 2019) have assigned characters.
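
To make the "code point is just a number" idea concrete, here is a small sketch using Python's standard `ord`, `chr` and `unicodedata.name`. The helper `describe` and the constant `TOTAL_CODE_POINTS` are names invented for this example.

```python
import unicodedata


def describe(ch: str) -> str:
    """Format one character as 'U+XXXX NAME' (hypothetical helper)."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


print(describe("A"))       # U+0041 LATIN CAPITAL LETTER A
print(describe("\u2603"))  # U+2603 SNOWMAN
print(chr(0x2603))         # the snowman itself, looked up by its number

# The range U+0000..U+10FFFF contains this many code points in total.
TOTAL_CODE_POINTS = 0x10FFFF + 1
print(TOTAL_CODE_POINTS)   # 1114112
```

`ord` goes from character to code point and `chr` goes back, which is exactly the "index in an array" picture from above.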


Code unit - a bit sequence used to encode each character within a given encoding form.

The character encoding is what transforms abstract code points into physical bits: code units. In other words, the character encoding translates the Unicode code points to unique code unit sequences.
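
As a rough illustration, the same code point turns into different code units depending on the encoding. The sketch below uses Python's built-in codecs; `code_units_hex` is a hypothetical helper, and the big-endian codec names are chosen only to avoid a byte-order mark in the output.

```python
snowman = "\u2603"  # one abstract character, code point U+2603


def code_units_hex(text: str, encoding: str) -> str:
    """Encode text and show the resulting bytes as hex (hypothetical helper)."""
    return text.encode(encoding).hex(" ")


print(code_units_hex(snowman, "utf-8"))      # e2 98 83    (three 8-bit code units)
print(code_units_hex(snowman, "utf-16-be"))  # 26 03       (one 16-bit code unit)
print(code_units_hex(snowman, "utf-32-be"))  # 00 00 26 03 (one 32-bit code unit)
```

The code point never changes; only the bits used to store it do.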

What is UTF

As we've seen, Unicode first and foremost defines a table of code points for characters. That's a fancy way of saying "65 stands for A, 66 stands for B and 9,731 stands for ☃" (seriously, it does). How these code points are actually encoded into bits is a different topic, and that is the job of the UTF encodings.

What problems UTF solves

To encode 100,000 characters we need at least 17 binary digits (2^17 = 131,072 is the first power of two above 100,000), and since computers work in whole bytes, the simple fixed-width answer is 32 bits (4 bytes) per character. The English alphabet should stay exactly the same (for backward compatibility): A should still be 65. So if you have just a string of English text, you're encoding it at 32 bits per character: 25 leading zeros, followed by only 7 bits that carry any information. This is incredibly wasteful: every English text file takes four times the space on disk.
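
The numbers in this paragraph can be checked quickly. In this sketch Python's `utf-32-be` codec stands in for the fixed 32-bits-per-character scheme described above (that choice is an assumption of the example, not something the post names), and the variable names are illustrative.

```python
# How many binary digits do we need?
print((100_000).bit_length())   # 17 bits for 100,000 characters
print((0x10FFFF).bit_length())  # 21 bits for the full Unicode range

text = "Hello"
one_byte_each = text.encode("ascii")
four_bytes_each = text.encode("utf-32-be")  # fixed 32 bits per character
print(len(one_byte_each), len(four_bytes_each))  # 5 20

# 'A' as a 32-bit number: 25 leading zeros, then the 7 ASCII bits.
wide_a = format(ord("A"), "032b")
print(wide_a)  # 00000000000000000000000001000001
```

Five letters cost twenty bytes, which is the "four times the space" problem in miniature.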

To summarise:

  • Problem 1. You have to get rid of all those wasted zeros in English text.
  • Problem 2. There are a lot of old computers that interpret 8 zeroes in a row as NULL, which marks the end of a string of characters. So if you send 8 zeroes in a row, they just stop listening. That means you can't have 8 zeroes in a row anywhere.
  • Problem 3. It has to be backward compatible. If you send a UTF-encoded string of English text to a system that only supports ASCII, it should still get valid English text.
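
Problem 2 can be sketched in a few lines. `read_c_string` below is a made-up stand-in for an old system that treats a zero byte as the end of the string, and `utf-32-be` again stands in for the fixed 32-bit encoding.

```python
def read_c_string(buffer: bytes) -> bytes:
    """Hypothetical legacy reader: stop listening at the first all-zero byte."""
    end = buffer.find(b"\x00")
    return buffer if end == -1 else buffer[:end]


wide = "Hi".encode("utf-32-be")
print(wide.hex(" "))              # 00 00 00 48 00 00 00 69
print(read_c_string(wide))        # b'' (it stopped at the very first byte)
print(read_c_string(b"Hi"))       # b'Hi' (plain ASCII gets through intact)
```

With 32 bits per character, every English letter is preceded by zero bytes, so the old reader gives up before it sees any text.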

How UTF solves such problems

To start with, UTF-8 just uses ASCII: anything under 128 can be expressed in 7 digits, so it is stored as a single byte with a leading 0. In UTF-8, A is encoded exactly the same way:

A = 01000001 = 65
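
A quick way to convince yourself of this, using only Python's built-in codecs (the sample string and variable names are arbitrary):

```python
# 'A' is the same single byte in ASCII and in UTF-8.
assert "A".encode("utf-8") == "A".encode("ascii") == bytes([0b01000001])

english = "Plain English text"
utf8_bytes = english.encode("utf-8")

# Every byte is below 128, so an ASCII-only system reads it unchanged.
print(all(byte < 128 for byte in utf8_bytes))   # True
print(utf8_bytes.decode("ascii") == english)    # True
```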

So it's valid as both UTF-8 and ASCII. Now let's go above that, and as you remember, it still has to coexist with ASCII. For this we use the following headers:

110 - the header that starts a new character; the two ones mean the character takes two bytes (a byte being 8 bits)

10 - means a continuation

So let's take a look at an example:

 __________________________ ______________________________________
|                          |                                      |
 110           x x x x x     10                    x x x x x x
(the starter)  (5 bits)      (continuation header) (6 bits)

So now you can take all the bits, excluding the headers, and you get:

x x x x x = 5 bits
x x x x x x = 6 bits

0 0 1 1 0 <> 1 1 0 0 1 0 = 434
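
The same steps can be written as a short decoding sketch. `decode_two_bytes` is a hypothetical helper that only handles the two-byte form shown above; the last lines compare it with Python's built-in UTF-8 decoder.

```python
def decode_two_bytes(first: int, second: int) -> int:
    """Strip the 110 and 10 headers and join the 5 + 6 payload bits."""
    assert first >> 5 == 0b110, "first byte must start with 110"
    assert second >> 6 == 0b10, "second byte must start with 10"
    return ((first & 0b11111) << 6) | (second & 0b111111)


# 110 00110 and 10 110010, the example bits from above.
code_point = decode_two_bytes(0b11000110, 0b10110010)
print(code_point)  # 434

# Python's own decoder agrees on which character these two bytes are.
raw = bytes([0b11000110, 0b10110010])
print(raw.decode("utf-8") == chr(434))  # True
```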

But what about above that?

You use the 1110 starting header, which means the character takes 3 bytes: one starting byte and 2 continuation bytes:

 _________________ __________________ ________________
|                 |                  |                |
 1110 x x x x       10  x x x x x x    10 x x x x x x
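
Going the other way, here is a minimal sketch that packs a code point into the three-byte layout above. `encode_three_bytes` is an illustrative name, and the sketch deliberately ignores details a real encoder must handle (for example, the surrogate range U+D800 to U+DFFF is not valid).

```python
def encode_three_bytes(code_point: int) -> bytes:
    """Pack a code point from U+0800..U+FFFF as 1110xxxx 10xxxxxx 10xxxxxx."""
    assert 0x0800 <= code_point <= 0xFFFF, "only the 3-byte range is handled"
    return bytes([
        0b11100000 | (code_point >> 12),             # header 1110 + top 4 bits
        0b10000000 | ((code_point >> 6) & 0b111111),  # header 10 + middle 6 bits
        0b10000000 | (code_point & 0b111111),         # header 10 + low 6 bits
    ])


encoded = encode_three_bytes(0x2603)  # the snowman from earlier
print(encoded.hex(" "))               # e2 98 83
print(encoded == "\u2603".encode("utf-8"))  # True
```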

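And you can go even higher: the original specification went all the way up to 1111110x (6 bytes), although UTF-8 as standardized today (RFC 3629) stops at the 4-byte header 11110xxx, which is enough to reach U+10FFFF. So this hack avoids waste, it's backward compatible, and at no point does it send 8 zeroes in a row (unless you really mean the NULL character itself).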
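
To tie the headers together, the sketch below reads the length of a character from its first byte and checks the "no 8 zeroes in a row" property on a few samples. `sequence_length` is a hypothetical helper and covers only the 1 to 4 byte forms used by modern UTF-8.

```python
def sequence_length(first_byte: int) -> int:
    """Return how many bytes a character occupies, judging by its first byte."""
    if first_byte >> 7 == 0b0:
        return 1  # 0xxxxxxx: plain ASCII
    if first_byte >> 5 == 0b110:
        return 2  # 110xxxxx
    if first_byte >> 4 == 0b1110:
        return 3  # 1110xxxx
    if first_byte >> 3 == 0b11110:
        return 4  # 11110xxx
    raise ValueError("continuation byte or unsupported header")


samples = ["A", "\u01b2", "\u2603", "\U0001F603"]  # 1, 2, 3 and 4 byte characters
for ch in samples:
    data = ch.encode("utf-8")
    assert sequence_length(data[0]) == len(data)  # the header tells the length
    assert 0 not in data                          # never an all-zero byte
    print(f"U+{ord(ch):04X}", data.hex(" "))
```

Because continuation bytes always start with 10 and starting bytes never do, a reader that lands in the middle of a character can also tell it needs to skip forward to the next header.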

The bottom line

Thanks for reading the post and for your time. If you have any questions, feel free to write a comment below. I know I've made a lot of simplifications, but I'm ready to fix them.

Feel free to ask questions, to express any opinion, and discuss this from your point of view. Make code, not war. ❤️
