Dev 101: Unicode

lrgranger

Ray

Posted on March 6, 2019


What is it?

Unicode is a standard for encoding characters, established by The Unicode Consortium. It is the most widely used such standard, it makes internationalization easier, and it's the reason we can have emoji 😍

Why do we need it?

If you gather, display, transmit, or otherwise use strings on the internet, in desktop software, or in mobile apps, then you need to encode them. Unicode maintains the most popular encoding, UTF-8 [1]. Understanding it is key to being a competent developer [2]. (It was also named one of the 12 Things Every Junior Developer Should Learn [3].) Failing to understand and properly use it can result in those empty boxes, odd-looking question marks, and general frustration for you and whoever wants to use whatever it is you've coded up.
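To see the kind of garbage a mismatched encoding produces, here's a minimal Python sketch (the string "café" is just an example): it encodes text as UTF-8 and then decodes the bytes with the wrong encoding, Latin-1.

```python
# Encode a string as UTF-8, then decode the bytes with the wrong
# encoding (Latin-1). The accented character turns into mojibake.
text = "café"
raw = text.encode("utf-8")      # b'caf\xc3\xa9'
wrong = raw.decode("latin-1")   # 'cafÃ©' -- the "odd-looking" characters
print(raw, wrong)
```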

Where did it come from?

Let's back up and talk about where letters come from. Spoken language is a way we humans have figured out how to encode meaning in sound. If I say "Hello" you understand that I'm greeting you. Written language encodes those sounds, so when I write "World", you understand I'm talking about our planet and all of humanity. That is, if you understand the spoken and written language I'm using.

If you don't understand the language I'm using, then my sounds or arrangement of symbols mean nothing. Computers don't understand human languages; they operate using binary. So if we want to use a computer to gather, display, transmit, or otherwise understand a human language, we have to give it instructions in binary.

Encodings between binary values and characters make this possible. Different types of computers used to have different encodings, and this worked just fine until we wanted to share information between computers. To do that we need an encoding standard that both computers share, so that when a set of binary values is received the correct characters are displayed or saved. In 1963 the American Standard Code for Information Interchange (ASCII) became that standard.

ASCII has uppercase and lowercase letters, punctuation, symbols, and control codes. It lacked accented characters like è, non-American symbols like £, and any non-Latin characters, but it got the job done. The ASCII set defined 128 characters, which left 128 values of an 8-bit byte free. This led to a hybrid of pre-standard and standard encodings: the first 128 characters were the ASCII standard, and the remaining 128 values were used by different groups to encode different symbols and letters.
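For a concrete feel of that limitation, here's a small Python sketch (the strings are arbitrary examples): ASCII happily maps plain English characters to 7-bit values, but refuses anything outside its 128 characters.

```python
# ASCII maps each character to a number in 0-127.
for ch in "Hi!":
    print(ch, ord(ch), format(ord(ch), "07b"))  # e.g. H 72 1001000

# Anything outside those 128 characters simply can't be encoded.
try:
    "è".encode("ascii")
except UnicodeEncodeError as err:
    print(err)  # 'ascii' codec can't encode character '\xe8' ...
```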

Clashes happened between the ways different machines used those last 128 values, but it was good enough, and ASCII remained the standard for three decades. The need to support many more languages and their characters led to the publication of Unicode in 1991. Unicode has evolved a bit over the years, from all 16-bit characters to the current dominant encoding, UTF-8, which is a variable-length encoding.

In UTF-8 the first 128 characters are encoded exactly as they are in ASCII. This means that the most commonly used characters on the internet still use just one byte per character. Beyond that, characters aren't limited to one byte; instead they can take two, three, or four! Having the set of characters supported by ASCII take up the least amount of space in UTF-8 makes sense given that most of the internet uses UTF-8 (about 93% of websites), and it still gives us access to all the other characters that Unicode supports.
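You can see the variable-length encoding directly in Python (the characters below are just examples): an ASCII letter takes one byte, while other characters take two, three, or four.

```python
# UTF-8 uses 1-4 bytes per character, depending on the code point.
for ch in ["A", "£", "€", "😍"]:
    encoded = ch.encode("utf-8")
    print(ch, f"U+{ord(ch):04X}", len(encoded), "byte(s)", encoded)

# A   U+0041  1 byte(s)  b'A'
# £   U+00A3  2 byte(s)  b'\xc2\xa3'
# €   U+20AC  3 byte(s)  b'\xe2\x82\xac'
# 😍  U+1F60D 4 byte(s)  b'\xf0\x9f\x98\x8d'
```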

So what doesn't Unicode do well? Most Unicode issues stem from the fact that Unicode provides single characters which are then displayed in different ways depending on the font. Characters that are different but look similar make homograph attacks possible. Chinese, Japanese, and Korean share a unified character set in Unicode, relying on different fonts to differentiate the way each language displays those characters. In languages like Arabic and Vietnamese, single characters are connected with ligatures to make glyphs, meaning that any given character might look different depending on what characters it's connected to (think cursive writing in English). For these languages Unicode and fonts aren't enough, and secondary processing needs to be done to display them correctly.
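As a quick illustration of the homograph problem, here's a small Python sketch (the domain name is a made-up example): the Latin letter "a" and the Cyrillic letter "а" usually render identically, yet they are different code points, so the two strings are not equal.

```python
# Two visually identical strings built from different code points.
latin = "apple.com"
lookalike = "\u0430pple.com"   # first letter is Cyrillic U+0430, not Latin 'a'

print(latin == lookalike)                          # False
print(hex(ord(latin[0])), hex(ord(lookalike[0])))  # 0x61 0x430
```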

And last, but not least, the new international language, emoji, is subject to this font issue too. Emoji are a bit more consistent than they used to be, but differences still exist between platforms. The most notable difference is the dizzy face, which reads more like death on some platforms. [Image: seven yellow faces in a row, all renderings of the same Unicode character, 😵, on different platforms; some are drawn with X eyes and some with swirl eyes.]

Summary

  • You should care about character encoding because it is how we store and display language on computers.
  • Unicode, and specifically UTF-8, is the most widely used character encoding on the internet.
  • It uses variable-length encoding to give us fast loading for what we use most, while also providing us with more than 137,000 other characters.
  • Unicode, like ASCII before it, is character encoding only, relying on fonts for language differentiation and emoji style.
  • Emoji are 🎉 🙌🏼 💻 🔥 😍

Discuss
What's your fav emoji? What are your top used emoji? What do they mean to you/what are you trying to say with them when you use them?
