Paweł bbkr Pabian
Posted on July 27, 2023
A long time ago computers mostly used the ASCII encoding. ASCII stands for American Standard Code for Information Interchange and was a very early attempt (first revision published in 1963) to unify the binary representation of text.
Originally it used 7 bits, allowing it to store 2^7 = 128 characters. Lowercase and uppercase Latin letters, Arabic numerals and punctuation were included. Because it was first used for teleprinter and teletype machines, it also contained a lot of control characters, which have no graphical representation but cause some effect, like moving to a new line or confirming a transmission. Here is the full list:
But what if someone needed to write another character? This is where creativity kicked in. ASCII used 7 bits, but computers mostly used 8 bits as the base register size, meaning that the smallest chunk of data interchanged between CPU, RAM and disk could hold far more (2^8 = 256 combinations) than a single ASCII character needed. This excess space was used to store characters not included in base ASCII.
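To make the arithmetic concrete, here is a tiny Python check (purely illustrative, not from the original post):

# Every ASCII character fits in 7 bits, so its byte value stays below 2**7 = 128.
print(max("Plain ASCII text!".encode("ascii")))  # highest byte value, always < 128
# A full 8-bit byte offers 2**8 = 256 values, leaving 128 codes above the ASCII range.
print(2**8 - 2**7)  # 128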
However, that extra space in a single byte could store only 128 additional characters. Way too few to write accented versions of Latin letters like ę, Cyrillic like Д, Kanji like 鰯, Greek, Cherokee, math symbols, dead alphabets like Runic, etc. Various encodings were created as ASCII extensions, each holding in this extra 128-character space only the characters needed for a specific use case. What seemed like a brilliant idea was actually a poison that tormented the computer industry for decades with:
Issue 1 - Operating system incompatibility
Some encodings were standardized by ISO, but Microsoft and Apple went their own way. So the same character ñ had two different binary values: 0xF1 in ISO-8859-1 and Microsoft CP-1252, but 0x96 if you were using the Mac OS Roman encoding. If you wanted to write Ź, it was 0xAC in ISO-8859-2 but 0x8F in Microsoft CP-1250, and you were out of luck in Mac OS Roman, which did not support it at all. It was truly a great time to receive a document from a friend using a different machine. Here are two encodings that allowed writing Polish characters, compared side by side. Above the green line is base ASCII. Below the green line is chaos, with the differences marked in red.
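You can still reproduce this divergence today with Python, which ships all of these legacy codecs (a minimal sketch; the codec names are Python's spellings of the encodings above):

# The same character maps to different bytes, or to nothing at all,
# depending on the legacy encoding.
for char in ("ñ", "Ź"):
    for codec in ("iso-8859-1", "cp1252", "iso-8859-2", "cp1250", "mac_roman"):
        try:
            print(f"{char} in {codec}: 0x{char.encode(codec)[0]:02X}")
        except UnicodeEncodeError:
            print(f"{char} in {codec}: not supported")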
Issue 2 - Encoding switching
What if you wanted to write a Spanish name in your Russian text? If the characters you needed lived in two different encodings, the text had to contain hidden instructions to switch encodings on the fly. Every office suite, every editor, every email client used its own method back then. That led to common issues with copy-pasting, because those hidden instructions were not understood by the other program. Copy-paste. Something we take for granted today was a painful experience in the past.
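The underlying limitation is easy to demonstrate (a sketch; KOI8-R is just one of several Russian code pages of that era):

# A single legacy Russian encoding cannot hold the Spanish ñ.
text = "Сеньор Muñoz прислал отчёт"
try:
    text.encode("koi8_r")
except UnicodeEncodeError as error:
    print(error)  # ñ is simply absent from KOI8-R

Without a second encoding, and some hidden marker saying where it starts, such mixed text could not be stored at all.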
How bad was it? Well, I took the liberty of checking how many ways there were to write Polish alphabet letters. I found a whopping 26 encodings:
Hint: If you deal with retro tech, iconv is your friend, allowing you to convert between 140 encodings.
iconv -f CP1250 -t ISO-8859-2 windows_file.txt > iso_file.txt
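If iconv is not at hand, the same conversion can be scripted, for example in Python (a sketch using the file names from the command above):

# Re-encode a CP1250 text file as ISO-8859-2, mirroring the iconv call.
with open("windows_file.txt", encoding="cp1250") as source:
    content = source.read()
with open("iso_file.txt", "w", encoding="iso-8859-2") as target:
    target.write(content)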
On a white horse
Unification was an urgent need, and when the Unicode Consortium announced "hey, we propose a common encoding for ALL characters to end this madness", it took the computer world by storm. Here is an interesting graphic from Wikipedia showing encoding popularity over the years 2010-2021 and the total UTF-8 domination we know today:
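With UTF-8, every character mentioned earlier lives in one encoding, with no switching and no per-country code pages (a minimal sketch):

# One encoding covers Latin, Polish, Cyrillic, Kanji and everything else.
for char in ("ñ", "Ź", "ę", "Д", "鰯"):
    print(char, char.encode("utf-8").hex(" "))

The varying byte counts in the output are the variable-length encoding that the next part covers.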
Coming up next: Variable encoding length to the rescue!