Go Lexical elements: Rune literals pt 1, Intro to Unicode
Jonathan Hall
Posted on July 7, 2023
I write a daily email about Go, and this is a repost from my series Exploring the Go Spec. You're invited to sign up and follow along.
Runes… Oh boy! This is one of the bits of Go that shines for its elegant simplicity, yet constantly trips everyone up (myself included). As such, I think this may be a 2-, or maybe even a 3-parter.
Let’s get started.
Rune literals
A rune literal represents a rune constant, an integer value identifying a Unicode code point.
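To make "an integer value identifying a Unicode code point" a bit more concrete, here is a minimal sketch. The helper name codePoint is my own, purely for illustration; the only thing it relies on is that a rune is just an integer (rune is an alias for int32).

```go
package main

import "fmt"

// codePoint is a hypothetical helper: it formats the integer value of a
// rune in the conventional U+XXXX notation.
func codePoint(r rune) string {
	return fmt.Sprintf("U+%04X", r)
}

func main() {
	a := 'a' // a rune literal; the variable gets type rune (an alias for int32)

	fmt.Println(a)            // 97
	fmt.Println(a == 97)      // true
	fmt.Println(codePoint(a)) // U+0061
	fmt.Printf("%T\n", a)     // int32
}
```

In other words, the quotes are just a convenient way of writing a number.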
If you’re already familiar with Unicode, and have a strong understanding of what a “code point” is, you can probably skip this one. See you later! 👋
If you’re not completely sure what a Unicode code point is, stick around… I’ll do my best to untangle this seemingly simple phrase.
First off, what is Unicode? It’s emojis, right? Ehh. If that’s your understanding of Unicode, which would be completely understandable if your career in software is relatively young, then you need a bit more context. I encourage you to read a bit about the history of why Unicode was invented, and the problems it was meant to solve. This PDF, Introduction to Unicode: History of Character Codes, is a good starting point.
But here are the highlights:
Before Unicode, we had many different coding systems. The most popular was ASCII, the American Standard Code for Information Interchange. But since not all of the world is American, this had obvious drawbacks. Different countries or languages would often have their own coding schemes, but this made it very difficult to share documents between regions.
If I wrote a document using the Cyrillic alphabet, then sent it to my colleague in France, he would likely see a jumble of French letters in seemingly random order.
So Unicode came along to Unify all the codes.
Great. So now instead of 128 possible characters in ASCII, we have a virtually unlimited number of characters, right?
Not so fast.
While there’s room in Unicode for more than 1 million individual code points, most are not (yet) defined. But what’s more, Unicode is smarter than ASCII in a number of ways. It is possible to combine Unicode code points to form a single physical character.
For example, if you want to display the Cyrillic letter ў, this can be done by combining the Cyrillic letter у (U+0443) with the combining breve mark (U+0306, which looks like ˘), to give you ў. But while this looks like a single character, and in print terms it is, it’s actually two Unicode code points. As such, this code won’t compile:
x := rune('ў') // typed as two code points: U+0443 followed by U+0306
Because while it visually looks like we’re quoting a single character, that character is composed of two Unicode code points, and (as we’ll see in the next section), a rune literal must be a single Unicode code point.
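Since the two forms are impossible to tell apart by eye, here is a small sketch that spells the two-code-point version out with escape sequences and counts what is actually in the string. The names decomposed and codePoints are my own, chosen just for this example.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// decomposed is the letter written as two code points:
// U+0443 (Cyrillic small letter u) followed by U+0306 (combining breve).
const decomposed = "\u0443\u0306"

// codePoints is a hypothetical helper that lists each code point in s
// in U+XXXX notation.
func codePoints(s string) []string {
	var out []string
	for _, r := range s {
		out = append(out, fmt.Sprintf("U+%04X", r))
	}
	return out
}

func main() {
	fmt.Println(decomposed)                         // usually rendered as one character
	fmt.Println(utf8.RuneCountInString(decomposed)) // 2
	fmt.Println(codePoints(decomposed))             // [U+0443 U+0306]
}
```

A string is happy to hold both code points. A rune literal is not, because it only has room for one.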
Now if that’s not confusing enough, this code will compile:
x := rune('ў') // typed as one precomposed code point: U+045E
What’s the difference?
Well, Unicode includes a number of precomposed characters. This is a nice convenience for languages that commonly use a large number of these types of diacritics. But it’s an inconvenience for us programmers. Not only does it mean that of these two character representations, only one is a valid rune, it also introduces certain headaches when trying to compare Unicode strings for equality, or when sorting, etc.
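To illustrate the equality headache, here is a rough sketch that writes both representations with escape sequences, so there is no doubt about which is which. The constant names and the runeCount helper are mine, for illustration only.

```go
package main

import "fmt"

const (
	precomposed = "\u045e"       // one code point: Cyrillic small letter short u
	decomposed  = "\u0443\u0306" // two code points: U+0443 + combining breve
)

// runeCount is a hypothetical helper returning the number of code points in s.
func runeCount(s string) int {
	return len([]rune(s))
}

func main() {
	r := '\u045e'          // valid: a rune literal holding a single code point
	fmt.Printf("%U\n", r) // U+045E

	fmt.Println(precomposed == decomposed)                      // false
	fmt.Println(runeCount(precomposed), runeCount(decomposed)) // 1 2
	fmt.Println(len(precomposed), len(decomposed))             // 2 4 (bytes of UTF-8)
}
```

Go's == on strings compares bytes, so two strings that look identical on screen can still be unequal. If you need them to compare equal, one option is to normalize both first, for example with the golang.org/x/text/unicode/norm package; I've left that out here to keep the sketch free of external dependencies.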
A last note for today, especially for anyone very new to Unicode. This concept of combining characters to add diacritics and other markings to an existing letter is the same way that Unicode emojis are modified to change skin tone, gender, or other attributes.
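As a quick sketch of the emoji case, here is a thumbs-up (U+1F44D) followed by a skin tone modifier (U+1F3FD). I picked these two code points just as an example, and the names below are my own.

```go
package main

import "fmt"

const (
	thumbsUp       = "\U0001F44D" // U+1F44D THUMBS UP SIGN
	mediumSkinTone = "\U0001F3FD" // U+1F3FD, a skin tone modifier
)

// runeCount is a hypothetical helper returning the number of code points in s.
func runeCount(s string) int {
	return len([]rune(s))
}

func main() {
	modified := thumbsUp + mediumSkinTone

	fmt.Println(modified)            // typically rendered as a single emoji
	fmt.Println(runeCount(thumbsUp)) // 1
	fmt.Println(runeCount(modified)) // 2
}
```

So the modified emoji, like the decomposed letter, looks like one character but will not fit in a single rune.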
Unicode is pretty powerful. And confusing at times.