Metal Umlauts, Searching, and Other Unicode Fun

jcolag

John Colagioia (he/him)

Posted on March 25, 2020

(You can find the original version of this article on my blog, where I talk about this and a variety of other topics.)

Unicode—the computer “alphabet” that includes all the characters you see on this page, plus most modern writing systems in common use (∂), plus punctuation and currency (௹), plus arrows and mathematical notation (↛), plus drawing symbols (✵), plus emoji (🐣), and more—has a lot going on in it beyond the obvious complexity of multiple formats (UTF-8, UTF-16, GB18030, UTF-32, BOCU, SCSU, UTF-7, and probably others) and byte orderings. The part that grabbed my interest, recently, is the idea of Normal Forms, of which Unicode has four.

  • NFD: Canonical Decomposition
  • NFC: Canonical Composition
  • NFKD: Compatibility Decomposition
  • NFKC: Compatibility Composition

Specifically, Normalization Form Canonical Decomposition interests me, because it represents each accented letter in a string as the base letter followed by any accents.

Better yet, in JavaScript (and more languages; see below), it’s easy to change normalization forms. Specifically, for these purposes, we want:

str.normalize('NFD');

These decomposed letters have some nice uses.
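To make the decomposition concrete, here is a minimal sketch that compares the composed and decomposed forms of a single accented letter. The codePoints helper is a name I made up for illustration; only normalize() comes from the language itself.

```javascript
// Compare the composed and decomposed forms of the same visible text.
const composed = '\u00e9'; // "é" stored as one precomposed code point
const decomposed = composed.normalize('NFD');

// Hypothetical helper for illustration: list code points as U+XXXX.
const codePoints = (s) =>
  Array.from(s, (ch) =>
    'U+' + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0'));

console.log(codePoints(composed));    // [ 'U+00E9' ]
console.log(codePoints(decomposed));  // [ 'U+0065', 'U+0301' ]
console.log(composed === decomposed); // false, even though both display as "é"
```

The two strings look identical on screen but compare as different, which is why normalizing both sides matters before sorting or searching.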

Sorting

At least in English, diacritical marks are usually a marker for either history (fiancée, über, soupçon, Māori, piñata) or pronunciation (naïve, coöperate), rather than as an element of spelling; some of us are sticklers for getting the accents right, but most English-speakers ignore them completely. This is especially true of names, where we generally want a person’s name to be represented properly out of respect (Karel Čapek, Charlotte Brontë, Beyoncé Knowles), when that name can come from anywhere in the world, but English treats it more as an affectation than a critical element of the name.

Of particular importance, here, is that we generally wish to sort a name with accented letters as if the accents don’t exist. So, we want piñata to sort as if it was spelled “pinata” and Čapek to sort like “Capek.”

The decomposed form allows us to do this by stripping the diacritical marks out of the string when we sort it.

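// Note: sort() reorders the strings array in place and returns that same array.
var sortedStrings = strings.sort((a,b) => {
  var aNorm = a
    .normalize('NFD')
    .replace(/[\u0300-\u036f]/g, '')
    .toLowerCase();
  var bNorm = b
    .normalize('NFD')
    .replace(/[\u0300-\u036f]/g, '')
    .toLowerCase();
  // Return 0 for equal keys, so the comparator stays consistent.
  if (aNorm < bNorm) return -1;
  if (aNorm > bNorm) return 1;
  return 0;
});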

That admittedly looks a bit complicated, given the regular expression, but the entire process boils down to decomposing each string, stripping off the diacritical marks (code points U+0300 to U+036F, the Combining Diacritical Marks block), and converting the remaining letters to lower-case. Then, we just compare the resulting strings.

In other words, by normalizing the name, the computer represents “Čapek” something like

[C] [caron] [a] [p] [e] [k]

Then, we remove any diacritical marks (only the caron, ˇ, in this case) by replacing them with nothing, leaving us with only the unaccented Latin letters.
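If you sort often, the same steps can be pulled into a small helper that computes each key once instead of on every comparison. This is only a sketch; stripDiacritics and sortIgnoringDiacritics are names I've invented for illustration, not anything built in.

```javascript
// Hypothetical helper: decompose, then drop combining marks U+0300..U+036F.
function stripDiacritics(text) {
  return text.normalize('NFD').replace(/[\u0300-\u036f]/g, '');
}

// Build each sort key once, sort by key, then return the original strings.
// Unlike Array.prototype.sort on its own, this leaves the input array alone.
function sortIgnoringDiacritics(strings) {
  return strings
    .map((original) => ({
      original,
      key: stripDiacritics(original).toLowerCase(),
    }))
    .sort((a, b) => (a.key < b.key ? -1 : a.key > b.key ? 1 : 0))
    .map((entry) => entry.original);
}

console.log(stripDiacritics('Čapek')); // Capek
console.log(sortIgnoringDiacritics(['piñata', 'Čapek', 'pint', 'Brontë', 'cape']));
// [ 'Brontë', 'cape', 'Čapek', 'piñata', 'pint' ]
```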

Or…

I can't think of a use for this idea, but it occurs to me that it's also possible to keep the diacritical marks and throw out or replace the letters.
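For the curious, a rough sketch of that inverse operation might look like the following. The keepOnlyMarks name and the choice of the dotted circle (U+25CC) as a default placeholder are my own assumptions, and it only handles unaccented ASCII letters as the base.

```javascript
// Hypothetical helper: replace each ASCII letter with a placeholder,
// leaving any combining marks that followed the letter in place.
function keepOnlyMarks(text, placeholder = '\u25cc') {
  return text.normalize('NFD').replace(/[a-z]/gi, placeholder);
}

console.log(keepOnlyMarks('Čapek'));      // a caron over a dotted circle, then four bare circles
console.log(keepOnlyMarks('é', '_') === '_\u0301'); // true
```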

Searching

Searching benefits even more than sorting from ignoring diacritical marks. For example, an increasing number of laws (with political motivations that we don't need to discuss, here) are posed as “exact match” measures, which require that voter registration documents transcribed from handwritten forms be identical to personal identification documents. That means the exactness of accents and diacritical marks relies primarily on the comprehension and interest of an underpaid, overworked data entry clerk using a keyboard that doesn't have accents on it.

By the same token, even something with much lower stakes like searching an employee directory shouldn’t rely on the person searching for Beyoncé realizing that she has an acute accent in her name and that Human Resources input her name properly.

And that just barely touches on the problem that a standard keyboard for English doesn’t have a way to type accented characters, with operating systems often adding ways that aren't exactly trivial. So, even if a user has cleared the above hurdles, it’s still a waste of the user’s time to make them hunt down the exact spelling with diacritical marks.

We can solve this problem using an approach similar to what we saw in sorting, normalizing and stripping both the target string and the corpus being searched.
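As a minimal sketch of that idea, assuming the corpus is just an array of names and that a simple substring match is good enough, we can fold both sides to the same accent-free, lower-case form before comparing. The foldForSearch and searchIgnoringDiacritics names are hypothetical.

```javascript
// Hypothetical helper: fold a string to a diacritic-free, lower-case key.
function foldForSearch(text) {
  return text
    .normalize('NFD')
    .replace(/[\u0300-\u036f]/g, '')
    .toLowerCase();
}

// Return the entries whose folded form contains the folded query.
function searchIgnoringDiacritics(entries, query) {
  const needle = foldForSearch(query);
  return entries.filter((entry) => foldForSearch(entry).includes(needle));
}

const directory = ['Beyoncé Knowles', 'Charlotte Brontë', 'Karel Čapek'];
console.log(searchIgnoringDiacritics(directory, 'beyonce')); // [ 'Beyoncé Knowles' ]
console.log(searchIgnoringDiacritics(directory, 'ČAPEK'));   // [ 'Karel Čapek' ]
```

For a large corpus, it would probably make sense to store the folded form alongside each entry rather than recomputing it on every search.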

Metal Umlauts (or M͇ͭeţal Um͆l̼a͍u̓t̨s)

It’s a bit before my time, but one of my favorite television shows growing up (via re-runs and now streaming) is Mission: Impossible, in no small part because of the signage in their fictional foreign countries. Especially in earlier episodes, to make foreign countries seem both exotic and approachable to American audiences, show creator Bruce Geller had the idea of creating signs written mostly in English, but a version of English with clever misspellings representative of stereotypes of certain parts of the world, often including bogus diacritical marks.

For example, if you pay careful attention, you’ll easily spot both Zöna Restrik (for Restricted Area) and Prıziion Mılıtık (for Military Prison) in certain episodes.

And, of course, if you’re a heavy metal music fan, you’re undoubtedly familiar with the similar but distinct Metal Umlaut, though its use seems surprisingly limited to the diaeresis (¨) mark.

If we wanted to do something like transforming English text to "Gellerese"…well, you’re on your own figuring out how to change the base spelling in a reasonable way. But adding bogus diacritical marks? That, we can definitely do.

let output = '';
str = str.normalize('NFD');
for (let i = 0; i < str.length; i++) {
  const c = str[i];
  output += c;
  if (/[a-z]/i.test(c)) {
    // The math on the next line isn't necessary to the example;
    // I'll explain what it's for in the paragraph below.
    // 1 - Math.random() lies in (0, 1], so log2() never sees zero.
    // The result is 0 about half the time, 1 a quarter of the time, and so on.
    const rLen = Math.floor(-Math.log2(1 - Math.random()));
    for (let j = 0; j < rLen; j++) {
      // 0x70 possible values cover the whole 0x0300 to 0x036f range.
      const rCh = 0x0300 + Math.floor(Math.random() * 0x0070);
      output += String.fromCharCode(rCh);
    }
  }
}

Again, we normalize the input string. But instead of removing diacritical marks as we’ve been doing, here we visit each character and, if it’s a letter, pick a random-but-small number of diacritical marks to add. Using log2() biases the distribution towards the lower end, so we're more likely to get zero or one mark, but can potentially get more. We then select the marks themselves from that same 0x0300 to 0x036f range we previously needed to remove.

If desired, this can easily be made more “intelligent” with lists of diacritical marks that are more appropriate to that letter, so that you don’t end up with implausible combinations like what you see in the above section heading.
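One possible shape for that "more intelligent" version is sketched below. The plausibleMarks table is a small, hand-picked assumption of mine (nowhere near complete), and the random parameter exists only so that the function can be tested with a predictable stand-in for Math.random.

```javascript
// Assumed, hand-picked lists of combining marks that look plausible per letter.
const plausibleMarks = {
  a: ['\u0300', '\u0301', '\u0302', '\u0303', '\u0308', '\u030a'], // à á â ã ä å
  e: ['\u0300', '\u0301', '\u0302', '\u0308'],                     // è é ê ë
  o: ['\u0300', '\u0301', '\u0302', '\u0303', '\u0308'],           // ò ó ô õ ö
  u: ['\u0300', '\u0301', '\u0302', '\u0308'],                     // ù ú û ü
  c: ['\u0327', '\u030c'],                                         // ç č
  n: ['\u0303'],                                                   // ñ
};

// Hypothetical function: add at most one plausible mark to each eligible letter.
function addPlausibleMarks(text, random = Math.random) {
  let output = '';
  for (const ch of text.normalize('NFD')) {
    output += ch;
    const marks = plausibleMarks[ch.toLowerCase()];
    // Decorate roughly half of the letters that have a list of marks.
    if (marks && random() < 0.5) {
      output += marks[Math.floor(random() * marks.length)];
    }
  }
  // Recompose, so that letter-plus-mark pairs become single characters where possible.
  return output.normalize('NFC');
}

console.log(addPlausibleMarks('Mission control'));  // varies from run to run
console.log(addPlausibleMarks('can', () => 0));     // çàñ
```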

While this sounds like just a joke or a tool for fiction, I now sometimes use techniques like this to make sure that diacritical marks display properly after processing text. By generating them randomly, in bulk, and in ways not generally found in real text, I get a better sense of how bad a display might look.

In any case, it might be a decent idea to call output.normalize('NFC') at the end, to set the characters back to their “composed” forms. And when I say “decent idea,” I mean “probably not necessary, but nice for the sake of consistency.”
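As a quick illustration of what that final call does, the sketch below recomposes one ordinary pair and one unusual pair. As far as I know, Unicode has no precomposed character for M with the "combining equals sign below" (U+0347), so that pair stays as two code points.

```javascript
// A base letter plus a combining acute accent recomposes into one code point.
const decomposedE = 'e\u0301';
const recomposed = decomposedE.normalize('NFC');
console.log(recomposed === '\u00e9'); // true
console.log(recomposed.length);       // 1

// A pairing with no precomposed equivalent is left as letter plus mark.
const unusual = 'M\u0347'.normalize('NFC');
console.log(unusual.length);          // 2
```

So, for randomly decorated text, NFC only shortens the string where the random marks happen to form a real accented letter.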

Exception

One place where normalization has no effect is the Polish L-with-stroke (Ł or ł). It turns out that those are letters unto themselves, rather than letters with a diacritical mark. So, if you’re planning on using any of these techniques, you will want to take that into account, probably by replacing the character separately.
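A minimal sketch of that separate replacement follows. The specialLetters map is an assumption for illustration and only covers the Polish letter mentioned here; other letters that don't decompose would need their own entries.

```javascript
// Assumed, non-exhaustive map of letters that NFD leaves untouched.
const specialLetters = {
  '\u0141': 'L', // Ł
  '\u0142': 'l', // ł
};

// Hypothetical helper: strip combining marks, then substitute the special cases.
function stripWithExceptions(text) {
  return text
    .normalize('NFD')
    .replace(/[\u0300-\u036f]/g, '')
    .replace(/[\u0141\u0142]/g, (ch) => specialLetters[ch]);
}

console.log('\u0141'.normalize('NFD') === '\u0141'); // true: nothing to decompose
console.log(stripWithExceptions('Łódź'));            // Lodz
```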

Other (Programming) Languages

The above sample code snippets are all in JavaScript, but the Windows API supports NormalizeString() and .NET has supported String.Normalize() for quite some time. Ruby, similarly, supports string.unicode_normalize(). It shouldn’t be hard to find the equivalent for other languages, now that we know the key words to search for are “unicode normalize,” maybe throwing in “nfd” or “decomposed” to make the context clearer.

Happy…err, umlauting? Sure. Let’s go with that!


Credits: Untitled header photograph from PxHere, made available under the CC0 1.0 Universal Public Domain Dedication.
