Considerations for Unicode and Searching

Michael Chaney

Posted on July 4, 2024

In previous posts here and on Twitter (now X) I've written about using Unicode and UTF-8 in general application development. In this article I'm going to expand a bit on how to handle searching when Unicode is present.

To quickly summarize using Unicode in modern application development: it's probably something you won't even have to think about if you're using a modern high-level language. Strings are almost universally assumed to be in UTF-8 format, so everything just works. The biggest problem is importing data from outside - there's still a ton of code out there writing older 8-bit encodings (generally Latin-1/ISO-8859-1 in the Western world), so it's best to check your data and force it to UTF-8 before processing.
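If your data is already flowing through Postgres, you can do that conversion right in SQL. A minimal sketch, assuming a hypothetical Latin-1 byte string and a database with UTF8 encoding:

-- 0xE9 is "é" in Latin-1; convert_from() re-encodes it to the
-- database encoding (UTF-8 here), and raises an error if the bytes
-- aren't valid in the source encoding - a useful sanity check, too.
SELECT convert_from('\xe9'::bytea, 'LATIN1');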

With that out of the way, let's consider "Beyoncé". Why? For purposes of this discussion simply because that name has a Latin Small Letter E with Acute character - also known as "eacute" - in it.

michael-chaneys-computer-2:~ mdchaney$ hexdump -C
Beyoncé
00000000  42 65 79 6f 6e 63 c3 a9  0a                       |Beyonc...|

In this hex dump, "c3 a9" is the two-byte UTF-8 encoding for Unicode character U+00E9. In a text file, in this web site's database, etc., it will be stored as the two bytes that you see above.
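If you're wondering how U+00E9 turns into "c3 a9": code points from U+0080 to U+07FF get the two-byte pattern 110xxxxx 10xxxxxx, with the code point's 11 bits split across the x positions:

U+00E9 = 000 1110 1001        (11 bits, zero-padded)
       = 00011 | 101001       (split into 5 + 6 bits)
110 00011 = 0xC3              (first byte)
10 101001 = 0xA9              (second byte)

The trailing "0a" in the dump is just the newline from hitting enter.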

Here's the issue, though. If I'm searching for "Beyoncé" - how do I even get a funky accent thing above the "e"? I'll tell you how if you're on a Mac - press "option e", release both keys, then press the vowel over which you wish to place the acute accent. In other words, option-e, then e. Beyoncé.

So, that's it, right?

Of course not. Let's say I have a music database (I really do have a bunch of them) and Beyoncé is in there. I want people to be able to search for her name and find her music. The problem? I'll tell you what the problem is - a bunch of Neanderthals using the internet who don't know that little Mac trick that I learned 2 minutes ago for typing an e-with-an-acute-mark from my keyboard.

Okay, I'll confess. Even I would just type in "beyonce" if I were searching for her music. Not even a capital B, much less the funky acute mark over the e. Google has been so good for so long that we expect search to just work, because at Google it always has.

So, how do we handle this when we write the search code?

Well, what are we using for search?

I want to spend the bulk of this article talking about how to do this in Postgres, partly because it's a little more difficult there. But let me start with Apache Solr, which is where I first worked on these issues.

Apache Solr Configuration

Solr is based on the Lucene library, as is Elasticsearch. I mention this because these two search engines provide most of the private web site search capability on the internet. I started using Lucene around 25 years ago in Perl - it works well.

In Apache Solr, there's a filter that can be added to a search index's configuration called the "ASCIIFoldingFilter". The code is a sight to behold, as it contains the logic for turning "Beyoncé" into "Beyonce". There's another filter for case folding that further turns that into "beyonce".
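In the schema, that chain looks something like this - a minimal sketch of a field type, where the name "text_folded" is my own placeholder:

<fieldType name="text_folded" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- downcase first, then fold accented characters to plain ASCII -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
  </analyzer>
</fieldType>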

Go ahead and check out the source code linked above. I would direct your attention specifically to line 434, which matches "é", aka "latin small letter e with acute". If you scroll all the way to line 474 you'll find that "é", along with 40 other friends, gets mapped to a plain old "e".

You've probably caught on even though I haven't explicitly stated it yet: for purposes of searching, we have to get our text into the simplest form possible, both for storage and for querying. After we munge both sets of data in the same way, we end up with this being stored:

beyonce

And any of these terms matching:

Beyoncé
beyoncé
Beyonce
beyonce

Oddly enough, a bunch of things map to the plain old "e", so:

Beyoncǝ

also works. If you want to really get nuts, imagine all the things that map to any of those letters.

ⓑǝẙǾɳꜾᶔ

If you run that string through solr.ASCIIFoldingFilterFactory (with the lowercase filter ahead of it), you'll get "beyonce" out the other side.

I'm not saying that so that you can do it, although nobody's stopping you. The point is that many characters map to any single ASCII character in this scenario, and the folded text is stored as plain ASCII where possible.

There are plenty of Unicode characters that won't be changed to anything else, by the way. Emojis are a common group of characters that'll go straight through. Chinese characters (regardless of what the multi-lingual genius at the tattoo parlor tells you) don't actually map to ASCII characters and vice versa. Same with Korean, Thai, Arabic, etc. In fact, the vast majority of Unicode characters are untouched by ASCIIFoldingFilter.

But for standard text from the Western world (the Americas and Europe), the characters map down to the closest ASCII equivalent. Accents, umlauts, little circles - all of those get dropped for indexing.

Apache Solr is a great search engine and I've used it extensively for well over a decade. It handles all of this text munging extremely well. But I also use PostgreSQL and it has a very robust full-text search engine built right in that might be good for your project, especially if you're already using it as your RDBMS data store.

Configuring PostgreSQL For Search

I love Postgres and use it just about everywhere. But the full-text search functionality requires a little bit of work to handle text the way you want it to. Thankfully, the documentation is great, AI assistants such as ChatGPT and Claude seem to know everything about this subject, and Postgres itself comes with some awesome tools that make testing a breeze.

Solr comes with various filters that you can use to munge your input text. Postgres has fewer options available, but the default setup is pretty good for English text: a standard "stopword" list to keep common words out of your index, case folding, and word stemming.
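If you want to see which configurations your install ships with, psql's "\dF" command lists them (output abbreviated here; yours may vary by version):

img3_dev=> \dF
               List of text search configurations
   Schema   |    Name    |            Description             
------------+------------+------------------------------------
 pg_catalog | english    | configuration for english language
 pg_catalog | simple     | simple configuration
 ...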

Relevant to our discussion today, Postgres also comes with the "unaccent" dictionary to handle "Beyoncé", "mañana", and whatever else you might come up with. But it's not enabled by default, so you'll have to do a little work.

Let's first see how it handles all of this by default.

Postgres includes the excellent "ts_debug" function which allows you to see exactly what it thinks about your text.

img3_dev=> select * from ts_debug('Beyoncé is coming mañana');
   alias   |    description    |  token  |  dictionaries  |  dictionary  |  lexemes  
-----------+-------------------+---------+----------------+--------------+-----------
 word      | Word, all letters | Beyoncé | {english_stem} | english_stem | {beyoncé}
 blank     | Space symbols     |         | {}             |              | 
 asciiword | Word, all ASCII   | is      | {english_stem} | english_stem | {}
 blank     | Space symbols     |         | {}             |              | 
 asciiword | Word, all ASCII   | coming  | {english_stem} | english_stem | {come}
 blank     | Space symbols     |         | {}             |              | 
 word      | Word, all letters | mañana  | {english_stem} | english_stem | {mañana}

This shows the default parser in action. "is" is a stop word, so it gets kicked out. "coming" is run through the default stemmer and is normalized as "come". But "Beyoncé" and "mañana" pass through unchanged save for being downcased.

Now, one can make the case that letters like "é" and "ñ" don't occur in regular English, so this is the proper behavior. But I'm handling searches where names are a common element, and accented characters are much more common in names. Even if we restrict ourselves to the United States, where English is the most common language, Spanish is clearly the second most common, and French is quite common as well.

It's not just Beyoncé - Michael Bublé has an e-with-acute in his name as well. And many others.

Just searching Google for "singers with accents in their names" brings up a litany of issues from programmers handling this situation. Look at this one, which also seems to come down to a lack of Unicode normalization:

https://www.discogs.com/forum/thread/826212

Here's someone talking about this problem in 2009:

https://musicmachinery.com/2009/04/10/removing-accents-in-artist-names/

Anyway, the point is that even if you're handling straight English text, removing diacritics is important for name matching. And since English doesn't use diacritics, turning those letters into their plain ASCII equivalents won't hurt - they only show up in names and borrowed words.

We have to create the "unaccent" extension in our database, create the new text search configuration, and finally modify it to add "unaccent" to the mix.


img3_dev=> CREATE EXTENSION IF NOT EXISTS unaccent;
CREATE EXTENSION
img3_dev=> CREATE TEXT SEARCH CONFIGURATION music_search ( COPY = english );
CREATE TEXT SEARCH CONFIGURATION
img3_dev=> ALTER TEXT SEARCH CONFIGURATION music_search
  ALTER MAPPING FOR hword, hword_part, word
  WITH unaccent, english_stem;
ALTER TEXT SEARCH CONFIGURATION
img3_dev=> select * from ts_debug('music_search', 'Beyoncé is coming mañana');
   alias   |    description    |  token  |      dictionaries       |  dictionary  |  lexemes  
-----------+-------------------+---------+-------------------------+--------------+-----------
 word      | Word, all letters | Beyoncé | {unaccent,english_stem} | unaccent     | {Beyonce}
 blank     | Space symbols     |         | {}                      |              | 
 asciiword | Word, all ASCII   | is      | {english_stem}          | english_stem | {}
 blank     | Space symbols     |         | {}                      |              | 
 asciiword | Word, all ASCII   | coming  | {english_stem}          | english_stem | {come}
 blank     | Space symbols     |         | {}                      |              | 
 word      | Word, all letters | mañana  | {unaccent,english_stem} | unaccent     | {manana}

Note that ts_debug isn't perfect. If you run the same text through to_tsvector, you get a slightly different result:

img3_dev=> select to_tsvector('music_search', 'Beyoncé is coming mañana');
          to_tsvector           
--------------------------------
 'beyonc':1 'come':3 'manana':4

The stemming algorithm turns "beyonce" into "beyonc", and that's fine - search terms go through the same stemmer, so the words will match.
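You can see that for yourself on the query side - to_tsquery with the same configuration should produce the same lexeme:

img3_dev=> select to_tsquery('music_search', 'Beyoncé');
 to_tsquery 
------------
 'beyonc'
(1 row)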

To show you how this works, consider the following queries where I use the standard built-in "english" search configuration:

img3_dev=> select to_tsvector('english', 'Beyoncé is coming mañana') @@  
    to_tsquery('english', 'beyoncé');
 ?column? 
----------
 t
(1 row)

img3_dev=> select to_tsvector('english', 'Beyoncé is coming mañana') @@  
    to_tsquery('english', 'beyonce');
 ?column? 
----------
 f
(1 row)

If we search for "beyoncé", it's a match, but if we search for "beyonce" there's no match.

Switching to our new "music_search" configuration:

img3_dev=> select to_tsvector('music_search', 'Beyoncé is coming mañana') @@  
    to_tsquery('music_search', 'beyonce');
 ?column? 
----------
 t
(1 row)

img3_dev=> select to_tsvector('music_search', 'Beyoncé is coming mañana') @@  
    to_tsquery('music_search', 'beyoncé');
 ?column? 
----------
 t
(1 row)

There it is - works either way.
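To put this to work in an actual application, you'd typically store the computed tsvector and index it. Here's a minimal sketch, assuming a hypothetical "songs" table with "artist" and "title" columns:

-- A generated column (Postgres 12+) keeps the tsvector in sync; the
-- two-argument form of to_tsvector is immutable, so it's allowed here.
ALTER TABLE songs
  ADD COLUMN search_vector tsvector
  GENERATED ALWAYS AS (
    to_tsvector('music_search',
                coalesce(artist, '') || ' ' || coalesce(title, ''))
  ) STORED;

-- A GIN index makes the @@ searches fast.
CREATE INDEX songs_search_idx ON songs USING GIN (search_vector);

-- Searching then looks like this:
SELECT artist, title
  FROM songs
 WHERE search_vector @@ to_tsquery('music_search', 'beyonce');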

I would also note that the "unaccent" dictionary in Postgres has fewer mappings than the ASCIIFoldingFilter in Solr. But quite a few are still there:

img3_dev=> select to_tsvector('music_search', 'Beyoncé is coming mañana') @@ 
    to_tsquery('music_search', 'Ƃⅇẙỗɳcᶔ');
 ?column? 
----------
 t
(1 row)

If you want to see the dictionary, Postgres stores these files under your "share" directory - basically ".../share/postgresqlxx/tsearch_data", where "xx" is the version number. Depending on the installation, this will likely be under "/usr", but it might be under "/usr/local" or wherever Postgres is installed.

You can view the "unaccent.rules" file to see the list.
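If you'd rather not hunt for it, pg_config will print the share directory directly (assuming it's on your PATH) - on my machine it prints the same /opt/local path you'll see in the grep below:

michael-chaneys-computer-2:~ mdchaney$ pg_config --sharedir
/opt/local/share/postgresql16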

As an aside, if you want to gꝏf around (see what I did there?) with it and come up with weird ways to spell "Beyoncé" so it'll match, use "grep":

michael-chaneys-computer-2:dl5 mdchaney$ grep 'o$' /opt/local/share/postgresql16/tsearch_data/unaccent.rules 
ò  o
ó  o
ô  o
õ  o
ö  o
ø  o
ō  o
ŏ  o
ő  o
ơ  o
ǒ  o
ǫ  o
ǭ  o
ȍ  o
ȏ  o
ȫ  o
ȭ  o
ȯ  o
ȱ  o
ṍ o
ṏ o
ṑ o
ṓ o
ọ o
ỏ o
ố o
ồ o
ổ o
ỗ o
ộ o
ớ o
ờ o
ở o
ỡ o
ợ o
℅ c/o
№ No
ℴ o
ⱺ o
ꜵ ao
ꝋ o
ꝍ o
ꝏ oo
o o

That shows you all the characters that will map to "o". Do that for all the letters, pick your favorites, copy/paste - you get the idea.

Outro

There's a lot to configuring a full-text search application on Postgres, but if you're going to search names this is pretty much required. And it's not difficult.

Feel free to hit me up in comments here or on X (@MichaelDChaney) for any further ideas that you might need help with.
