5 ways to implement case-insensitive search in SQLite with full Unicode support

shallowdepth

Posted on January 23, 2022

Recently I needed a case-insensitive search in SQLite to check whether an item with the same name already exists in one of my projects – listOK. At first it looked like a simple task, but a deeper dive showed it was not simple at all, with many twists and turns.

Built-in SQLite capabilities and their drawbacks

In SQLite you can get a case-insensitive search in three ways:

-- 1. Use the NOCASE collation
-- (we will look at other ways of applying collations later):
SELECT * 
    FROM items 
    WHERE text = 'String in AnY case' COLLATE NOCASE;

-- 2. Normalize all strings to the same case,
-- it does not matter whether lower or upper:
SELECT * 
    FROM items 
    WHERE LOWER(text) = 'string in lower case';

-- 3. Use the LIKE operator, which is case-insensitive by default:
SELECT * 
    FROM items 
    WHERE text LIKE 'String in AnY case';

If you use SQLAlchemy and its ORM, these approaches look as follows:

from sqlalchemy import func
from sqlalchemy.orm.query import Query

from package.models import YourModel


text_to_find = "Text in AnY case"

# NOCASE collation
# NOCASE collation
Query(YourModel).filter(
    YourModel.field_name.collate("NOCASE") == text_to_find
)

# Normalizing text to the same case
Query(YourModel).filter(
    func.lower(YourModel.field_name) == text_to_find.lower()
)

# LIKE operator. No need to use SQLAlchemy's ilike
# since SQLite LIKE is already case-insensitive.
Query(YourModel).filter(YourModel.field_name.like(text_to_find))

None of these approaches is ideal. First, without special care they do not make use of indexes on the field they operate on, with LIKE being the worst offender: in most cases it cannot use indexes at all. More on indexes for case-insensitive queries below.

Second, and more importantly, they have a rather limited understanding of what case-insensitive means:

SQLite only understands upper/lower case for ASCII characters by default. The LIKE operator is case sensitive by default for unicode characters that are beyond the ASCII range. For example, the expression 'a' LIKE 'A' is TRUE but 'æ' LIKE 'Æ' is FALSE.
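
To see this limitation in action, here is a minimal sketch using Python's sqlite3 module and an in-memory database:

import sqlite3

conn = sqlite3.connect(":memory:")

# ASCII letters are matched case-insensitively...
print(conn.execute("SELECT 'a' LIKE 'A'").fetchone()[0])   # 1
# ...but characters outside the ASCII range are not
print(conn.execute("SELECT 'æ' LIKE 'Æ'").fetchone()[0])   # 0
# The built-in LOWER()/UPPER() also fold only ASCII characters
print(conn.execute("SELECT LOWER('Æ')").fetchone()[0])     # 'Æ'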

It is not a problem if you plan to work with strings that contain only English alphabet letters, numbers, etc. I needed the full Unicode spectrum, so a better solution was in order.


Below I summarize five ways to achieve case insensitive search/comparison in SQLite for all Unicode symbols. Some of these solutions can be adapted to other databases and for implementing Unicode-aware LIKE, REGEXP, MATCH, and other functions, although these topics are out of the scope of this post.

We will look at the pros and cons of each approach, implementation details, and, finally, at indexes and performance considerations.

Solutions

1. ICU extension

Official SQLite documentation mentions the ICU extension as a way to add complete support for Unicode in SQLite. ICU stands for International Components for Unicode.

ICU solves the problems of both case-insensitive LIKE and comparison/search, and adds support for different collations for good measure. It may even be faster than some of the solutions below, since it is written in C and is more tightly integrated with SQLite.

However, it comes with its challenges:

  1. It is a new type of dependency: not a Python library, but an extension that has to be distributed together with the application.

  2. ICU needs to be compiled before use, potentially for different operating systems and platforms (not tested).

  3. ICU does not implement Unicode conversions itself, but relies on the underlying operating system – I have seen multiple mentions of OS-specific issues, especially with Windows and macOS.


All other solutions will depend on your Python code to perform the comparison, so it is important to choose the right approach to converting and comparing strings.

Choosing the right Python function for case-insensitive comparison

To perform case-insensitive comparison and search we need to normalize strings to one case. My first instinct was to use str.lower() for this. It works in most circumstances, but it is not the proper way. It is better to use str.casefold() (docs):

Return a casefolded copy of the string. Casefolded strings may be used for caseless matching.

Casefolding is similar to lowercasing but more aggressive because it is intended to remove all case distinctions in a string. For example, the German lowercase letter 'ß' is equivalent to "ss". Since it is already lowercase, lower() would do nothing to 'ß'; casefold() converts it to "ss".

Therefore, below we will use the str.casefold() function for all conversions and comparisons.
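
A quick illustration of the difference:

print("Straße".lower())                             # 'straße' – lower() leaves 'ß' unchanged
print("Straße".casefold())                          # 'strasse'
print("Straße".casefold() == "STRASSE".casefold())  # True
print("Straße".lower() == "STRASSE".lower())        # False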

2. Application-defined collation

To perform a case-insensitive search for all Unicode symbols, we need to define a new collation in the application after connecting to the database (documentation). Here you have a choice – overload the built-in NOCASE or create your own – we will discuss the pros and cons below. For the sake of example, we will use a new name:

import sqlite3

import sqlalchemy as sa

# Custom collation: casefold both strings and compare them
def unicode_nocase_collation(a: str, b: str) -> int:
    a, b = a.casefold(), b.casefold()
    if a == b:
        return 0
    if a < b:
        return -1
    return 1

# Connect to the DB and register the collation
connection = sqlite3.connect("your_db_path")
connection.create_collation(
    "UNICODE_NOCASE", unicode_nocase_collation
)

# Or, if you use SQLAlchemy, register
# the collation via an event
@sa.event.listens_for(sa.engine.Engine, "connect")
def sqlite_engine_connect(connection, _):
    connection.create_collation(
        "UNICODE_NOCASE", unicode_nocase_collation
    )

Collations have several advantages compared to the solutions that follow:

  1. They are easy to use. You can specify collation in the table schema and it will be automatically applied to all queries and indexes on this field unless you specify otherwise:

    CREATE TABLE test (text VARCHAR COLLATE UNICODE_NOCASE);
    

    For the sake of completeness, let's look at two more ways to use collations:

    -- In a particular query:
    SELECT * FROM items
        WHERE text = 'Text in AnY case' COLLATE UNICODE_NOCASE;
    
    -- In an index:
    CREATE INDEX IF NOT EXISTS idx1 
        ON test (text COLLATE UNICODE_NOCASE);
    
    -- Word of caution: your query and index 
    -- must match exactly, including collation; 
    -- otherwise, SQLite will perform a full table scan
    -- (assuming the column itself does not declare the collation).
    -- More on indexes below.
    EXPLAIN QUERY PLAN
        SELECT * FROM test WHERE text = 'something';
    -- Output: SCAN TABLE test
    EXPLAIN QUERY PLAN
        SELECT * FROM test 
            WHERE text = 'something' COLLATE UNICODE_NOCASE;
    -- Output: SEARCH TABLE test USING COVERING INDEX idx1 (text=?)
    
  2. Collation provides case-insensitive sorting with ORDER BY out of the box – see the example below. It is especially easy to get if you define the collation in the table schema.
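
    For example, sorting is then case-insensitive without any extra clauses, or the collation can be requested explicitly:

    -- Uses UNICODE_NOCASE from the table schema:
    SELECT * FROM test ORDER BY text;

    -- Or specify the collation in the query:
    SELECT * FROM items ORDER BY text COLLATE UNICODE_NOCASE;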

Performance-wise collations have some peculiarities, which we will discuss further.

3. Application-defined SQL function

Another way to achieve case-insensitive search is to create an application-defined SQL function (documentation):

import sqlite3

import sqlalchemy as sa

# Custom function that casefolds a string
def casefold(s: str) -> str:
    return s.casefold()

# Connect to the DB and register the function
connection = sqlite3.connect("your_db_path")
connection.create_function("CASEFOLD", 1, casefold)

# Or, if you use SQLAlchemy, register
# the function via an event
@sa.event.listens_for(sa.engine.Engine, "connect")
def sqlite_engine_connect(connection, _):
    connection.create_function("CASEFOLD", 1, casefold)

In both cases create_function accepts up to four arguments:

  • name of the function as it will be used in the SQL queries
  • number of arguments the function accepts
  • the function itself
  • optional bool deterministic, default False (added in Python 3.8) – it is important for indexes, which we will discuss below.
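
Once registered, CASEFOLD can be used in queries like any built-in function. A quick sketch (the items table is the one from the earlier examples):

-- Case-insensitive comparison using the application-defined function
SELECT * FROM items
    WHERE CASEFOLD(text) = CASEFOLD('Text in AnY case');

-- Or casefold the value in the application and pass it as a parameter
SELECT * FROM items
    WHERE CASEFOLD(text) = :casefolded_text;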

As with collations, you have a choice – overload a built-in function (for instance, LOWER) or create a new one. We will look at this in more detail later.

4. Compare in the application

Another way to get a case-insensitive search is to compare in the application itself, especially if you can narrow down the search using an index on other fields. For instance, in listOK the case-insensitive comparison is needed for items within a particular list. Therefore, I can select all items in the list, normalize them to one case, and compare them with the normalized new item.
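
A minimal sketch of this approach (the items table with list_id and text columns is illustrative, not the actual listOK schema):

def item_exists(connection, list_id: int, new_text: str) -> bool:
    # Narrow the scope with a regular index on list_id,
    # then compare case-insensitively in the application
    rows = connection.execute(
        "SELECT text FROM items WHERE list_id = ?", (list_id,)
    )
    needle = new_text.casefold()
    return any(text.casefold() == needle for (text,) in rows)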

Depending on your circumstances this is not a bad solution, especially if the subset you will be comparing against is small. However, you will not be able to utilize database indexes on the text itself, only on the other parameters you use to narrow down the scope.

The advantage of this approach is its flexibility: in the application you can check not only equality but, for instance, implement "fuzzy" comparison to take into account possible misprints, singular/plural forms, etc. This is the route I chose for listOK since the bot needed fuzzy comparison for the "smart" item creation.

In addition, it eliminates any coupling with the database – it is simple storage that does not know anything about the data.

5. Store normalized field separately

There is one more solution: create a separate column in the database and store there the normalized text you will be searching on. For instance, the table may have this structure (only relevant fields):

| id | name                          | name_normalized               |
|----|-------------------------------|-------------------------------|
| 1  | Sentence capitalization       | sentence capitalization       |
| 2  | CAPITAL LETTERS               | capital letters               |
| 3  | Non-ASCII symbols: Найди Меня | non-ascii symbols: найди меня |

This may look excessive at first: you always need to keep the normalized version updated, and it effectively doubles the size of the name field. However, with an ORM, or even manually, it is easy to do, and disk space plus RAM are relatively cheap.
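
A sketch of how this might look with plain sqlite3 (table and column names follow the example above):

def add_item(connection, name: str) -> None:
    # Keep the normalized column in sync on every write
    connection.execute(
        "INSERT INTO items (name, name_normalized) VALUES (?, ?)",
        (name, name.casefold()),
    )

def find_items(connection, name: str) -> list:
    # A regular index on name_normalized turns this into an index search
    return connection.execute(
        "SELECT id, name FROM items WHERE name_normalized = ?",
        (name.casefold(),),
    ).fetchall()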

Advantages of this approach:

  • It completely decouples the application and the database – you can easily switch to another database.

  • You can pre-process the normalized field if your queries require it (trim, remove punctuation or spaces, etc.).

Should you overload built-in functions and collations?

When using application-defined SQL functions and collations you often have a choice: use a unique name or overload built-in functionality. Both approaches have their pros and cons in two main dimensions:

First, reliability/predictability when, for some reason (a one-off mistake, a bug, or intentionally), these functions or collations are not registered:

  • Overloading: the database will still work, but the results may not be correct:

    • the built-in function/collation will behave differently from its custom counterpart;
    • if you used the now-absent collation in an index, it will appear to work, but results may be wrong even when reading;
    • if a table whose index uses the custom function/collation gets updated, the index may get corrupted (updated using the built-in implementation) but continue working as if nothing happened.
  • Not overloading: the database will not work anywhere the absent functions or collations are used:

    • if you have an index on an absent function, you will be able to use it for reading, but not for updates;
    • indexes with an application-defined collation will not work at all, since the collation is used while searching the index.

Second, accessibility outside of the main application: migrations, analytics, etc.:

  • Overloading: you will be able to modify the database without problem, keeping in mind the risk of corrupting indexes.

  • Not overloading: in many cases you will need to register these functions or collations, or take extra steps to avoid the parts of the database that depend on them.
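
For reference, overloading is simply a matter of registering your implementations under the built-in names instead of new ones (a sketch reusing the collation and function defined earlier):

# Override the built-in NOCASE collation and LOWER function
connection.create_collation("NOCASE", unicode_nocase_collation)
connection.create_function("LOWER", 1, casefold, deterministic=True)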

If you decide to overload, it may be a good idea to rebuild the indexes based on custom functions or collations, in case wrong data has been recorded in them, for example:

-- Rebuild all indexes using this collation
REINDEX YOUR_COLLATION_NAME;

-- Rebuild particular index
REINDEX index_name;

-- Rebuild all indexes
REINDEX;

Performance of application-defined functions and collations

Custom functions and collations are much slower than built-in ones: SQLite calls back into your application every time it invokes the function. You can easily verify this by adding a global counter to the function:

counter = 0

def casefold(a: str):
    global counter
    counter += 1
    return a.casefold()

# Work with the database

print(counter)
# Number of times the function has been called

If you query rarely or your database is small, you will not see any meaningful difference. However, if you do not use an index on this function/collation, the database may perform a complete table scan, applying the function/collation to each row. Depending on the size of the table, the hardware, and the number of requests, the low performance may come as a surprise. Later I will publish a review of the performance of application-defined functions and collations.

Strictly speaking, collations are a bit slower than SQL functions, since for each comparison they need to casefold two strings instead of one. This difference is small, though: in my tests, the casefold function was faster than a similar collation by around 25%, which amounted to a difference of about 10 seconds over 100 million iterations.

Indexes and case-insensitive search

Indexes and functions

Let's start with the basics: if you define an index on a field, it will not be used in queries that apply a function to that field:

CREATE TABLE table_name (id INTEGER, name VARCHAR);
CREATE INDEX idx1 ON table_name (name);
EXPLAIN QUERY PLAN
    SELECT id, name FROM table_name WHERE LOWER(name) = 'test';
-- Output: SCAN TABLE table_name

For such queries you need a separate index built on the function expression itself:

CREATE INDEX idx1 ON table_name (LOWER(name));
EXPLAIN QUERY PLAN
    SELECT id, name 
        FROM table_name WHERE LOWER(name) = 'test';
-- Output: SEARCH TABLE table_name USING INDEX idx1 (<expr>=?)

In SQLite this can be done with a custom function as well, but the function must be marked as deterministic (meaning that given the same inputs it always returns the same result):

connection.create_function(
    "CASEFOLD", 1, casefold, deterministic=True
)

After that you can create an index on a custom SQL function:

CREATE INDEX idx1 
    ON table_name (CASEFOLD(name));
EXPLAIN QUERY PLAN
    SELECT id, name 
        FROM table_name WHERE CASEFOLD(name) = 'test';
-- Output: SEARCH TABLE table_name USING INDEX idx1 (<expr>=?)

Indexes and collations

The situation with collations and indexes is similar: for a query to utilize an index, both must use the same collation (implicit or explicitly specified); otherwise the index will not be used.

-- Table without specified collation will use BINARY
CREATE TABLE test (id INTEGER, text VARCHAR);

-- Create an index with a different collation
CREATE INDEX IF NOT EXISTS idx1 ON test (text COLLATE NOCASE);


-- Query will use default column collation -- BINARY
-- and the index will not be used
EXPLAIN QUERY PLAN
    SELECT * FROM test WHERE text = 'test';
-- Output: SCAN TABLE test


-- Now collations match and index is used
EXPLAIN QUERY PLAN
    SELECT * FROM test WHERE text = 'test' COLLATE NOCASE;
-- Output: SEARCH TABLE test USING INDEX idx1 (text=?)

As noted above, collation can be specified for a column in the table schema. This is the most convenient way – it will be applied to all queries and indexes on the respective field automatically unless you specify otherwise:

-- Using application defined collation UNICODE_NOCASE from above
CREATE TABLE test (text VARCHAR COLLATE UNICODE_NOCASE);

-- Index will be built using the collation
CREATE INDEX idx1 ON test (text);

-- Query will utilize index and collation automatically
EXPLAIN QUERY PLAN
    SELECT * FROM test WHERE text = 'something';
-- Output: SEARCH TABLE test USING COVERING INDEX idx1 (text=?)

Which solution to choose?

To choose a solution we need some criteria for comparison:

  1. Simplicity – how difficult it is to implement and maintain it

  2. Performance – how fast your queries will be

  3. Extra space – how much additional database space the solution requires

  4. Coupling – how much your solution intertwines the code and storage

| Solution                  | Simplicity                                                                                              | Performance (relative, without index) | Extra space | Coupling |
|---------------------------|---------------------------------------------------------------------------------------------------------|---------------------------------------|-------------|----------|
| ICU extension             | Difficult: requires a new type of dependency and compiling                                               | Medium to high                        | No          | Yes      |
| Custom collation          | Simple: allows setting the collation in the table schema so it applies automatically to any query on the field | Low                             | No          | Yes      |
| Custom SQL function       | Medium: requires either building an index based on it or using it in all relevant queries                | Low                                   | No          | Yes      |
| Comparing in the app      | Simple                                                                                                    | Depends on the use case               | No          | No       |
| Storing normalized string | Medium: you need to keep the normalized string updated                                                   | Low to medium                         | x2          | No       |

As usual, the choice of solution will depend on your use case and performance demands. Personally, I would go with either a custom collation, comparing in the app, or storing a normalized string. For example, in listOK I first used a collation and moved to comparing in the app when I added fuzzy search.
