5 ways to implement case-insensitive search in SQLite with full Unicode support
shallowdepth
Posted on January 23, 2022
Recently I needed a case-insensitive search in SQLite to check if an item with the same name already exists in one of my projects – listOK. At first, it looked like a simple task, but upon deeper dive, it turned out to be easy, but not simple at all, with many twists and turns.
Built-in SQLite capabilities and their drawbacks
In SQLite you can get a case-insensitive search in three ways:
-- 1. Use a NOCASE collation
-- (we will look at other ways for applying collations later):
SELECT *
FROM items
WHERE text = "String in AnY case" COLLATE NOCASE;
-- 2. Normalize all strings to the same case,
-- does not matter lower or upper:
SELECT *
FROM items
WHERE LOWER(text) = "string in lower case";
-- 3. Use LIKE operator which is case insensitive by default:
SELECT *
FROM items
WHERE text LIKE "String in AnY case";
If you use SQLAlchemy and its ORM these approaches will look as follows:
from sqlalchemy import func
from sqlalchemy.orm.query import Query
from package.models import YourModel
text_to_find = "Text in AnY case"
# NOCASE collation
Query(YourModel)
.filter(
YourModel.field_name.collate("NOCASE") == text_to_find
)
# Normalizing text to the same case
Query(YourModel)
.filter(
func.lower(YourModel.field_name) == text_to_find.lower()
).all()
# LIKE operator. No need to use SQLAlchemy's ilike
# since SQLite LIKE is already case-insensitive.
Query(YourModel)
.filter(YourModel.field_name.like(text_to_find))
All these approaches are not ideal. First, without special considerations they do not make use of indexes on the field they are working on, with LIKE
being the worst offender: in most cases it is incapable of using indexes. More on the use of indexes for case-insensitive queries is below.
Second, and more importantly, they have a rather limited understanding of what case-insensitive mean:
SQLite only understands upper/lower case for ASCII characters by default. The LIKE operator is case sensitive by default for unicode characters that are beyond the ASCII range. For example, the expression 'a' LIKE 'A' is TRUE but 'æ' LIKE 'Æ' is FALSE.
It is not a problem if you plan to work with strings that contain only English alphabet letters, numbers, etc. I needed the full Unicode spectrum, so a better solution was in order.
Below I summarize five ways to achieve case insensitive search/comparison in SQLite for all Unicode symbols. Some of these solutions can be adapted to other databases and for implementing Unicode-aware LIKE
, REGEXP
, MATCH
, and other functions, although these topics are out of the scope of this post.
We will look at the pros and cons of each approach, implementation details, and, finally, at indexes and performance considerations.
Solutions
1. ICU extension
Official SQLite documentation mentions the ICU extension as a way to add complete support for Unicode in SQLite. ICU stands for International Components for Unicode.
ICU solves the problems of both case-insensitive LIKE
and comparison/search, plus adds support for different collations for a good measure. It may even be faster than some of the later solutions since it is written in C and is more tightly integrated with SQLite.
However, it comes with its challenges:
It is a new type of dependency: not a Python library, but an extension that should be distributed together with the application.
ICU needs to be compiled before use, potentially for different OS and platforms (not tested).
ICU does not itself implement Unicode conversions, but relies on the underline operating system – I have seen multiple mentions of OS-specific issues, especially with Windows and macOS.
All other solutions will depend on your Python code to perform the comparison, so it is important to choose the right approach to converting and comparing strings.
Choosing the right python function for case-insensitive comparison
To perform case-insensitive comparison and search we need to normalize strings to one case. My first instinct was to use str.lower()
for this. It will work in most circumstances, but it is not the proper way. Better to use str.casefold()
(docs):
Return a casefolded copy of the string. Casefolded strings may be used for caseless matching.
Casefolding is similar to lowercasing but more aggressive because it is intended to remove all case distinctions in a string. For example, the German lowercase letter 'ß' is equivalent to "ss". Since it is already lowercase,
lower()
would do nothing to 'ß';casefold()
converts it to "ss".
Therefore, below we will use the str.casefold()
function for all conversions and comparisons.
2. Application-defined collation
To perform a case-insensitive search for all Unicode symbols we need to define a new collation in the application after connecting to the database (documentation). Here you have a choice – overload the built-in NOCASE
or create your own – we will discuss the pros and cons below. For the sake of an example we will use a new name:
import sqlite3
# Custom collation, maybe it is more efficient
# to store strings
def unicode_nocase_collation(a: str, b: str):
if a.casefold() == b.casefold():
return 0
if a.casefold() < b.casefold():
return -1
return 1
connection.create_collation(
"UNICODE_NOCASE", unicode_nocase_collation
)
# Connect to the DB and register the function
connection = sqlite3.connect("your_db_path")
connection.create_collation(
"UNICODE_NOCASE", unicode_nocase_collation
)
# Or, if you use SQLAlchemy you need to register
# the collation via an event
@sa.event.listens_for(sa.engine.Engine, 'connect')
def sqlite_engine_connect(connection, _):
connection.create_collation(
"UNICODE_NOCASE", unicode_nocase_collation
)
Collations have several advantages compared to the next solutions:
-
They are easy to use. You can specify collation in the table schema and it will be automatically applied to all queries and indexes on this field unless you specify otherwise:
CREATE TABLE test (text VARCHAR COLLATE UNICODE_NOCASE);
For the sake of completeness, let's look at two more ways to use collations:
-- In a particular query: SELECT * FROM items WHERE text = "Text in AnY case" COLLATE UNICODE_NOCASE; -- In an index: CREATE INDEX IF NOT EXISTS idx1 ON test (text COLLATE UNICODE_NOCASE); -- Word of caution: your query and index -- must match exactly,including collation, -- otherwise, SQLite will perform a full table scan. -- More on indexes below. EXPLAIN QUERY PLAN SELECT * FROM test WHERE text = 'something'; -- Output: SCAN TABLE test EXPLAIN QUERY PLAN SELECT * FROM test WHERE text = 'something' COLLATE NOCASE; -- Output: SEARCH TABLE test USING COVERING INDEX idx1 (text=?)
Collation provides case-insensitive sorting with
ORDER BY
out of the box. It is especially easy to get if you define the collation in the table schema.
Performance-wise collations have some peculiarities, which we will discuss further.
3. Application-defined SQL function
Another way to achieve case-insensitive search is to create an application-defined SQL function (documentation):
import sqlite3
# Custom function
def casefold(s: str):
return s.casefold()
# Connect to the DB and register the function
connection = sqlite3.connect("your_db_path")
connection.create_function("CASEFOLD", 1, casefold)
# Or, if you use SQLAlchemy you need to register
# the function via an event
@sa.event.listens_for(sa.engine.Engine, 'connect')
def sqlite_engine_connect(connection, _):
connection.create_function("CASEFOLD", 1, casefold)
In both cases create_function
accepts up to four arguments:
- name of the function as it will be used in the SQL queries
- number of arguments the function accepts
- the function itself
- optional bool
deterministic
, defaultFalse
(added in Python 3.8) – it is important for indexes, which we will discuss below.
As with collations, you have a choice – overload built-in function (for instance, LOWER
) or create new. We will look at it in more detail later.
4. Compare in the application
Another way of case-insensitive search would be comparing in the app itself, especially if you could narrow down the search by using an index on other fields. For instance, in listOK case-insensitive comparison is needed for items in a particular list. Therefore, I could select all items in the list, normalize them to one case and compare them with the normalized new item.
Depending on your circumstances it is not a bad solution, especially if the subset you will be comparing with is small. However, you will not be able to utilize database indexes on the text, only on other parameters you will be using to narrow down the scope.
The advantage of this approach is its flexibility: in the application you can check not only equality but, for instance, implement "fuzzy" comparison to take into account possible misprints, singular/plural forms, etc. This is the route I chose for listOK since the bot needed fuzzy comparison for the "smart" item creation.
In addition, it eliminates any coupling with the database – it is simple storage that does not know anything about the data.
5. Store normalized field separately
There is one more solution: create a separate column in the database and keep there normalized text you will be searching on. For instance, the table may have this structure (only relevant fields):
id | name | name_normalized |
---|---|---|
1 | Sentence capitalization | sentence capitalization |
2 | CAPITAL LETTERS | capital letters |
3 | Non-ASCII symbols: Найди Меня | non-ascii symbols: найди меня |
This may look excessive at first: you always need to keep the normalized version updated and effectively doubling the size of the name
field. However, with ORMs or even manually it is easy to do and the disk space plus RAM is relativity cheap.
Advantages of this approach:
It completely decouples the application and database – you can easily switch.
You can pre-process normalized filed if your queries require it (trim, remove punctuation or spaces, etc.).
Should you overload built-in functions and collations?
When using application-defined SQL functions and collations you often have a choice: use a unique name or overload built-in functionality. Both approaches have their pros and cons in two main dimensions:
First, reliability/predictability when for some reason (a one-off mistake, bug, or intentionally) you do not register these functions or collations:
-
Overloading: the database will still work, but the results may not be correct:
- the built-in function/collation will behave differently than their custom counterparts;
- if you used now absent collation in an index it will appear to work, but results may be wrong even when reading;
- if the table with index and index using custom function/collation gets updated, the index may get corrupted (updated using built-in implementation), but continue working as if nothing happened.
-
Not overloading: the database will not work in any respects where the absent functions or collations are used:
- if you use an index on an absent function you will be able to use it for reading, but not for updates;
- indexes with application-defined collation will not work at all, since they use the collation while searching in the index.
Second, accessibility outside of the main application: migrations, analytics, etc.:
Overloading: you will be able to modify the database without problem, keeping in mind the risk of corrupting indexes.
Not overloading: in many cases you will need to register these functions or collations or take extra steps to avoid parts of the database that depend on it.
If you decide to overload, it may be a good idea to rebuild indexes based on custom functions or collations in case they get wrong data recorded there, for example:
-- Rebuild all indexes using this collation
REINDEX YOUR_COLLATION_NAME;
-- Rebuild particular index
REINDEX index_name;
-- Rebuild all indexes
REINDEX;
Performance of application-defined functions and collations
Custom functions or collation are much slower than built-in functions: SQLite "returns" to your application each time it calls the function. You can easily check it by adding a global counter to the function:
counter = 0
def casefold(a: str):
global counter
counter += 1
return a.casefold()
# Work with the database
print(counter)
# Number of times the function has been called
If you are querying rarely or your database is small, you will not see any meaningful difference. However, if you do not use an index on this function/collation, the database may perform a complete table scan applying the function/collation on each row. Depending on the size of the table, hardware, and the number of requests, the low performance may be surprising. Later I will publish a review of application-defined functions and collations performance.
Strictly speaking, collations are a bit slower than SQL functions since for each comparison they need to casefold two strings, instead of one. Although this difference is very small: in my tests, the casefold function was faster than similar collation for around 25% which amounted to a difference of 10 seconds after 100 million iterations.
Indexes and case-insensitive search
Indexes and functions
Let's start with the basics: if you define an index on any field, it will not be used in queries on a function applied to this field:
CREATE TABLE table_name (id INTEGER, name VARCHAR);
CREATE INDEX idx1 ON table_name (name);
EXPLAIN QUERY PLAN
SELECT id, name FROM table_name WHERE LOWER(name) = 'test';
-- Output: SCAN TABLE table_name
For such queries you need a separate index with the function itself:
CREATE INDEX idx1 ON table_name (LOWER(name));
EXPLAIN QUERY PLAN
SELECT id, name
FROM table_name WHERE LOWER(name) = 'test';
-- Output: SEARCH TABLE table_name USING INDEX idx1 (<expr>=?)
In SQLite, it can be done on a custom function as well, but it must be marked as deterministic (meaning that with the same inputs it returns the same result):
connection.create_function(
"CASEFOLD", 1, casefold, deterministic=True
)
After that you can create an index on a custom SQL function:
CREATE INDEX idx1
ON table_name (CASEFOLD(name));
EXPLAIN QUERY PLAN
SELECT id, name
FROM table_name WHERE CASEFOLD(name) = 'test';
-- Output: SEARCH TABLE table_name USING INDEX idx1 (<expr>=?)
Indexes and collations
The situation with collations and indexes is similar: for a query to utilize an index they must use the same collation (implied or provided expressly), otherwise, it will not work.
-- Table without specified collation will use BINARY
CREATE TABLE test (id INTEGER, text VARCHAR);
-- Create an index with a different collation
CREATE INDEX IF NOT EXISTS idx1 ON test (text COLLATE NOCASE);
-- Query will use default column collation -- BINARY
-- and the index will not be used
EXPLAIN QUERY PLAN
SELECT * FROM test WHERE text = 'test';
-- Output: SCAN TABLE test
-- Now collations match and index is used
EXPLAIN QUERY PLAN
SELECT * FROM test WHERE text = 'test' COLLATE NOCASE;
-- Output: SEARCH TABLE test USING INDEX idx1 (text=?)
As noted above, collation can be specified for a column in the table schema. This is the most convenient way – it will be applied to all queries and indexes on the respective field automatically unless you specify otherwise:
-- Using application defined collation UNICODE_NOCASE from above
CREATE TABLE test (text VARCHAR COLLATE UNICODE_NOCASE);
-- Index will be built using the collation
CREATE INDEX idx1 ON test (text);
-- Query will utilize index and collation automatically
EXPLAIN QUERY PLAN
SELECT * FROM test WHERE text = 'something';
-- Output: SEARCH TABLE test USING COVERING INDEX idx1 (text=?)
Which solution to choose?
To choose a solution we need some criteria for comparison:
Simplicity – how difficult it is to implement and maintain it
Performance – how fast your queries will be
Extra space – how much additional database space the solution requires
Coupling – how much your solution intertwines the code and storage
Solution | Simplicity | Performance (relative, without index) | Extra space | Coupling |
---|---|---|---|---|
ICU extension | Difficult: requires a new type of dependency and compiling | Medium to high | No | Yes |
Custom collation | Simple: allows to set collation in the table schema and apply it automatically to any query on the field | Low | No | Yes |
Custom SQL function | Medium: requires either building an index based on it or using in all relevant queries | Low | No | Yes |
Comparing in the app | Simple | Depends on the use case | No | No |
Storing normalized string | Medium: you need to keep the normalized string updated | Low to Medium | x2 | No |
As usual, the choice of the solution will depend on your use case and performance demands. Personally, I would go either with custom collation, comparing in the app, or storing a normalized string. For example, in listOK, I first used a collation and moved to comparing in the app when added fuzzy search.
Posted on January 23, 2022
Join Our Newsletter. No Spam, Only the good stuff.
Sign up to receive the latest update from our blog.
Related
January 23, 2022