Code Optimization: Filtering dataframes using exact matches in multiple columns
Maurice Borgmeier
Posted on November 17, 2023
Filtering medium to large amounts of data to extract a relevant subset is a very common task in any data-related project. Often we do this on the basis of pandas dataframes. In this post, I want to compare some filtering options for exact matches across multiple columns.
The idea is pretty simple. We have a dataframe with multiple columns and rows as well as a list of conditions by which we want to extract data from it. Let's take this table as an example:
| job_id | started_by | job_status |
|---|---|---|
| 1 | Jim | DONE |
| 2 | Jim | FAILED |
| 3 | Dwight | DONE |
| 4 | Dwight | PENDING |
| 5 | Pam | DONE |
| 6 | Pam | PENDING |
Suppose we want to get all jobs that fulfill any of these criteria:

- Started by `Pam` and in status `PENDING`
- Started by `Dwight` and in status `DONE`

We'd expect the jobs with IDs 3 and 6 to be returned.
| job_id | started_by | job_status |
|---|---|---|
| 3 | Dwight | DONE |
| 6 | Pam | PENDING |
For our example we'll assume that we always filter based on equality, i.e. exact matches. In the real world things can be more complex, but I'm not interested in arbitrary queries here. I want to find out how to do this quickly using pandas. Of course in the actual project I want to use this in, we're talking about thousands of filter conditions and millions of rows, so this is more or less a toy example, but it will allow us to verify that a given solution is correct.
Here's the example as a Python test function:
```python
import functools
import random
import typing
from time import perf_counter

import pandas as pd
from pandas.testing import assert_frame_equal


def check_that_the_implementation_is_correct(
    implementation: typing.Callable[[pd.DataFrame, list[dict]], pd.DataFrame]
):
    # Arrange
    df_input = pd.DataFrame(
        {
            "job_id": [1, 2, 3, 4, 5, 6],
            "started_by": ["Jim", "Jim", "Dwight", "Dwight", "Pam", "Pam"],
            "job_status": ["DONE", "FAILED", "DONE", "PENDING", "DONE", "PENDING"],
        }
    )
    filter_conditions = [
        {"started_by": "Pam", "job_status": "PENDING"},
        {"started_by": "Dwight", "job_status": "DONE"},
    ]
    expected_df = pd.DataFrame(
        {
            "job_id": [3, 6],
            "started_by": ["Dwight", "Pam"],
            "job_status": ["DONE", "PENDING"],
        }
    )

    # Act
    actual_df = implementation(df_input, filter_conditions)

    # Assert
    actual_df = actual_df.sort_values(
        ["job_id", "started_by", "job_status"]
    ).reset_index(drop=True)
    expected_df = expected_df.sort_values(
        ["job_id", "started_by", "job_status"]
    ).reset_index(drop=True)
    assert_frame_equal(actual_df, expected_df)
    print(implementation.__name__, "is correct")
```
This function takes another function as a parameter which will receive a dataframe as well as a list of filter conditions to apply to it. I don't particularly care about the indexes and ordering in the result, so I sort everything and drop the index before comparing the actual to the expected result.
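To illustrate how the harness is used, here is a deliberately naive row-loop implementation (my own addition for illustration, not one of the benchmarked variants) that it should accept:

```python
import pandas as pd

# Hypothetical reference implementation: iterate over rows and keep those
# matching any condition. Far too slow for real data, but obviously correct,
# which makes it handy for sanity-checking the harness itself.
def naive_row_loop(df: pd.DataFrame, conditions: list[dict]) -> pd.DataFrame:
    matching_rows = [
        row
        for _, row in df.iterrows()
        if any(
            all(row[col] == value for col, value in condition.items())
            for condition in conditions
        )
    ]
    return pd.DataFrame(matching_rows)


# check_that_the_implementation_is_correct(naive_row_loop)
# -> naive_row_loop is correct
```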
Now that we're able to verify an implementation, let's create a function that can generate larger samples. Here, we don't check the correctness of the solution, we just want to see how the implementation performs with larger amounts of data.
```python
@functools.lru_cache()
def generate_sample_data(
    rows=1000, columns_in_dataframe=3, columns_in_condition=2, conditions=30
) -> tuple[pd.DataFrame, list[dict]]:
    random_range = list(range(50))
    assert columns_in_dataframe > columns_in_condition

    df_rows = []
    for _ in range(rows):
        row_dict = {}
        for col in range(columns_in_dataframe):
            row_dict[f"col_{col}"] = random.choice(random_range)
        df_rows.append(row_dict)
    df = pd.DataFrame(df_rows)

    conditions_list = []
    for _ in range(conditions):
        c = {}
        for num in range(columns_in_condition):
            c[f"col_{num}"] = random.choice(random_range)
        conditions_list.append(c)

    return df, conditions_list
```
As you can see, the function is wrapped with the `functools.lru_cache` decorator, which applies memoization: repeated calls with the same parameters are answered from a cache. This makes the benchmarks faster and ensures that each implementation is tested against the same sample data.
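The memoization behavior is easy to demonstrate in isolation. A minimal sketch with a made-up `slow_square` function standing in for the expensive sample generation (this works here because all the arguments are hashable ints):

```python
import functools
import time

# Hypothetical stand-in for an expensive function like generate_sample_data.
@functools.lru_cache()
def slow_square(n: int) -> int:
    time.sleep(0.1)  # simulate expensive work
    return n * n

slow_square(12)  # first call: actually computed
slow_square(12)  # second call: answered from the cache

print(slow_square.cache_info())  # hits=1, misses=1
```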
To complete our setup, we define a few benchmarking configurations as well as functions to run the benchmarks and measure execution times.
```python
benchmark_configs = [
    {"rows": 1_000, "conditions": 100},
    {"rows": 1_000_000, "conditions": 980},
    {"rows": 1_000_000, "conditions": 1000},
    {"rows": 10_000_000, "conditions": 10_000},
]

for benchmark_config in benchmark_configs:
    # Prewarm benchmark configurations
    _ = generate_sample_data(**benchmark_config)


def benchmark_implementation(
    implementation: typing.Callable[[pd.DataFrame, list[dict]], pd.DataFrame],
    **kwargs,
):
    sample_df, sample_conditions = generate_sample_data(**kwargs)
    start = perf_counter()
    implementation(sample_df, sample_conditions)
    end = perf_counter()
    print(f"Implementation {implementation.__name__} took {end - start:.04f}s")


def benchmark_scenarios(
    implementation: typing.Callable[[pd.DataFrame, list[dict]], pd.DataFrame],
    scenarios: list[dict],
):
    check_that_the_implementation_is_correct(implementation)
    for scenario in scenarios:
        scenario_desc = ", ".join(f"{key}={value}" for key, value in scenario.items())
        print(f"Testing {implementation.__name__} with {scenario_desc}")
        benchmark_implementation(implementation, **scenario)
```
We'll benchmark the implementations with four scenarios:
| rows | conditions |
|---|---|
| 1,000 | 100 |
| 1,000,000 | 980 |
| 1,000,000 | 1,000 |
| 10,000,000 | 10,000 |
Looking at this, you might wonder about the 980 conditions; you'll soon see why we picked that particular number. The code validates each implementation against our example from above and afterwards measures how long it takes to compute the result of each scenario.
Let's move on to the first implementation. Pandas has a neat `query` feature that allows us to filter a dataframe based on a textual query. If you're familiar with SQL, you'll be very tempted to use this. The query for our example above would look as follows:
```
(`started_by` == 'Pam' & `job_status` == 'PENDING') | (`started_by` == 'Dwight' & `job_status` == 'DONE')
```
Pretty ugly in my opinion, but we can implement a solution that's able to build queries such as this one:
```python
def pandas_query_based(df: pd.DataFrame, conditions: list[dict]) -> pd.DataFrame:
    # Build the query string: AND within each condition, OR across conditions
    inner_and_conditions = [
        "("
        + " & ".join(f"`{key}` == {repr(value)}" for key, value in condition.items())
        + ")"
        for condition in conditions
    ]
    query_string = " | ".join(inner_and_conditions)
    return df.query(query_string).copy()


benchmark_scenarios(pandas_query_based, benchmark_configs[:2])
```
Two things should jump out at you:
- It looks like gibberish
- It seems like only the first two benchmark scenarios are being executed
If you spotted both things, good job - have a cookie. Running this code gives us these performance numbers:
```
pandas_query_based is correct
Testing pandas_query_based with rows=1000, conditions=100
Implementation pandas_query_based took 0.0646s
Testing pandas_query_based with rows=1000000, conditions=980
Implementation pandas_query_based took 23.0720s
```
23 seconds for a million rows with 980 conditions doesn't seem terrible... yet. The real problem comes when we increase the number of conditions, because the implementation uses some form of recursion under the hood. If I run the third scenario with 1,000 instead of 980 conditions, we get a nasty maximum-recursion-depth error:
```python
# Make it go boom 💥
benchmark_scenarios(pandas_query_based, [benchmark_configs[2]])
# -> RecursionError: maximum recursion depth exceeded
```
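In principle, Python's recursion limit can be raised, since this is a Python-level limit rather than a pandas one. A sketch of that workaround, which I'd consider fragile rather than a fix:

```python
import sys

# Fragile workaround, not a fix: raising the limit lets deeper parser
# recursion complete, but deep recursion consumes more C stack and can
# crash the interpreter outright instead of raising a catchable error.
default_limit = sys.getrecursionlimit()  # typically 1000
sys.setrecursionlimit(default_limit * 10)
try:
    pass  # run the query-based implementation here
finally:
    sys.setrecursionlimit(default_limit)  # always restore the default
```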
This error message is the reason I'm writing this post. We ran into it in a project once we increased the dataset size, and that prompted me to look deeper. Great, now we've found a solution that only works for small to medium-sized datasets. Can we do better?
A relatively fast way to access data in a dataframe uses boolean indexing. This means you extract data from a dataframe by giving it a series that contains True or False for each row. You can build this series from column-wise comparisons and combine the results using the logical `&` and `|` operators. An implementation of that looks a bit nasty as well:
```python
def pandas_through_boolean_indexing(df: pd.DataFrame, conditions: list[dict]) -> pd.DataFrame:
    at_least_one_condition_matches = None
    for condition in conditions:
        condition_matches = None
        for col_name, col_value in condition.items():
            is_match = df[col_name] == col_value
            if condition_matches is None:
                condition_matches = is_match
            else:
                condition_matches = condition_matches & is_match
        if at_least_one_condition_matches is None:
            at_least_one_condition_matches = condition_matches
        else:
            at_least_one_condition_matches = at_least_one_condition_matches | condition_matches
    return df[at_least_one_condition_matches].copy()


benchmark_scenarios(pandas_through_boolean_indexing, benchmark_configs)
```
Nested loops and a bunch of if/else statements don't immediately scream "Look at me! I'm superfast!" at you, but you can see that the call to `benchmark_scenarios` uses all of our benchmark configurations, so it must be doing a bit better. Here are the results:
```
pandas_through_boolean_indexing is correct
Testing pandas_through_boolean_indexing with rows=1000, conditions=100
Implementation pandas_through_boolean_indexing took 0.0073s
Testing pandas_through_boolean_indexing with rows=1000000, conditions=980
Implementation pandas_through_boolean_indexing took 0.8934s
Testing pandas_through_boolean_indexing with rows=1000000, conditions=1000
Implementation pandas_through_boolean_indexing took 0.9207s
Testing pandas_through_boolean_indexing with rows=10000000, conditions=10000
Implementation pandas_through_boolean_indexing took 99.0261s
```
Compared to the initial query-based implementation, this one handles all scenarios, and for the scenarios the first one solved, it's 8-25 times faster. For this speed-up we may tolerate slightly nasty code that requires some more thinking to understand.

We can see, however, that the 10 million rows / 10,000 conditions scenario took almost 100 seconds to compute. This is bad news, because that's much closer to the numbers this implementation will have to deal with in the real world. Let's try again.
If we assume that all conditions contain the same set of columns, we can reframe our problem. Luckily, that is the case for us here: in the toy example, all conditions are based on the `started_by` and `job_status` columns, and the same holds in the real-world project where I need this.

This allows us to treat our problem as an inner join. Let me show you the implementation, because after seeing the first ones, this is a sight for sore eyes.
```python
def pandas_through_merge(df: pd.DataFrame, conditions: list[dict]) -> pd.DataFrame:
    """
    NOTE: This requires all conditions to filter on the same set of columns.
    """
    conditions_df = pd.DataFrame(conditions)
    df = df.merge(
        right=conditions_df,
        how="inner",
        on=list(conditions_df.columns),
    )
    return df


benchmark_scenarios(pandas_through_merge, benchmark_configs)
```
As you can see, two operations suffice. We cast our conditions into a dataframe, perform an inner merge (the pandas equivalent of a SQL inner join, or a set intersection on these columns) with the input dataframe, and return the result. Let's see if the benchmarker likes this:
```
pandas_through_merge is correct
Testing pandas_through_merge with rows=1000, conditions=100
Implementation pandas_through_merge took 0.0016s
Testing pandas_through_merge with rows=1000000, conditions=980
Implementation pandas_through_merge took 0.0356s
Testing pandas_through_merge with rows=1000000, conditions=1000
Implementation pandas_through_merge took 0.0289s
Testing pandas_through_merge with rows=10000000, conditions=10000
Implementation pandas_through_merge took 1.4272s
```
A colleague of mine would call this blazing fast. If we compare the second benchmark to the original implementation's result, this one is 640+ times faster. Comparing the largest benchmarks, the merge-based implementation is almost 70 times faster than the boolean-indexing one.
I should emphasize once again that the last implementation only works if all conditions use the same columns; the second implementation works even with mixed columns in the conditions.
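A second caveat worth hedging: an inner merge is a many-to-many join, so if the conditions list contains the same condition twice, every matching row is duplicated in the output (and randomly generated conditions like our benchmark's can easily repeat). A small sketch with made-up data showing the effect and a `drop_duplicates` guard:

```python
import pandas as pd

df = pd.DataFrame({"job_id": [1, 2], "started_by": ["Pam", "Jim"]})
# The same condition appears twice, e.g. from generated condition lists:
conditions_df = pd.DataFrame([{"started_by": "Pam"}, {"started_by": "Pam"}])

naive = df.merge(conditions_df, how="inner", on=["started_by"])
deduped = df.merge(conditions_df.drop_duplicates(), how="inner", on=["started_by"])

print(len(naive))    # 2: job 1 appears once per duplicate condition
print(len(deduped))  # 1: each matching row appears exactly once
```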
In conclusion, friends don't let friends use query once you're handling more than a few hundred rows.
— Maurice
Versions:
- python: 3.11.5
- pandas: 2.1.1
Benchmarks executed on a M1 Pro MacBook.
Title Image by Tyler Nix on Unsplash