Code Optimization: Filtering dataframes using exact matches in multiple columns

Maurice Borgmeier

Posted on November 17, 2023

Filtering medium to large amounts of data to extract a relevant subset is a very common task in any data-related project, and we often do it with pandas dataframes. In this post, I want to compare some filtering options for exact matches across multiple columns.

The idea is pretty simple. We have a dataframe with multiple columns and rows as well as a list of conditions by which we want to extract data from it. Let's take this table as an example:

job_id  started_by  job_status
1       Jim         DONE
2       Jim         FAILED
3       Dwight      DONE
4       Dwight      PENDING
5       Pam         DONE
6       Pam         PENDING

Suppose we want to get all jobs that fulfill any of these criteria:

  1. Started by Pam and in status PENDING
  2. Started by Dwight and in status DONE

We'd expect the jobs with the IDs 3 and 6 to be returned.

job_id  started_by  job_status
3       Dwight      DONE
6       Pam         PENDING

For our example, we'll assume that we always filter based on equality, i.e., exact matches. In the real world things can be more complex, but I'm not interested in arbitrary queries here; I want to find out how to do this quickly using pandas. In the actual project this is meant for, we're talking about thousands of filter conditions and millions of rows, so this is more or less a toy example, but it will allow us to verify that a given solution is correct.

Here's the example as a Python test function:

import functools
import random
import typing

from time import perf_counter

import pandas as pd
from pandas.testing import assert_frame_equal

def check_that_the_implementation_is_correct(
    implementation: typing.Callable[[pd.DataFrame, list[dict]], pd.DataFrame]
):
    # Arrange
    df_input = pd.DataFrame(
        {
            "job_id": [1, 2, 3, 4, 5, 6],
            "started_by": ["Jim", "Jim", "Dwight", "Dwight", "Pam", "Pam"],
            "job_status": ["DONE", "FAILED", "DONE", "PENDING", "DONE", "PENDING"],
        }
    )

    filter_conditions = [
        {"started_by": "Pam", "job_status": "PENDING"},
        {"started_by": "Dwight", "job_status": "DONE"},
    ]

    expected_df = pd.DataFrame(
        {
            "job_id": [3, 6],
            "started_by": ["Dwight", "Pam"],
            "job_status": ["DONE", "PENDING"],
        }
    )

    # Act

    actual_df = implementation(df_input, filter_conditions)

    # Assert

    actual_df = actual_df.sort_values(["job_id", "started_by", "job_status"]).reset_index(drop=True)
    expected_df = expected_df.sort_values(["job_id", "started_by", "job_status"]).reset_index(drop=True)

    assert_frame_equal(actual_df, expected_df)
    print(implementation.__name__, "is correct")

This function takes another function as a parameter, which in turn receives a dataframe as well as a list of filter conditions to apply to it. I don't particularly care about the index or the row order in the result, so I sort everything and drop the index before comparing the actual result to the expected one.

Now that we're able to verify an implementation, let's create a function that can generate larger samples. Here, we don't check the correctness of the solution, we just want to see how the implementation performs with larger amounts of data.

@functools.lru_cache()
def generate_sample_data(
    rows=1000,
    columns_in_dataframe=3,
    columns_in_condition=2,
    conditions=30,
) -> tuple[pd.DataFrame, list[dict]]:

    random_range = list(range(50))
    assert columns_in_dataframe > columns_in_condition

    df_rows = []

    for _ in range(rows):
        row_dict = {}
        for col in range(columns_in_dataframe):
            row_dict[f"col_{col}"] = random.choice(random_range)

        df_rows.append(row_dict)

    df = pd.DataFrame(df_rows)

    conditions_list = []

    for _ in range(conditions):
        c = {}
        for num in range(columns_in_condition):
            c[f"col_{num}"] = random.choice(random_range)
        conditions_list.append(c)

    return df, conditions_list

As you can see, the function is wrapped in the functools.lru_cache decorator, which applies memoization: repeated calls with the same parameters are served from a cache. That makes repeated benchmark runs faster and, more importantly, ensures that each implementation is tested with the same sample data.
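To illustrate, here's a tiny sketch (with made-up sizes): a second call with identical arguments hands back the very same objects instead of regenerating them.

df_a, conditions_a = generate_sample_data(rows=1_000, conditions=100)
df_b, conditions_b = generate_sample_data(rows=1_000, conditions=100)

assert df_b is df_a  # served from the cache, not regenerated
assert conditions_b is conditions_a

One caveat follows from this: the implementations under test must not mutate the dataframe they receive, otherwise they'd pollute the cache for later runs.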

To complete our setup, we define a few benchmarking configurations as well as functions to run the benchmarks and measure execution times.

benchmark_configs = [
    {"rows": 1_000, "conditions": 100},
    {"rows": 1_000_000, "conditions": 980},
    {"rows": 1_000_000, "conditions": 1000},
    {"rows": 10_000_000, "conditions": 10_000},
]

for benchmark_config in benchmark_configs:
    # Prewarm benchmark configurations
    _ = generate_sample_data(**benchmark_config)

def benchmark_implementation(
    implementation: typing.Callable[[pd.DataFrame, list[dict]], pd.DataFrame],
    **kwargs,
):

    sample_df, sample_conditions = generate_sample_data(**kwargs)

    start = perf_counter()
    implementation(sample_df, sample_conditions)
    end = perf_counter()

    print(f"Implementation {implementation.__name__} took {end - start:.04f}s")

def benchmark_scenarios(
    implementation: typing.Callable[[pd.DataFrame, list[dict]], pd.DataFrame],
    scenarios: list[dict],
):

    check_that_the_implementation_is_correct(implementation)

    for scenario in scenarios:
        scenario_desc = ", ".join([f"{key}={value}" for key, value in scenario.items()])
        print(f"Testing {implementation.__name__} with {scenario_desc}")
        benchmark_implementation(implementation, **scenario)

We'll benchmark the implementations with four scenarios:

rows        conditions
1,000       100
1,000,000   980
1,000,000   1,000
10,000,000  10,000

Looking at this, you might wonder about the oddly specific 980 conditions; you'll soon see why we use this particular number. The code first validates each implementation against our example from above and afterwards measures how long it takes to compute the result of each scenario.

Let's move on to the first implementation. Pandas has a neat query method that allows us to filter a dataframe based on a textual query. If you're familiar with SQL, you'll be very tempted to use it. The query for our example above would look as follows:

(`started_by` == 'Pam' & `job_status` == 'PENDING') | (`started_by` == 'Dwight' & `job_status` == 'DONE')
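If you want to try this directly, you can pass the string to DataFrame.query yourself; a quick sketch, assuming df holds the example table from above:

matching = df.query(
    "(`started_by` == 'Pam' & `job_status` == 'PENDING')"
    " | (`started_by` == 'Dwight' & `job_status` == 'DONE')"
)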

Pretty ugly in my opinion, but we can implement a solution that builds queries like this one programmatically:

def pandas_query_based(df: pd.DataFrame, conditions: list[dict]) -> pd.DataFrame:

    # Build one AND-joined group per condition, then OR the groups together.
    inner_and_conditions = [
        "(" + " & ".join(f"`{key}` == {repr(value)}" for key, value in condition.items()) + ")"
        for condition in conditions
    ]
    query_string = " | ".join(inner_and_conditions)

    return df.query(query_string).copy()

benchmark_scenarios(pandas_query_based, benchmark_configs[:2])

Two things should jump out at you:

  1. It looks like gibberish
  2. It seems like only the first two benchmark scenarios are being executed

If you spotted both, good job, have a cookie. Running this code gives us these performance numbers:

pandas_query_based is correct
Testing pandas_query_based with rows=1000, conditions=100
Implementation pandas_query_based took 0.0646s
Testing pandas_query_based with rows=1000000, conditions=980
Implementation pandas_query_based took 23.0720s

23 seconds for a million rows with 980 conditions doesn't seem terrible yet. The real problem appears when we increase the number of conditions, because the implementation uses some form of recursion under the hood. If I run the third scenario with 1,000 instead of 980 conditions, we get a nasty maximum recursion depth error:

# Make it go boom 💥
benchmark_scenarios(pandas_query_based, [benchmark_configs[2]])

# -> RecursionError: maximum recursion depth exceeded

This error message is the reason I'm writing this post: we encountered it in a project once we increased the dataset size, and that prompted me to look deeper. So far we've found a solution that only works for small to medium-sized datasets. Can we do better?

A relatively fast way to access data in a dataframe is boolean indexing. This means you extract rows from a dataframe by giving it a series that contains True or False for each row. You can build this series through column-wise comparisons and combine the results using the logical & and | operators.
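For our toy example, a hard-coded version of that would look like the following sketch, again assuming df holds the example table from above:

is_wanted = ((df["started_by"] == "Pam") & (df["job_status"] == "PENDING")) | (
    (df["started_by"] == "Dwight") & (df["job_status"] == "DONE")
)
matching = df[is_wanted]  # keeps only the rows where the mask is True

Generalized to an arbitrary list of conditions, the implementation looks a bit nasty as well: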

def pandas_through_boolean_indexing(df: pd.DataFrame, conditions: list[dict]) -> pd.DataFrame:

    at_least_one_condition_matches = None
    for condition in conditions:

        condition_matches = None

        for col_name, col_value in condition.items():
            is_match = df[col_name] == col_value

            if condition_matches is None:
                condition_matches = is_match
            else:
                condition_matches = condition_matches & is_match

        if at_least_one_condition_matches is None:
            at_least_one_condition_matches = condition_matches
        else:
            at_least_one_condition_matches = at_least_one_condition_matches | condition_matches

    return df[at_least_one_condition_matches].copy()

benchmark_scenarios(pandas_through_boolean_indexing, benchmark_configs)

Nested loops and a bunch of if/else statements don't exactly scream "Look at me! I'm superfast!", but you can see that the call to benchmark_scenarios now uses all of our benchmark configurations, so it must be doing a bit better. Here are the results:

pandas_through_boolean_indexing is correct
Testing pandas_through_boolean_indexing with rows=1000, conditions=100
Implementation pandas_through_boolean_indexing took 0.0073s
Testing pandas_through_boolean_indexing with rows=1000000, conditions=980
Implementation pandas_through_boolean_indexing took 0.8934s
Testing pandas_through_boolean_indexing with rows=1000000, conditions=1000
Implementation pandas_through_boolean_indexing took 0.9207s
Testing pandas_through_boolean_indexing with rows=10000000, conditions=10000
Implementation pandas_through_boolean_indexing took 99.0261s

Compared to the initial query-based implementation, this one handles all scenarios, and on the two scenarios both can solve it is roughly 9 to 26 times faster. For that speed-up we may tolerate slightly nasty code that requires some more thinking to understand.
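If the None-handling bothers you, the same masks can be built more compactly with functools.reduce and the operator module. This is just a sketch that trades the explicit loops for reduce calls; the behavior is identical, and operator would be an additional import:

import operator

def pandas_boolean_indexing_via_reduce(df: pd.DataFrame, conditions: list[dict]) -> pd.DataFrame:
    # AND the column comparisons within each condition...
    per_condition = [
        functools.reduce(operator.and_, (df[col] == value for col, value in condition.items()))
        for condition in conditions
    ]
    # ...then OR the per-condition masks together.
    return df[functools.reduce(operator.or_, per_condition)].copy()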

We can see, however, that the 10 million rows / 10,000 conditions scenario took almost 100 seconds to compute. That's bad news, because it's much closer to the numbers this implementation will have to deal with in the real world. Let's try again.

If we assume that all conditions filter on the same set of columns, we can reframe our problem. Luckily, that's the case here: in the toy example, every condition is based on the started_by and job_status columns, and the same holds for my real-world use case.

This allows us to treat our problem as an inner join. Let me show you the implementation; after seeing the first two, it's a sight for sore eyes.

def pandas_through_merge(df: pd.DataFrame, conditions: list[dict]) -> pd.DataFrame:
    """
    NOTE: This requires all conditions to filter on the same set of columns.
    """

    conditions_df = pd.DataFrame(conditions)

    df = df.merge(
        right=conditions_df,
        how="inner",
        on=list(conditions_df.columns)
    )

    return df

benchmark_scenarios(pandas_through_merge, benchmark_configs)

As you can see, two operations suffice. We cast our conditions into a dataframe and then perform an inner merge (the pandas equivalent of a SQL inner join, or a set intersection on these columns) with the input dataframe and return the result.
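For the toy example's filter conditions, the conditions dataframe we merge against would print roughly like this:

conditions_df = pd.DataFrame([
    {"started_by": "Pam", "job_status": "PENDING"},
    {"started_by": "Dwight", "job_status": "DONE"},
])
print(conditions_df)
#   started_by job_status
# 0        Pam    PENDING
# 1     Dwight       DONE

Every input row that matches one of these rows in both columns survives the inner merge. Let's see if the benchmarker likes this: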

pandas_through_merge is correct
Testing pandas_through_merge with rows=1000, conditions=100
Implementation pandas_through_merge took 0.0016s
Testing pandas_through_merge with rows=1000000, conditions=980
Implementation pandas_through_merge took 0.0356s
Testing pandas_through_merge with rows=1000000, conditions=1000
Implementation pandas_through_merge took 0.0289s
Testing pandas_through_merge with rows=10000000, conditions=10000
Implementation pandas_through_merge took 1.4272s

A colleague of mine would call this blazing fast. On the second benchmark, this implementation is more than 640 times faster than the original query-based result, and on the largest benchmark it is almost 70 times faster than the boolean-indexing implementation.

I should emphasize once more that the merge-based implementation only works if all conditions use the same set of columns; the boolean-indexing implementation also works with mixed columns across conditions.
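If your conditions do mix different column sets, one possible workaround (a sketch I haven't benchmarked, with a hypothetical function name) is to group the conditions by the columns they use, merge once per group, and combine the partial results:

from collections import defaultdict

def pandas_merge_mixed_columns(df: pd.DataFrame, conditions: list[dict]) -> pd.DataFrame:
    # Group the conditions by the (sorted) set of columns they filter on.
    groups: dict[tuple, list[dict]] = defaultdict(list)
    for condition in conditions:
        groups[tuple(sorted(condition))].append(condition)

    # Merge the input dataframe once per group and collect the matches.
    partial_results = [
        df.merge(pd.DataFrame(group), how="inner", on=list(columns))
        for columns, group in groups.items()
    ]

    # A row can match conditions from several groups, so deduplicate.
    # Beware: this also collapses genuine duplicate rows in the input.
    return pd.concat(partial_results).drop_duplicates()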

In conclusion: friends don't let friends use query once they're handling more than a few hundred filter conditions.

— Maurice


Versions:

  • python: 3.11.5
  • pandas: 2.1.1

Benchmarks executed on a M1 Pro MacBook.

Title Image by Tyler Nix on Unsplash
