Lightning-fast queries with Polars
Juan Luis Cano Rodríguez
Posted on May 25, 2023
This post is an adaptation of the one I originally published in the Orchest blog. Lots of things have changed in Polars since I wrote it, but at the time of writing these lines the post still has value. Enjoy!
Polars is an open-source project that provides in-memory dataframes for Python and Rust. Despite its young age (its first commit was a mere two years ago, in the middle of the COVID-19 pandemic) it has already gained lots of popularity due to its "lightning-fast" performance and the expressiveness of its API.
One of the most interesting things about Polars is that it offers two modes of operation:
- The eager mode is somewhat similar to how pandas works: operations are executed immediately and their result is available in memory. However, every operation in a chain needs to allocate a DataFrame, which is less than ideal.
- The lazy mode, on the other hand, builds an optimized query plan that exploits parallelism as much as possible: Polars applies several simplification techniques and pushes computations down the plan to accelerate the running time.
These ideas are not new: in fact, I covered similar lazy computation capabilities in my blog post about Vaex. However, Polars takes them one step further by offering a functional API that is delightful to use.
The other secret sauce of Polars is Apache Arrow. While other libraries use Arrow for things like reading Parquet files, Polars is tightly coupled with it: by using a Rust-native implementation of the Arrow memory format for its columnar storage, Polars can leverage the highly optimized Arrow data structures and focus on the data manipulation operations.
Interested? Read on!
Polars' popularity is growing fast (https://twitter.com/braaannigan/status/1526901314978029568)
First steps with Polars
For this example, we will use a sample of Stack Overflow questions and their tags obtained from Kaggle. Our generic goal is to display the most highly voted Python questions.
You can install Polars with conda/mamba or pip:
mamba install -y "polars=0.13.37"
pip install "polars==0.13.37"
Even though Polars is written in Rust, it distributes precompiled binary wheels on PyPI, so pip install will just work on all major Python versions from 3.6 onwards.
Let's load the Questions and Tags CSV files using read_csv:
import polars as pl
df = pl.read_csv("/data/stacksample/Questions.csv", encoding="utf8-lossy")
tags = pl.read_csv("/data/stacksample/Tags.csv")
The type of both objects is `polars.internals.frame.DataFrame`, "a two-dimensional data structure that represents data as a table with rows and columns" (reference docs). Both dataframes have millions of rows, and the Questions one takes almost 2 GB of memory:
In [7]: len(df), len(tags)
Out[7]: (1264216, 3750994)
In [8]: print(f"Estimated size: {df.estimated_size() >> 20} MiB")
Estimated size: 1865 MiB
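The `>> 20` in that f-string is just a bit shift that converts bytes to MiB (an integer division by 2**20). A tiny illustration with a made-up byte count:

```python
nbytes = 1_955_594_240     # a made-up size in bytes, roughly the one above
mib = nbytes >> 20         # same as nbytes // 2**20
print(f"Estimated size: {mib} MiB")
```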
Polars dataframes have some typical methods we know from pandas to inspect the data. Notice that calling the print function on a DataFrame produces a tidy ASCII representation, in addition to the fancy HTML representation available in Jupyter:
In [9]: print(df.head(3)) # No `print` needed on Jupyter
shape: (3, 7)
┌─────┬─────────────┬─────────────────┬─────────────────┬───────┬─────────────────┬────────────────┐
│ Id ┆ OwnerUserId ┆ CreationDate ┆ ClosedDate ┆ Score ┆ Title ┆ Body │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str ┆ str ┆ i64 ┆ str ┆ str │
╞═════╪═════════════╪═════════════════╪═════════════════╪═══════╪═════════════════╪════════════════╡
│ 80 ┆ 26 ┆ 2008-08-01T13:5 ┆ NA ┆ 26 ┆ SQLStatement.ex ┆ <p>I've │
│ ┆ ┆ 7:07Z ┆ ┆ ┆ ecute() - ┆ written a │
│ ┆ ┆ ┆ ┆ ┆ multipl... ┆ database │
│ ┆ ┆ ┆ ┆ ┆ ┆ gener... │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 90 ┆ 58 ┆ 2008-08-01T14:4 ┆ 2012-12-26T03:4 ┆ 144 ┆ Good branching ┆ <p>Are there │
│ ┆ ┆ 1:24Z ┆ 5:49Z ┆ ┆ and merging ┆ any really │
│ ┆ ┆ ┆ ┆ ┆ tutor... ┆ good tut... │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 120 ┆ 83 ┆ 2008-08-01T15:5 ┆ NA ┆ 21 ┆ ASP.NET Site ┆ <p>Has anyone │
│ ┆ ┆ 0:08Z ┆ ┆ ┆ Maps ┆ got experience │
│ ┆ ┆ ┆ ┆ ┆ ┆ cre... │
└─────┴─────────────┴─────────────────┴─────────────────┴───────┴─────────────────┴────────────────┘
In [10]: print(df.describe())
shape: (5, 8)
┌──────────┬─────────────┬─────────────┬──────────────┬────────────┬───────────┬───────┬──────┐
│ describe ┆ Id ┆ OwnerUserId ┆ CreationDate ┆ ClosedDate ┆ Score ┆ Title ┆ Body │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ str ┆ str ┆ str ┆ f64 ┆ str ┆ str │
╞══════════╪═════════════╪═════════════╪══════════════╪════════════╪═══════════╪═══════╪══════╡
│ mean ┆ 2.1327e7 ┆ null ┆ null ┆ null ┆ 1.781537 ┆ null ┆ null │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ std ┆ 1.1514e7 ┆ null ┆ null ┆ null ┆ 13.663886 ┆ null ┆ null │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ min ┆ 80.0 ┆ null ┆ null ┆ null ┆ -73.0 ┆ null ┆ null │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ max ┆ 4.014338e7 ┆ null ┆ null ┆ null ┆ 5190.0 ┆ null ┆ null │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ median ┆ 2.1725415e7 ┆ null ┆ null ┆ null ┆ 0.0 ┆ null ┆ null │
└──────────┴─────────────┴─────────────┴──────────────┴────────────┴───────────┴───────┴──────┘
In [11]: print(tags["Tag"].value_counts().head())
shape: (5, 2)
┌────────────┬────────┐
│ Tag ┆ counts │
│ --- ┆ --- │
│ str ┆ u32 │
╞════════════╪════════╡
│ javascript ┆ 124155 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ java ┆ 115212 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ c# ┆ 101186 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ php ┆ 98808 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ android ┆ 90659 │
└────────────┴────────┘
Following terminology similar to pandas, Polars dataframes contain several columns of type `polars.internals.series.Series`, each of them with a different data type:
In [12]: df["Title"].head(5)
Out[12]: shape: (5,)
Series: 'Title' [str]
[
"SQLStatement.e...
"Good branching...
"ASP.NET Site M...
"Function for c...
"Adding scripti...
]
In [13]: df.dtypes
Out[13]: [polars.datatypes.Int64,
polars.datatypes.Utf8,
polars.datatypes.Utf8,
polars.datatypes.Utf8,
polars.datatypes.Int64,
polars.datatypes.Utf8,
polars.datatypes.Utf8]
Expressions as chained operations on columns
The essential building blocks in Polars are expressions: functions that receive a Series and transform it into another Series. Expressions start with a root, and then you can chain more operations:
(
pl.col("Score") # Root of the Expression (a single column)
.mean() # Returns another Expression
)
The most interesting feature is that expressions are not bound to a specific object, but instead they are generic. Chains of expressions define the computation, which is materialized by a DataFrame method (acting as an execution context).
Sounds too abstract? See it in action:
In [20]: print(df.select(pl.col("Score").mean()))
shape: (1, 1)
┌──────────┐
│ Score │
│ --- │
│ f64 │
╞══════════╡
│ 1.781537 │
└──────────┘
The df.select method can do much more than just selecting columns: it can execute any column-wise expression. In fact, when passed a list of such expressions, it can broadcast them automatically if the dimensions are coherent, and it will execute them in parallel:
In [21]: print(df.select([
    ...: pl.col("Id").n_unique().alias("num_unique_questions"),
    ...: pl.col("Score").mean().alias("mean_score"),
    ...: pl.col("Title").str.lengths().max().alias("max_title_length"),
    ...: # To run the above in all text columns,
    ...: # you can filter by data type:
    ...: # pl.col(Utf8).str.lengths().max().suffix("_max_length"),
    ...: ]))
shape: (1, 3)
┌──────────────────────┬────────────┬──────────────────┐
│ num_unique_questions ┆ mean_score ┆ max_title_length │
│ --- ┆ --- ┆ --- │
│ u32 ┆ f64 ┆ u32 │
╞══════════════════════╪════════════╪══════════════════╡
│ 1264216 ┆ 1.781537 ┆ 204 │
└──────────────────────┴────────────┴──────────────────┘
The power of laziness
It is now time to start narrowing down the analysis a bit and focus on the questions related to Python. Notice that Polars algorithms require all the data to live in memory, so the usual caveats about large datasets apply when using the eager API. Since the questions dataset is already quite big, performing a .join operation with the tags data can crash the kernel:
# Don't try this at home unless you have enough RAM!
# (
# df
# .join(tags, on="Id")
# .filter(pl.col("Tag").str.contains(r"(?i)python"))
# .sort("Id")
# )
But fear not, because Polars has the perfect solution: switching to lazy mode! By prefixing our chain of operations with .lazy() and calling .collect() at the end, you can leverage Polars' optimization capabilities to their fullest potential and perform operations that would otherwise be impossible:
In [22]: q_python = (
...: df.lazy() # Notice the .lazy() call
...: # The input of a lazy join needs to be lazy
...: # We use a 'semi' join, like 'inner' but keeping only the left frame's columns
...: .join(tags.lazy(), on="Id", how="semi")
...: .filter(pl.col("Tag").str.contains(r"(?i)python"))
...: .sort("Id")
...: ).collect() # Call .collect() at the end
...: print(q_python.head(3))
shape: (3, 7)
┌───────┬─────────────┬──────────────────┬────────────┬───────┬──────────────────┬─────────────────┐
│ Id ┆ OwnerUserId ┆ CreationDate ┆ ClosedDate ┆ Score ┆ Title ┆ Body │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str ┆ str ┆ i64 ┆ str ┆ str │
╞═══════╪═════════════╪══════════════════╪════════════╪═══════╪══════════════════╪═════════════════╡
│ 11060 ┆ 912 ┆ 2008-08-14T13:59 ┆ NA ┆ 18 ┆ How should I ┆ <p>This is a │
│ ┆ ┆ :21Z ┆ ┆ ┆ unit test a ┆ difficult and │
│ ┆ ┆ ┆ ┆ ┆ code-ge... ┆ open-... │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 17250 ┆ 394 ┆ 2008-08-20T00:16 ┆ NA ┆ 24 ┆ Create an ┆ <p>I'm creating │
│ ┆ ┆ :40Z ┆ ┆ ┆ encrypted ZIP ┆ an ZIP file │
│ ┆ ┆ ┆ ┆ ┆ file in ... ┆ with... │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 19030 ┆ 745 ┆ 2008-08-20T22:50 ┆ NA ┆ 2 ┆ How to check set ┆ <p>I have a │
│ ┆ ┆ :55Z ┆ ┆ ┆ of files ┆ bunch of files │
│ ┆ ┆ ┆ ┆ ┆ confor... ┆ (TV e... │
└───────┴─────────────┴──────────────────┴────────────┴───────┴──────────────────┴─────────────────┘
In fact, if your raw CSV is so big that it doesn't fit in RAM in the first place, Polars also offers a lazy way of reading the file using scan_csv:
# We create the query plan separately
plan = (
# scan_csv returns a lazy dataframe already
pl.scan_csv("/data/stacksample/Questions.csv", encoding="utf8-lossy")
.join(tags.lazy(), on="Id", how="semi")
.filter(pl.col("Tag").str.contains(r"(?i)python"))
.sort("Score", reverse=True)
.limit(1_000)
)
top_voted_python_qs = plan.collect()
If you are curious about how Polars is doing all this work under the hood, you can even visualize the query plan!
Polars visualization of a query plan (not optimized)
Working with columns of lists
Notice that, in the previous section, we did a "semi" join to filter the questions, but we still don't have the list of tags associated with such questions. To achieve that, we will use one of the most surprisingly pleasant features of Polars: its list-column handling.
In [30]: tag_list_lazy = (
...: tags.lazy()
...: .groupby("Id").agg(
...: pl.col("Tag")
...: .list() # Convert to a list of strings
...: .alias("TagList")
...: )
...: )
...: print(tag_list_lazy.limit(5).collect())
shape: (5, 2)
┌──────────┬─────────────────────────────────────┐
│ Id ┆ TagList │
│ --- ┆ --- │
│ i64 ┆ list [str] │
╞══════════╪═════════════════════════════════════╡
│ 994990 ┆ ["spring"] │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 29087440 ┆ ["android", "android-intent"] │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 12093870 ┆ ["asp.net", ".net", "sqldatasour... │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 32889780 ┆ ["c", "extern", "function-declar... │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 22436290 ┆ ["mysql", "sql", ... "multiple-t... │
└──────────┴─────────────────────────────────────┘
After grouping by "Id" and turning each row into a list of tags, it's time to add a boolean column "ContainsPython" that signals whether any of the tags in the list contains the substring "python". For that, let's use the .arr.eval context (also known as the List context):
tag_list_extended_lazy = tag_list_lazy.with_column(
pl.col("TagList")
.arr.eval(
pl.element()
.str.contains(r"(?i)python")
.any()
).flatten().alias("ContainsPython")
)
The final join will provide the answer we are looking for:
top_python_questions = (
pl.scan_csv("/data/stacksample/Questions.csv", encoding="utf8-lossy")
.join(tag_list_extended_lazy, on="Id")
.filter(pl.col("ContainsPython"))
.sort("Score", reverse=True)
).limit(1_000).collect()
And the result:
Joining two dataframes in Polars
Very neat!
Some differences with pandas
Similarly to what happens with Vaex, Polars DataFrames don't have an index. The user guide goes as far as saying this:
Indexes are not needed! Not having them makes things easier - convince us otherwise!
The discussion of this contentious stance will be the subject of a future blog post. In any case, this allows Polars to simplify indexing operations, since strings will always refer to column names, and numbers in the first axis will always refer to row numbers:
In [36]: print(df[0]) # First row
shape: (1, 7)
┌─────┬─────────────┬───────────────────┬────────────┬───────┬──────────────────┬──────────────────┐
│ Id ┆ OwnerUserId ┆ CreationDate ┆ ClosedDate ┆ Score ┆ Title ┆ Body │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str ┆ str ┆ i64 ┆ str ┆ str │
╞═════╪═════════════╪═══════════════════╪════════════╪═══════╪══════════════════╪══════════════════╡
│ 80 ┆ 26 ┆ 2008-08-01T13:57: ┆ NA ┆ 26 ┆ SQLStatement.exe ┆ <p>I've written │
│ ┆ ┆ 07Z ┆ ┆ ┆ cute() - ┆ a database │
│ ┆ ┆ ┆ ┆ ┆ multipl... ┆ gener... │
└─────┴─────────────┴───────────────────┴────────────┴───────┴──────────────────┴──────────────────┘
In [37]: df[0, 0] # First row, first column
Out[37]: 80
In [38]: df[0, "Id"] # First row, column by name
Out[38]: 80
In [39]: df["Id"].head(5) # Column by name
Out[39]: shape: (5,)
Series: 'Id' [i64]
[
80
90
120
180
260
]
On the other hand, even though indexing with boolean masks is supported in Polars as a way to bridge the gap for pandas users, its use is discouraged in favor of select and filter, and "the functionality may be removed in the future". However, as you can see in the examples above, direct indexing is not needed as often as in pandas.
Should you use Polars?
Beyond this short introduction, Polars has much more to offer, from window functions and complex aggregations to time-series processing.
As a downside, since it is a young project and it's evolving quite fast, you will notice that some areas of the documentation are a bit lacking, or that there are no comprehensive release notes yet. Fortunately, Ritchie Vink, the Polars creator and current maintainer, quickly answers Stack Overflow questions and GitHub issues, and releases with bug fixes and new features are frequent.
On the other hand, if you are looking for an ultimate solution for your larger-than-RAM datasets, Polars might not be for you. Its lazy processing capabilities can take you quite far, but at some point you will have to confront the fact that Polars is an in-memory dataframe library, similar to pandas.
In summary:
- Use Polars if you are willing to learn a different but powerful new API, if your data fits in memory, if your workflow involves lots of list-column manipulation, and in general if you want to explore a much faster alternative to pandas.
- Don't use Polars if your data is much larger than RAM, if you are looking for solutions to quickly migrate a large pandas codebase, or if you are looking for an old, battle-tested library.