Lightning-fast queries with Polars

astrojuanlu

Juan Luis Cano Rodríguez

Posted on May 25, 2023

Lightning-fast queries with Polars

This post is an adaptation of the one I originally published in the Orchest blog. Lots of things have changed in Polars since I wrote it, but at the time of writing this lines the post still has value. Enjoy!

Polars is an open-source project that provides in-memory dataframes for Python and Rust. Despite its young age (its first commit was a mere two years ago, in the middle of the COVID-19 pandemic) it has already gained lots of popularity due to its "lightning-fast" performance and the expressiveness of its API.

One of the most interesting things about Polars is that it offers two modes of operation:

  • The eager mode is somewhat similar to how pandas works: operations are executed immediately and their result is available in memory. Every operation in a chain would need to allocate a DataFrame however, which is less than ideal.
  • The lazy mode, on the other hand, builds an optimized query plan that exploits parallelism as much as possible: Polars applies several simplification techniques and pushes computations to accelerate the running time as much as possible.

These ideas are not new: in fact, in my blog post about Vaex we covered its lazy computation capabilities. However, Polars takes them one step further by offering a functional API that is delightful to use.

The other secret sauce of Polars is Apache Arrow. While other libraries use Arrow for things like reading Parquet files, Polars is tightly coupled with it: by using a Rust-native implementation of the Arrow memory format for its columnar storage, Polars can leverage the highly optimized Arrow data structures and focus on the data manipulation operations.

Interested? Read on!

Polars popularity is growing fast

Polars popularity is growing fast (https://twitter.com/braaannigan/status/1526901314978029568)

First steps with Polars

For this example, we will use a sample of Stack Overflow questions and their tags obtained from Kaggle. Our generic goal is to display the most highly voted Python questions.

You can install Polars with conda/mamba or pip:



mamba install -y "polars=0.13.37"  
pip install "polars==0.13.37"


Enter fullscreen mode Exit fullscreen mode

Even though Polars is written in Rust, it distributes precompiled binary wheels on PyPI, so pip install will just work on all major Python versions from 3.6 onwards.

Let's load the Questions and Tags CSV files using



import polars as pl  

df = pl.read_csv("/data/stacksample/Questions.csv", encoding="utf8-lossy")  
tags = pl.read_csv("/data/stacksample/Tags.csv")


Enter fullscreen mode Exit fullscreen mode

The type of both objects is `polars.internals.frame.DataFrame`, "a two-dimensional data structure that represents data as a table with rows and columns" (reference docs). Both dataframes have millions of rows, and the Questions one takes almost 2 GB of memory:



In [7]: len(df), len(tags)  
Out[7]: (1264216, 3750994)  

In [8]: print(f"Estimated size: {df.estimated_size() >> 20} MiB")  
Estimated size: 1865 MiB


Enter fullscreen mode Exit fullscreen mode

Polars dataframes have some typical methods we know from pandas to inspect the data. Notice that calling the print function on a DataFrame produces a tidy ASCII representation, in addition to the fancy HTML representation available in Jupyter:



In [9]: print(df.head(3))  # No `print` needed on Jupyter  
shape: (3, 7)  
┌─────┬─────────────┬─────────────────┬─────────────────┬───────┬─────────────────┬────────────────┐  
 Id  ┆ OwnerUserId  CreationDate    ┆ ClosedDate      ┆ Score  Title            Body             
 ---  ---          ---              ---              ---    ---              ---            │  
 i64  str          str              str              i64    str              str            │  
╞═════╪═════════════╪═════════════════╪═════════════════╪═══════╪═════════════════╪════════════════╡  
 80  ┆ 26          ┆ 2008-08-01T13:5  NA              ┆ 26    ┆ SQLStatement.ex  <p>I've        │  
│     ┆             ┆ 7:07Z           ┆                 ┆       ┆ ecute() -       ┆ written a      │  
│     ┆             ┆                 ┆                 ┆       ┆ multipl...      ┆ database       │  
│     ┆             ┆                 ┆                 ┆       ┆                 ┆ gener...       │  
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤  
│ 90  ┆ 58          ┆ 2008-08-01T14:4 ┆ 2012-12-26T03:4 ┆ 144   ┆ Good branching  ┆ <p>Are there   │  
│     ┆             ┆ 1:24Z           ┆ 5:49Z           ┆       ┆ and merging     ┆ any really     │  
│     ┆             ┆                 ┆                 ┆       ┆ tutor...        ┆ good tut...    │  
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤  
│ 120 ┆ 83          ┆ 2008-08-01T15:5 ┆ NA              ┆ 21    ┆ ASP.NET Site    ┆ <p>Has anyone  │  
│     ┆             ┆ 0:08Z           ┆                 ┆       ┆ Maps            ┆ got experience │  
│     ┆             ┆                 ┆                 ┆       ┆                 ┆ cre...         │  
└─────┴─────────────┴─────────────────┴─────────────────┴───────┴─────────────────┴────────────────┘  
In [10]: print(df.describe())  
shape: (5, 8)  
┌──────────┬─────────────┬─────────────┬──────────────┬────────────┬───────────┬───────┬──────┐  
│ describe ┆ Id          ┆ OwnerUserId ┆ CreationDate ┆ ClosedDate ┆ Score     ┆ Title ┆ Body │  
│ ---      ┆ ---         ┆ ---         ┆ ---          ┆ ---        ┆ ---       ┆ ---   ┆ ---  │  
│ str      ┆ f64         ┆ str         ┆ str          ┆ str        ┆ f64       ┆ str   ┆ str  │  
╞══════════╪═════════════╪═════════════╪══════════════╪════════════╪═══════════╪═══════╪══════╡  
│ mean     ┆ 2.1327e7    ┆ null        ┆ null         ┆ null       ┆ 1.781537  ┆ null  ┆ null │  
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤  
│ std      ┆ 1.1514e7    ┆ null        ┆ null         ┆ null       ┆ 13.663886 ┆ null  ┆ null │  
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤  
│ min      ┆ 80.0        ┆ null        ┆ null         ┆ null       ┆ -73.0     ┆ null  ┆ null │  
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤  
│ max      ┆ 4.014338e7  ┆ null        ┆ null         ┆ null       ┆ 5190.0    ┆ null  ┆ null │  
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤  
│ median   ┆ 2.1725415e7 ┆ null        ┆ null         ┆ null       ┆ 0.0       ┆ null  ┆ null │  
└──────────┴─────────────┴─────────────┴──────────────┴────────────┴───────────┴───────┴──────┘  
[11]: print(tags["Tag"].value_counts().head())  
shape: (5, 2)  
┌────────────┬────────┐  
│ Tag        ┆ counts │  
│ ---        ┆ ---    │  
│ str        ┆ u32    │  
╞════════════╪════════╡  
│ javascript ┆ 124155 │  
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤  
│ java       ┆ 115212 │  
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤  
│ c#         ┆ 101186 │  
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤  
│ php        ┆ 98808  │  
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤  
│ android    ┆ 90659  │  
└────────────┴────────┘


Enter fullscreen mode Exit fullscreen mode

Following a terminology similar to pandas, Polars dataframes contain several columns of type polars.internals.series.Series , each of them with a different data type:



In [12]: df["Title"].head(5)  
Out[12]: shape: (5,)  
Series: 'Title' [str]  
[  
"SQLStatement.e...  
"Good branching...  
"ASP.NET Site M...  
"Function for c...  
"Adding scripti...  
]  

In [13]: df.dtypes  
Out[13]: [polars.datatypes.Int64,  
polars.datatypes.Utf8,  
polars.datatypes.Utf8,  
polars.datatypes.Utf8,  
polars.datatypes.Int64,  
polars.datatypes.Utf8,  
polars.datatypes.Utf8]


Enter fullscreen mode Exit fullscreen mode

Expressions as chained operations on columns

The essential building blocks in Polars are expressions: functions that receive a Series and transform it into another Series. Expressions start with a root, and then you can chain more operations:



(  
   pl.col("Score")  # Root of the Expression (a single column)  
   .mean()  # Returns another Expression  
)


Enter fullscreen mode Exit fullscreen mode

The most interesting feature is that expressions are not bound to a specific object, but instead they are generic. Chains of expressions define the computation, which is materialized by a DataFrame method (acting as an execution context).

Sounds too abstract? See it in action:



In [20]: print(df.select(pl.col("Score").mean()))  
shape: (1, 1)  
┌──────────┐  
 Score    │  
 ---      │  
 f64      │  
╞══════════╡  
 1.781537   
└──────────┘


Enter fullscreen mode Exit fullscreen mode

The df.select method can do much more than just selecting columns: it can execute any column-wise expression. In fact, when passed a list of such expressions, it can broadcast them automatically if the dimensions are coherent, and it will execute them in parallel:



In [21]: print(df.select([  
   ...:   pl.col("Id").n_unique().alias("num_unique_users"),  
   ...:   pl.col("Score").mean().alias("mean_score"),  
   ...:   pl.col("Title").str.lengths().max().alias("max_title_length"),  
   ...:   # To run the above in all text columns,  
   ...:   # you can filter by data type:  
   ...:   # pl.col(Utf8).str.lengths().max().suffix("_max_length"),  
   ...: ]))  
shape: (1, 3)  
┌──────────────────┬────────────┬──────────────────┐  
 num_unique_users  mean_score  max_title_length   
 ---              ┆ ---        ┆ ---              │  
 u32              ┆ f64        ┆ u32              │  
╞══════════════════╪════════════╪══════════════════╡  
 1264216          ┆ 1.781537    204              │  
└──────────────────┴────────────┴──────────────────┘


Enter fullscreen mode Exit fullscreen mode

The power of laziness

It is now time to start narrowing down the analysis a bit and focus on the questions that are related to Python. Notice that Polars algorithms require all the data to live in memory, and therefore when using the eager API you have to apply the usual caveats about large datasets. As a result, since the questions dataset is already quite big, performing a .join  operation with the tags data can crash the kernel:



# Don't try this at home unless you have enough RAM!  
# (  
#     df  
#     .join(tags, on="Id")  
#     .filter(pl.col("Tag").str.contains(r"(i?)python"))  
#     .sort("Id")  
# )


Enter fullscreen mode Exit fullscreen mode

But fear not, because Polars has the perfect solution: switching to lazy mode! By prefixing our chain of operations by .lazy()  and calling .collect()  at the end, you can leverage Polars optimization capabilities to its fullest potential, and perform operations that would be otherwise impossible:



In [22]: q_python = (  
   ...:   df.lazy()  # Notice the .lazy() call  
   ...:   # The input of a lazy join needs to be lazy  
   ...:   # We use a 'semi' join, like 'inner' but discarding extra columns  
   ...:   .join(tags.lazy(), on="Id", how="semi")  
   ...:   .filter(pl.col("Tag").str.contains(r"(i?)python"))  
   ...:   .sort("Id")  
   ...: ).collect()  # Call .collect() at the end  
   ...: print(q_python.head(3))  
shape: (3, 7)  
┌───────┬─────────────┬──────────────────┬────────────┬───────┬──────────────────┬─────────────────┐  
 Id    ┆ OwnerUserId  CreationDate      ClosedDate  Score  Title            ┆ Body            │  
 ---    ---          ---              ┆ ---        ┆ ---    ---              ┆ ---               
 i64    str          str              ┆ str        ┆ i64    str              ┆ str               
╞═══════╪═════════════╪══════════════════╪════════════╪═══════╪══════════════════╪═════════════════╡  
 11060  912          2008-08-14T13:59  NA          18    ┆ How should I      <p>This is a    │  
                     :21Z                         ┆        unit test a      ┆ difficult and     
                                      ┆            ┆        code-ge...        open-...        │  
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤  
 17250  394          2008-08-20T00:16  NA          24    ┆ Create an        ┆ <p>I'm creating │  
│       ┆             ┆ :40Z             ┆            ┆       ┆ encrypted ZIP    ┆ an ZIP file     │  
│       ┆             ┆                  ┆            ┆       ┆ file in ...      ┆ with...         │  
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤  
│ 19030 ┆ 745         ┆ 2008-08-20T22:50 ┆ NA         ┆ 2     ┆ How to check set ┆ <p>I have a     │  
│       ┆             ┆ :55Z             ┆            ┆       ┆ of files         ┆ bunch of files  │  
│       ┆             ┆                  ┆            ┆       ┆ confor...        ┆ (TV e...        │


Enter fullscreen mode Exit fullscreen mode

In fact, if your raw CSV is so big that it doesn't fit in RAM to start, Polars offers a lazy way of reading the file too using scan_csv :



# We create the query plan separately  
plan = (  
   # scan_csv returns a lazy dataframe already  
   pl.scan_csv("/data/stacksample/Questions.csv", encoding="utf8-lossy")  
   .join(tags.lazy(), on="Id", how="semi")  
   .filter(pl.col("Tag").str.contains(r"(i?)python"))  
   .sort("Score", reverse=True)  
   .limit(1_000)  
)  
top_voted_python_qs = plan.collect()


Enter fullscreen mode Exit fullscreen mode

If you are curious about how Polars is doing all this work under the hood, notice that you can visualize the query plan!

Polars visualization of a query plan (not optimized)

Polars visualization of a query plan (not optimized)

Working with columns of lists

Notice that, in the previous section, we did a "semi" join to filter the questions, but we still don't have the list of tags associated with such questions. To achieve that, we will use one of the most surprisingly pleasant features of Polars: its list-column handling.



In [30]: tag_list_lazy = (  
   ...:   tags.lazy()  
   ...:   .groupby("Id").agg(  
   ...:     pl.col("Tag")  
   ...:     .list()  # Convert to a list of strings  
   ...:     .alias("TagList")  
   ...:   )  
   ...: )  
   ...: print(tag_list_lazy.limit(5).collect())  
shape: (5, 2)  
┌──────────┬─────────────────────────────────────┐  
 Id        TagList                               
 ---      ┆ ---                                   
 i64      ┆ list [str]                          │  
╞══════════╪═════════════════════════════════════╡  
 994990    ["spring"]                          │  
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤  
 29087440  ["android", "android-intent"]         
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤  
 12093870  ["asp.net", ".net", "sqldatasour... │  
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤  
│ 32889780 ┆ ["c", "extern", "function-declar...   
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤  
 22436290  ["mysql", "sql", ... "multiple-t... │  
└──────────┴─────────────────────────────────────┘


Enter fullscreen mode Exit fullscreen mode

After grouping by "Id" and turning each row into a list of tags, it's time to add a boolean column "ContainsPython" that signals whether any of the tags in the list contains the substring "python". For that', let's use the _.arr.eval_ context (also known as the List context):



tag_list_extended_lazy = tag_list_lazy.with_column(  
   pl.col("TagList")  
   .arr.eval(  
       pl.element()  
       .str.contains(r"(i?)python")  
       .any()  
   ).flatten().alias("ContainsPython")  
)


Enter fullscreen mode Exit fullscreen mode

The final join will provide the answer we are looking for:



top_python_questions = (  
   pl.scan_csv("/data/stacksample/Questions.csv", encoding="utf8-lossy")  
   .join(tag_list_extended_lazy, on="Id")  
   .filter(pl.col("ContainsPython"))  
   .sort("Score", reverse=True)  
).limit(1_000).collect()


Enter fullscreen mode Exit fullscreen mode

And the result:

Joining two dataframes in Polars

Joining two dataframes in Polars

Very neat!

Some differences with pandas

Similarly to what happens with Vaex, Polars DataFrames don't have an index. The user guide goes as far as saying this:

Indexes are not needed! Not having them makes things easier - convince us otherwise!

The discussion of this contentious stance will be the subject of a future blog post. In any case, this allows Polars to simplify indexing operations, since strings will always refer to column names, and numbers in the first axis will always refer to row numbers:



In [36]: print(df[0])  # First row  
shape: (1, 7)  
┌─────┬─────────────┬───────────────────┬────────────┬───────┬──────────────────┬──────────────────┐  
 Id  ┆ OwnerUserId  CreationDate      ┆ ClosedDate  Score  Title            ┆ Body               
 ---  ---          ---                ---        ┆ ---    ---              ┆ ---              │  
 i64  str          str                str        ┆ i64    str              ┆ str              │  
╞═════╪═════════════╪═══════════════════╪════════════╪═══════╪══════════════════╪══════════════════╡  
 80  ┆ 26          ┆ 2008-08-01T13:57:  NA          26    ┆ SQLStatement.exe  <p>I've written  │  
│     ┆             ┆ 07Z               ┆            ┆       ┆ cute() -         ┆ a database       │  
│     ┆             ┆                   ┆            ┆       ┆ multipl...       ┆ gener...         │  
└─────┴─────────────┴───────────────────┴────────────┴───────┴──────────────────┴──────────────────┘  

[37]: df[0, 0]  # First row, first column  
Out[37]: 80  

In [38]: df[0, "Id"]  # First row, column by name  
Out[38]: 80  

In [39]: df["Id"].head(5)  # Column by name  
Out[39]: shape: (5,)  
Series: 'Id' [i64]  
[  
80  
90  
120  
180  
260  
]


Enter fullscreen mode Exit fullscreen mode

On the other hand, even though indexing with boolean masks is supported in Polars as a way to bridge the gap with Pandas users, its use is discouraged in favor of select and filter, and "the functionality may be removed in the future". However, as you could see in the examples above, direct indexing is not needed as often as in pandas.

Should you use Polars?

Beyond this short introduction, Polars has much more to offer, from window functions and complex aggregations to time-series processing, and much more.

As a downside, since it is a young project and it's evolving quite fast, you will notice that some areas of the documentation are a bit lacking, or that there are no comprehensive release notes yet. Fortunately, Ritchie Vink, the Polars creator and current maintainer, quickly answers Stack Overflow questions and GitHub issues, and releases with bug fixes and new features are frequent.

On the other hand, if you are looking for an ultimate solution for your larger-than-RAM datasets, Polars might not be for you. Its lazy processing capabilities can take you quite far, but at some point you will have to confront the fact that Polars is an in-memory dataframe library, similar to pandas.

In summary:

  • Use Polars if you are willing to learn a different but powerful new API, if your data fits in memory, if your workflow involves lots of list-column manipulation, and in general if you want to explore a much faster alternative to pandas.
  • Don't use Polars if your data is much larger than RAM, if you are looking for solutions to quickly migrate a large pandas codebase, or if you are looking for an old, battle-tested library.
💖 💪 🙅 🚩
astrojuanlu
Juan Luis Cano Rodríguez

Posted on May 25, 2023

Join Our Newsletter. No Spam, Only the good stuff.

Sign up to receive the latest update from our blog.

Related