Benchmarking Python Processing Engines: Who’s the Fastest?
Bruno Nirello
Posted on February 8, 2024
In the dynamic landscape of data engineering, two tools have recently caught attention: DuckDB and Polars. DuckDB packs a traditional analytical SQL database into a lightweight in-process library, while Polars has emerged as a powerhouse DataFrame library. This post benchmarks the two contenders, evaluating their speed, efficiency, and user-friendliness. Let's dive in.
The Contenders
DuckDB (0.9.0): An in-process analytical (OLAP) database written in C++; it can run entirely in memory or persist to a single file.
Polars (0.19.6): A fast DataFrame library implemented in Rust, designed for multi-threaded, lightning-fast operations on a single machine.
Note: While other frameworks such as Pandas, Dask, Spark, and Vaex are extremely powerful and have their own unique use cases, this post focuses on benchmarking DuckDB and Polars due to their particular emphasis on speed performance in certain contexts.
Benchmark Setup
Source: I used the official Polars benchmark repository for this evaluation.
Benchmark Nature: The benchmarks consist of the standardized TPC-H queries, which are designed to reflect practical, real-world analytical workloads. The official Polars website publishes detailed results for 8 of these queries.
Query Types: The benchmark comprises all 22 unique queries (q1, q2, ..., q22), ranging from multi-table joins to aggregations and sorts, all of which are recognized for their computational intensity (an example is sketched after this list).
Testing Environment: The tests were conducted on a machine equipped with a 16-core AMD vCPU and 32 GB of RAM. All code was executed under Python 3.10.
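For a sense of what these queries look like, here is q1 (the pricing summary report), written against DuckDB roughly as the TPC-H specification defines it; the benchmark repository's exact formulation may differ slightly:

import duckdb

conn = duckdb.connect(database=':memory:')

# TPC-H q1: aggregate pricing statistics over most of the lineitem table
df_q1 = conn.sql("""
    SELECT
        l_returnflag,
        l_linestatus,
        sum(l_quantity) AS sum_qty,
        sum(l_extendedprice) AS sum_base_price,
        sum(l_extendedprice * (1 - l_discount)) AS sum_disc_price,
        sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) AS sum_charge,
        avg(l_quantity) AS avg_qty,
        avg(l_extendedprice) AS avg_price,
        avg(l_discount) AS avg_disc,
        count(*) AS count_order
    FROM 'lineitem.parquet'
    WHERE l_shipdate <= DATE '1998-12-01' - INTERVAL 90 DAY
    GROUP BY l_returnflag, l_linestatus
    ORDER BY l_returnflag, l_linestatus
""").fetchdf()
print(df_q1)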
Data Size
Data was generated with the repository's code at scale factor 10; the sizes of each entity are shown below.
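The post relies on the repository's own generator, so the following is only an alternative sketch: DuckDB's built-in tpch extension can produce the same scale factor 10 tables and export them to Parquet.

import duckdb

conn = duckdb.connect(database=':memory:')

# Generate TPC-H data at scale factor 10 (roughly 10 GB of raw data)
conn.sql("INSTALL tpch")
conn.sql("LOAD tpch")
conn.sql("CALL dbgen(sf=10)")

# Export every generated table to its own Parquet file
tables = ["lineitem", "orders", "customer", "part",
          "partsupp", "supplier", "nation", "region"]
for table in tables:
    conn.sql(f"COPY {table} TO '{table}.parquet' (FORMAT PARQUET)")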
Data Transformation & Querying
The benchmark measures query speed in two ways. The first, "Read Parquet File", lets the library process data directly from the file. The second loads the file into memory first and then measures only the query processing time.
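The repository has its own timing harness; purely to illustrate the difference between the two methods, a minimal sketch with DuckDB (file name assumed) could look like this:

import time
import duckdb

conn = duckdb.connect(database=':memory:')

# Method 1: time the query straight off the Parquet file
start = time.perf_counter()
conn.sql("SELECT count(*) FROM 'lineitem.parquet'").fetchall()
print(f"read-parquet: {time.perf_counter() - start:.3f}s")

# Method 2: load into memory first (untimed), then time only the query
conn.sql("CREATE TEMP TABLE lineitem AS SELECT * FROM 'lineitem.parquet'")
start = time.perf_counter()
conn.sql("SELECT count(*) FROM lineitem").fetchall()
print(f"in-memory: {time.perf_counter() - start:.3f}s")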
Query times for all 22 queries, reading the Parquet files directly:
Query times after all data was loaded into memory:
In q1, q9, q13 and q17, the combination of multiple joins, string-based filtering, and complex aggregation could be challenging for polars to optimize as effectively as duckdb.

q21 stands out due to its operations: counting unique values, filtering based on those counts, and a series of joins thereafter. Only by delving deeper into the internal mechanics of both tools can one truly discern the reasons for such variances.
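A reasonable first step in that direction is to look at the query plans each engine produces. DuckDB accepts an EXPLAIN prefix, and recent Polars versions expose LazyFrame.explain(); a minimal sketch:

import duckdb
import polars as pl

# DuckDB: prefix any query with EXPLAIN to see its plan
conn = duckdb.connect(database=':memory:')
print(conn.sql("EXPLAIN SELECT count(*) FROM 'lineitem.parquet'"))

# Polars: explain() prints the optimized logical plan of a lazy query
lf = pl.scan_parquet('lineitem.parquet').select(pl.count())
print(lf.explain())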
Overall, DuckDB appears faster in both scenarios, but that doesn't tell the whole story.
The process of loading data into memory incurs time and memory costs. By integrating the Linux time utility into the Makefile, we can measure these costs accurately.
/usr/bin/time -v make run_duckdb
/usr/bin/time -v make run_polars
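The most relevant fields in its report are "Percent of CPU this job got" and "Maximum resident set size". If you would rather measure from inside the Python process, the standard library's resource module exposes peak RSS as well; a small sketch (on Linux, ru_maxrss is reported in kilobytes):

import resource
import duckdb

conn = duckdb.connect(database=':memory:')
conn.sql("SELECT count(*) FROM 'lineitem.parquet'").fetchall()

# Peak resident set size of this process so far (kilobytes on Linux)
peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print(f"peak RSS: {peak_kb / 1024:.1f} MiB")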
DuckDB demonstrates faster performance and lower memory usage than Polars when reading a file directly. Polars may have resorted to swap memory, as indicated by the metric highlighted in red. It's also worth noting that neither library is designed to scale out across multiple machines, which makes efficient use of the available CPU cores all the more important.
Polars shows competitive or even better performance in specific areas, such as disk IO during direct file reads and RAM IO during in-memory operations. On a system with fewer disk IOPS, Polars can come out ahead.
¹ Higher is better for Percentage of CPU. Values above 100% indicate multi-core processing.
How to do it?
DuckDB "Read Parquet File"
import duckdb

# Connect to an in-memory DuckDB instance
conn = duckdb.connect(database=':memory:')

# Query the Parquet file directly; DuckDB scans it without loading it first
df_count = conn.sql("""
    SELECT
        count(*) AS count_order
    FROM
        'lineitem.parquet'
""").fetchdf()
print(df_count)
DuckDB "In Memory"
import duckdb

# Connect to an in-memory DuckDB instance
conn = duckdb.connect(database=':memory:')

# Load the Parquet file into a temporary table first; this step costs
# time and memory, but later queries run entirely against RAM
conn.sql("""
    CREATE TEMP TABLE IF NOT EXISTS lineitem AS
    SELECT *
    FROM read_parquet('lineitem.parquet');
""")

# The timed query now reads from the in-memory table
df_count = conn.sql("""
    SELECT
        count(*) AS count_order
    FROM
        lineitem
""").fetchdf()
print(df_count)
Polars "Read Parquet File"
import polars as pl

# scan_parquet builds a lazy query over the file; nothing is read yet
df = pl.scan_parquet('lineitem.parquet')

# collect() executes the plan, reading from the Parquet file as needed
df_count = df.select(
    pl.count().alias("count_order"),
).collect()
print(df_count)
Polars "In Memory"
import polars as pl

# Materialize the file into memory, defragment it with rechunk(), then
# return to the lazy API so later queries run against the in-memory frame
df = pl.scan_parquet('lineitem.parquet')
df = df.collect().rechunk().lazy()

# This collect() measures only query time, not file loading
df_count = df.select(
    pl.count().alias("count_order"),
).collect()
print(df_count)
In Conclusion:
The benchmark results presented here are based on the current versions of both libraries. As these tools continue to evolve, it's worth noting that performance metrics might shift with future updates.
DuckDB emerges as a promising contender in the landscape of Python processing engines. Its open-source nature offers a refreshing alternative to traditional OLAP databases pushed by SaaS vendors. On performance, DuckDB consistently came out ahead in this benchmark, particularly in tasks involving joins and complex aggregations. Its simplicity and the prospect of cleaner, more readable code also make it an attractive option for those coming from libraries like Pandas and Spark (Spark SQL aside).
However, it's worth recognizing that DuckDB is still at an early stage of development, so users may occasionally run into bugs or missing features.
If you believe there's an aspect I might have overlooked, please leave a comment below!