Benchmarking Python Processing Engines: Who’s the Fastest?


Bruno Nirello

Posted on February 8, 2024


In the dynamic landscape of data engineering, two tools have recently caught attention: DuckDB and Polars. DuckDB impresses with its unique blend of traditional and contemporary database features, while Polars emerges as a powerhouse for data processing. This post aims to benchmark these contenders, evaluating their speed, efficiency, and user-friendliness. Let's dive in.

The Contenders

DuckDB (0.9.0): An in-memory analytical database written in C++.
Polars (0.19.6): An ultra-fast DataFrame library implemented in Rust, designed to provide lightning-fast operations.

Note: While other frameworks such as Pandas, Dask, Spark, and Vaex are extremely powerful and have their own unique use cases, this post focuses on benchmarking DuckDB and Polars due to their particular emphasis on speed performance in certain contexts.

Benchmark Setup

  • Source: I have employed the official Polars benchmark repository for this evaluation.

  • Benchmark Nature: The benchmarks consist of TPC-standardized queries, specifically curated to assess the performance of practical, real-world workflows. The official Polars website publishes detailed results for 8 of these queries.

  • Query Types: This benchmark incorporates 22 unique queries (q1, q2, etc.), ranging from joins to aggregations and sorts, all of which are recognized for their computational intensity.

  • Testing Environment: The tests were conducted on a machine equipped with a 16-core AMD vCPU and 32GB of RAM. All code was executed using Python 3.10.

Data Size

Data was generated with the repository's code at scale factor 10; the size of each entity is shown below.

[Image: size of each entity at scale factor 10]
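
The repository ships its own data generator, but as a hedged aside: DuckDB's tpch extension can produce equivalent tables and export them to Parquet. A minimal sketch (the scale factor mirrors the setup above; the output path is illustrative, and this is not the repository's tooling):

import duckdb

conn = duckdb.connect(database=':memory:')

# Generate the TPC-H tables at scale factor 10 with DuckDB's tpch extension.
conn.sql("INSTALL tpch; LOAD tpch;")
conn.sql("CALL dbgen(sf=10);")

# Export one of the generated tables (e.g. lineitem) to Parquet.
conn.sql("COPY lineitem TO 'lineitem.parquet' (FORMAT PARQUET);")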

Data Transformation & Querying

The benchmark uses two methods to measure query speed. The first, "Read Parquet File", utilizes the library's efficient data processing directly from the file. The second method loads the file into memory and then measures the query processing time.
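
As a rough illustration of how these two measurements could be taken (the query here is a trivial placeholder; the actual benchmark drives the 22 queries through the repository's Makefile), a minimal timing sketch with DuckDB:

import time
import duckdb

conn = duckdb.connect(database=':memory:')

# Method 1: query the Parquet file directly.
start = time.perf_counter()
conn.sql("SELECT count(*) FROM 'lineitem.parquet'").fetchall()
print(f"Read Parquet File: {time.perf_counter() - start:.3f}s")

# Method 2: load the data into memory first, then time only the query.
conn.sql("CREATE TEMP TABLE lineitem AS SELECT * FROM read_parquet('lineitem.parquet');")
start = time.perf_counter()
conn.sql("SELECT count(*) FROM lineitem").fetchall()
print(f"In Memory: {time.perf_counter() - start:.3f}s")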

[Image: query times when querying the file directly]

All queries were then run again, this time with the data loaded into memory first.

[Image: query times with all data already in memory]

In q1, q9, q13 and q17, the combination of multiple joins, string-based filtering, and complex aggregation could be challenging for Polars to optimize as effectively as DuckDB.

q21 stands out due to its operations on counting unique values, filtering based on these counts, and a series of joins thereafter. Only by delving deeper into the internal mechanics of both tools can one truly discern the reasons for such variances.
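
For readers who haven't seen the query, here is a rough sketch of that q21-style pattern in Polars (column names follow the TPC-H schema; this illustrates the shape of the query, not the literal benchmark code):

import polars as pl

lineitem = pl.scan_parquet('lineitem.parquet')
supplier = pl.scan_parquet('supplier.parquet')

# Count distinct suppliers per order, then keep only multi-supplier orders.
multi_supplier_orders = (
    lineitem
    .group_by('l_orderkey')
    .agg(pl.col('l_suppkey').n_unique().alias('n_suppliers'))
    .filter(pl.col('n_suppliers') > 1)
)

# Join the filtered counts back, then join to supplier and aggregate.
result = (
    lineitem
    .join(multi_supplier_orders, on='l_orderkey', how='inner')
    .join(supplier, left_on='l_suppkey', right_on='s_suppkey', how='inner')
    .group_by('s_name')
    .agg(pl.count().alias('numwait'))
    .sort('numwait', descending=True)
    .collect()
)
print(result)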

Overall, DuckDB appears faster in both scenarios, but that doesn't tell the whole story.


The process of loading data into memory incurs time and memory costs. By integrating the Linux time utility into the Makefile, we can accurately measure these costs.

/usr/bin/time -v make run_duckdb
/usr/bin/time -v make run_polars

[Image: output of /usr/bin/time -v for both runs: wall-clock time, peak memory, and CPU utilization]

DuckDB demonstrates faster performance and lower memory usage than Polars when reading a file directly. The red metric suggests that Polars may have spilled into swap memory. It's also worth noting that neither library is designed to scale out across multiple machines, which makes efficient use of the available CPU cores all the more important.
Polars shows competitive or even better performance in specific areas, such as disk IO during direct file reads and RAM IO during in-memory operations. On a system with lower disk IOPS, Polars may perform better.

¹ For Percentage of CPU, higher is better; values above 100% indicate multi-core processing.
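
As a rough in-process cross-check of the peak-memory figure (it only covers the Python process itself; the file name is a placeholder), the standard resource module can report the peak resident set size:

import resource
import duckdb

conn = duckdb.connect(database=':memory:')
conn.sql("""
  CREATE TEMP TABLE lineitem AS
  SELECT * FROM read_parquet('lineitem.parquet');
"""
)

# On Linux, ru_maxrss is the peak resident set size in kilobytes.
peak_rss_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print(f"Peak RSS: {peak_rss_kb / 1024:.1f} MiB")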

How to do it?

DuckDB "Read Parquet File"

import duckdb
conn = duckdb.connect(database=':memory:')
df_count = conn.sql("""
  SELECT
      count(*) as count_order
  FROM
      'lineitem.parquet'
"""
).fetchdf()
print(df_count)

DuckDB "In Memory"

import duckdb
conn = duckdb.connect(database=':memory:')
conn.sql("""
  CREATE TEMP TABLE IF NOT EXISTS lineitem AS
  SELECT *
  FROM read_parquet('lineitem.parquet');
"""
)
df_count = conn.sql("""
  SELECT
      count(*) as count_order
  FROM
      lineitem
"""
).fetchdf()
print(df_count)

Polars "Read Parquet File"

import polars as pl
df = pl.scan_parquet('lineitem.parquet')
df_count = df.select(
    pl.count().alias("count_order"),
).collect()
print(df_count)

Polars "In Memory"

import polars as pl
df = pl.scan_parquet('lineitem.parquet')
df = df.collect().rechunk().lazy()
df_count = df.select(
    pl.count().alias("count_order"),
).collect()
print(df_count)
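
For reference, an alternative (and arguably more direct) way to get the data into memory first is to read eagerly with pl.read_parquet and then switch to the lazy API; this is a sketch of roughly equivalent behaviour, not the benchmark repository's exact code:

import polars as pl

# Eagerly load the Parquet file, then run the query through the lazy API.
df = pl.read_parquet('lineitem.parquet').lazy()
df_count = df.select(
    pl.count().alias("count_order"),
).collect()
print(df_count)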

In Conclusion:

The benchmark results presented here are based on the current versions of both libraries. As these tools continue to evolve, it's worth noting that performance metrics might shift with future updates.

DuckDB emerges as a promising contender in the landscape of Python processing engines. Its open-source nature provides a refreshing alternative to traditional OLAP databases driven by SaaS vendors. When it comes to performance, DuckDB consistently outperforms some of its competitors, particularly in tasks involving joins and complex aggregations. Additionally, its simplicity and the prospect of cleaner, more readable code make it an attractive option for those familiar with libraries like Pandas and Spark (except Spark SQL).

However, it's essential to recognize that DuckDB is still in its nascent stages. This early phase of development means users might occasionally encounter challenges in the form of bugs or missing features.

If you believe there's an aspect I might have overlooked, please leave a comment below!
