Spark on AWS Glue: Performance Tuning 2 (Glue DynamicFrame vs Spark DataFrame)

tmyoda

Tomoya Oda

Posted on July 16, 2023

Spark on AWS Glue: Performance Tuning 2 (Glue DynamicFrame vs Spark DataFrame)

This is a continuation of my previous posts as follows.

  1. Spark on AWS Glue: Performance Tuning 1 (CSV vs Parquet)

  2. Spark on AWS Glue: Performance Tuning 2 (Glue DynamicFrame vs Spark DataFrame)

  3. Spark on AWS Glue: Performance Tuning 3 (Impact of Partition Quantity)

  4. Spark on AWS Glue: Performance Tuning 4 (Spark Join)

  5. Spark on AWS Glue: Performance Tuning 5 (Using Cache)

Glue DynamicFrame vs Spark DataFrame

Let's compare them using the Parquet file which I created in the part 1.

Data Read Speed Comparison

We will read a single large Parquet file and a highly partitioned Parquet file.

with timer('df'):
    dyf = glueContext.create_dynamic_frame.from_options(
        "s3",
        {
            "paths": [
                "s3://.../parquet-chunk-high/"
            ]
        },
        "parquet",
    )
    print(dyf.count())

with timer('df partition'): 
    dyf = glueContext.create_dynamic_frame.from_options(
        "s3",
        {
            "paths": [
                "s3:/.../parquet-partition-high/"
            ]
        },
        "parquet",
    )
    print(dyf.count())
Enter fullscreen mode Exit fullscreen mode
324917265
[df] done in 125.9965 s
324917265
[df partition] done in 55.9798 s
Enter fullscreen mode Exit fullscreen mode

DynamicFrame is too slow...

Summary

  • Based on the part 1 (Reading Speed Comparison), spark.read is 27.1 s (for single large file) and 36.3 s (for highly partitioned file), so DynamicFrame is quite slow.
  • Interestingly, the speed of reading partitioned data is faster than single large Parquet file.
💖 💪 🙅 🚩
tmyoda
Tomoya Oda

Posted on July 16, 2023

Join Our Newsletter. No Spam, Only the good stuff.

Sign up to receive the latest update from our blog.

Related