Improving ETL jobs on AWS with sparksnake


Thiago Panini

Posted on March 20, 2023


Have you ever wished for a set of ready-made Spark features and code blocks to improve, once and for all, your experience developing Spark applications on AWS services like Glue and EMR? In this article I'll introduce sparksnake, a powerful Python package that can be a game changer for Spark application development on AWS.

The idea behind sparksnake

To understand the main reasons for bringing sparksnake to life, let's first take a quick look at the boilerplate code presented whenever a new Glue job is created in the AWS console:

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

Now let me show you two simple perspectives from Glue users at different levels.

  • Beginner: it's reasonable to say that the code block above isn't something we see every day outside Glue, right? So, for people trying Glue for the first time, there will be questions about elements like GlueContext, Job, and their special methods.

  • Experienced developer: even for this group, the "Glue setup" can be painful (especially if you need to repeat it every single time you start developing a new job).

Therefore, the main idea behind sparksnake is to take every common step of a Spark application developed with AWS services and encapsulate it in classes and methods that users can call with a single line of code. In other words, all the boilerplate code shown above is replaced in sparksnake by:

# Importing sparksnake's main class
from sparksnake.manager import SparkETLManager

# Initializing a glue job
spark_manager = SparkETLManager(mode="glue")
spark_manager.init_job()

This is just one of a series of features available in sparksnake! The great thing about it is the ability to call methods and functions that wrap common Spark features in your jobs, whether you run them on AWS services like Glue and EMR or locally.

[Image: a simple diagram showing how the sparksnake package inherits features from AWS services like Glue and EMR to provide users a custom experience]

The library structure

After this quick overview of sparksnake, it's worth knowing a little more about how the library is structured under the hood.

At the time of writing, the package has two modules:

  • manager: the central module, hosting the SparkETLManager class with common Spark features. It inherits features from other classes based on an operation mode chosen by the user
  • glue: a companion module, hosting the GlueJobManager class with special features used in Glue jobs

In a typical usage pattern, users import the SparkETLManager class and choose an operation mode according to where the Spark application will be developed and deployed. This operation mode guides the SparkETLManager class to inherit features from AWS services like Glue and EMR, giving users a customized experience.
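
To make that idea concrete, here is a minimal sketch of what such mode-based inheritance could look like. This is illustrative pseudostructure only, not sparksnake's actual source code; the internals of GlueJobManager below and the handling of modes other than "glue" are assumptions for the sake of the example.

# Illustrative sketch only: NOT sparksnake's real internals
class GlueJobManager:
    """Hosts Glue-specific features (job init, catalog reads, etc.)."""

    def init_job(self):
        # In a real Glue job, this step would build the SparkContext,
        # GlueContext, SparkSession, and Job objects, mirroring the
        # boilerplate shown at the top of this article
        print("Initializing Glue job elements...")


class SparkETLManager(GlueJobManager):
    """Central entry point; the mode guides which features apply."""

    def __init__(self, mode: str, **kwargs):
        self.mode = mode.lower()
        if self.mode != "glue":
            # Other modes (e.g. EMR, local) would plug in other bases
            raise NotImplementedError(f"Mode not sketched here: {mode}")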

Features

Now that you know the main concepts behind sparksnake, let's summarize some of its features:

šŸ¤– An enhanced development experience for Spark applications deployed as jobs in AWS services like Glue and EMR
šŸŒŸ The possibility to use common Spark operations to improve ETL steps through custom classes and methods
āš™ļø No need to worry about hard and complex service setup (e.g. with sparksnake you can set up all elements of a Glue job on AWS with a single line of code)
šŸ‘ļøā€šŸ—Øļø Improved application observability with detailed log messages in CloudWatch (see the sketch after this list)
šŸ› ļø Exception handling already embedded in the library's methods

A quickstart

To start using sparksnake, just install it using pip:

pip install sparksnake

Now let's say, for instance, that we're developing a new Glue job on AWS and want to use sparksnake to make things easier. To give a useful example of how powerful the library can be, imagine we have a series of data sources to be read into the job. It would be very painful to write multiple lines of code to read each data source from the Data Catalog.

With sparksnake, we can read multiple data sources from the Data Catalog using a single line of code:

# Generating a dictionary of Spark DataFrames from catalog
dfs_dict = spark_manager.generate_dataframes_dict()

# Indexing to get individual DataFrames
df_orders = dfs_dict["orders"]
df_customers = dfs_dict["customers"]
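From here, the returned objects are ordinary Spark DataFrames, so every usual transformation applies. The join below is plain PySpark, not a sparksnake feature, and the customer_id column is a hypothetical key used purely for illustration:

# Plain PySpark on the DataFrames returned above; the join key
# "customer_id" is a hypothetical column used for illustration
df_orders_by_customer = (
    df_orders
    .join(df_customers, on="customer_id", how="inner")
    .groupBy("customer_id")
    .count()
)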

And what about writing data to S3 and cataloging it in the Data Catalog? No worries, that can be done with a single method call too:

# Writing data on S3 and cataloging on Data Catalog
spark_manager.write_and_catalog_data(
    df=df_orders,
    s3_table_uri="s3://bucket-name/table-name",
    output_database_name="db-name",
    output_table_name="table-name",
    partition_name="partition-name",
    output_data_format="data-format" # e.g. "parquet"
)
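Once the table is cataloged, it becomes queryable like any other catalog table. As a hypothetical verification step, assuming a SparkSession named spark is available in the job (as in the Glue boilerplate shown earlier), you could read the new table back with Spark SQL:

# Hypothetical check: query the freshly cataloged table via Spark SQL,
# using the database and table name placeholders passed above
df_check = spark.sql("SELECT * FROM `db-name`.`table-name` LIMIT 10")
df_check.show()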

Once again, these are only two examples from a series of features already available in the library. This article was written to show users a different way to learn and improve their skills with Spark applications on AWS.

Learn more

There are some useful links and documentation about sparksnake; check out the project's GitHub repository and its official documentation pages.
