Thiago Panini
Posted on March 20, 2023
Have you ever wished for a set of ready-to-use Spark features and code blocks to speed up your journey of developing Spark applications on AWS services like Glue and EMR? In this article I'll introduce you to sparksnake, a Python package that can be a game changer for Spark application development on AWS.
The idea behind sparksnake
To understand the main reasons for bringing sparksnake to life, let's first take a quick look at the boilerplate code Glue presents whenever a new job is created in the AWS console:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
Now let me show you two simple perspectives from Glue users at different experience levels.
Beginner: it's reasonable to say that the code block above isn't something we see every day outside Glue, right? So, for people trying Glue for the first time, there will be questions about elements like GlueContext, Job, and their special methods.
Experienced developer: even for this group, the "Glue setup" can be painful (especially if you need to repeat it every single time you start developing a new job).
Therefore, the main idea behind sparksnake is to take every common step of a Spark application developed with AWS services and encapsulate it in classes and methods that users can call with a single line of code. In other words, all the boilerplate code shown above is replaced in sparksnake by:
# Importing sparksnake's main class
from sparksnake.manager import SparkETLManager
# Initializing a glue job
spark_manager = SparkETLManager(mode="glue")
spark_manager.init_job()
This is just one of a series of features available in sparksnake! The great thing about it is the ability to call methods and functions that wrap common Spark features in your jobs, whether you're running them on AWS services like Glue and EMR or locally.
The library structure
After this quick overview of sparksnake, it's worth knowing a little bit more about how the library is structured under the hood.
At the time of writing, there are two modules in the package:
- manager: the central module that hosts the SparkETLManager class with common Spark features. It inherits features from other classes based on the operation mode chosen by the user
- glue: a side module that hosts the GlueJobManager class with special features used in Glue jobs
In a common usage pattern, users import the SparkETLManager class and choose an operation mode according to where the Spark application will be developed and deployed. This operation mode tells the SparkETLManager class which features to inherit for AWS services like Glue and EMR, providing users with a customized experience.
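To make that design concrete, here is a simplified sketch of how an operation mode could drive which features a manager class inherits. This is not sparksnake's actual source code, and the class names below are hypothetical:
# Simplified illustration only -- NOT sparksnake's real implementation.
# It sketches a mode-driven inheritance pattern with hypothetical classes.

class GlueFeatures:
    """Stands in for Glue-specific features (think GlueJobManager)."""
    def init_job(self):
        print("Setting up GlueContext, SparkSession and Job objects...")


class LocalFeatures:
    """Stands in for plain Spark features used outside AWS."""
    def init_job(self):
        print("Creating a local SparkSession...")


def build_manager_class(mode: str):
    """Returns a manager class that inherits from the backend chosen by `mode`."""
    backends = {"glue": GlueFeatures, "local": LocalFeatures}
    if mode not in backends:
        raise ValueError(f"Unknown operation mode: {mode}")

    class Manager(backends[mode]):
        """Common Spark features would live here, on top of the inherited backend."""
        def __init__(self):
            self.mode = mode

    return Manager


# Usage: the resulting manager behaves like the Glue-specific class
manager = build_manager_class(mode="glue")()
manager.init_job()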
Features
Now that you know the main concepts behind sparksnake, let's summarize some of its features:
- An enhanced development experience for Spark applications deployed as jobs in AWS services like Glue and EMR
- The possibility to use common Spark operations to improve ETL steps through custom classes and methods
- No need to think too much about hard and complex service setup (e.g. with sparksnake you can have all the elements of a Glue job on AWS with a single line of code)
- Improved application observability with detailed log messages in CloudWatch
- Exception handling already embedded in the library methods
A quickstart
To start using sparksnake, just install it with pip:
pip install sparksnake
Now let's say, for instance, that we are developing a new Glue job on AWS and we want to use sparksnake to make things easier. To provide a useful example of how powerful the library can be, imagine we have a series of data sources to be read into the job. It would be very painful to write multiple lines of code to read each data source from the catalog.
With sparksnake, we can read multiple data sources from the catalog using a single line of code:
# Generating a dictionary of Spark DataFrames from catalog
dfs_dict = spark_manager.generate_dataframes_dict()
# Indexing to get individual DataFrames
df_orders = dfs_dict["orders"]
df_customers = dfs_dict["customers"]
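For comparison, here is roughly what those reads look like with the plain Glue APIs; the database and table names below are just illustrative, and the pattern has to be repeated for every single source:
# Plain Glue equivalent (illustrative database/table names), repeated per source
from pyspark.context import SparkContext
from awsglue.context import GlueContext

glue_context = GlueContext(SparkContext.getOrCreate())

df_orders = glue_context.create_dynamic_frame.from_catalog(
    database="example_db",
    table_name="orders"
).toDF()

df_customers = glue_context.create_dynamic_frame.from_catalog(
    database="example_db",
    table_name="customers"
).toDF()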
And what about writing data to S3 and cataloging it in the Data Catalog? No worries, that can be done with a single line of code too:
# Writing data on S3 and cataloging on Data Catalog
spark_manager.write_and_catalog_data(
df=df_orders,
s3_table_uri="s3://bucket-name/table-name",
output_database_name="db-name",
output_table_name="table-name",
partition_name="partition-name",
output_data_format="data-format" # e.g. "parquet"
)
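Again for comparison, a rough plain-Glue sketch of that write-and-catalog step is shown below; it reuses the glue_context from the earlier comparison, and the bucket, database, table and partition names are placeholders taken from the example above:
# Plain Glue sketch: write to S3 and update the Data Catalog (placeholder names)
from awsglue.dynamicframe import DynamicFrame

# Convert the Spark DataFrame back into a Glue DynamicFrame
dyf_orders = DynamicFrame.fromDF(df_orders, glue_context, "dyf_orders")

# Configure an S3 sink that also updates the Data Catalog
sink = glue_context.getSink(
    connection_type="s3",
    path="s3://bucket-name/table-name",
    enableUpdateCatalog=True,
    updateBehavior="UPDATE_IN_DATABASE",
    partitionKeys=["partition-name"]
)
sink.setCatalogInfo(catalogDatabase="db-name", catalogTableName="table-name")
sink.setFormat("glueparquet")
sink.writeFrame(dyf_orders)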
Once again, those are only two examples from a series of features already available in the library. This article was written to show users a different way to learn and to improve their skills with Spark applications on AWS.
Learn more
There are some useful links and documentation about sparksnake. Check them out: