Develop your AWS Glue Jobs Locally using Jupyter Notebook

criscarba

Posted on July 27, 2022

This post is mainly intended for professionals who are Data Engineers and use AWS as their cloud provider. It covers, step by step, how to create a local experimentation environment for AWS Glue jobs.

As you well know, AWS offers multiple data-oriented services, among which AWS Glue stands out as a fully managed ETL (extract, transform, and load) service that makes it simple and cost-effective to categorize your data, clean it, enrich it, and move it reliably between various data stores and data streams. AWS Glue consists of a central metadata repository known as the AWS Glue Data Catalog, an ETL engine that automatically generates Python or Scala code, and a flexible scheduler that handles dependency resolution, job monitoring, and retries. AWS Glue is serverless, so there’s no infrastructure to set up or manage.

AWS Glue is designed to work with semi-structured data. It introduces a component called a dynamic frame, which you can use in your ETL scripts. A dynamic frame is similar to an Apache Spark dataframe, which is a data abstraction used to organize data into rows and columns, except that each record is self-describing so no schema is required initially. With dynamic frames, you get schema flexibility and a set of advanced transformations specifically designed for dynamic frames. You can convert between dynamic frames and Spark dataframes, so that you can take advantage of both AWS Glue and Spark transformations to do the kinds of analysis that you want.
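To make the idea concrete, here is a minimal sketch of converting between the two abstractions. It assumes a working local Spark/Glue setup (which the rest of this post walks through), and the sample data is purely illustrative:

from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame

sc = SparkContext.getOrCreate()
glue_context = GlueContext(sc)
spark = glue_context.spark_session

# Illustrative dataframe; in practice this would come from your data source
df = spark.createDataFrame([(1, "a"), (2, None)], ["id", "value"])

# Spark dataframe -> Glue DynamicFrame, to use the Glue-specific transforms
dyf = DynamicFrame.fromDF(df, glue_context, "example")

# ...and back to a Spark dataframe for regular Spark transformations
df_back = dyf.toDF()
df_back.show()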

What is Jupyter Notebook?

The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and explanatory text. Uses include: data cleaning and transformation, numerical simulation, statistical modeling, machine learning and much more.


How can we take advantage of Jupyter Notebook? Basically, inside a Jupyter notebook we can perform all the necessary experimentation for our pipeline (transformations, aggregations, cleansing, enrichment, etc.) and then export it as a Python script (.py) for use in AWS Glue.

Let’s Get Started!

1) Install the Anaconda environment with Python 3.x

NOTE: I recommend using Python 3.7

2) Install Apache Maven


  • Create the MAVEN_HOME system variable (Windows => Edit the system environment variables => Environment Variables).


  • Modify the PATH environment variable so that MAVEN_HOME is visible on it.


3) Install Java 8 Version


  • Create the JAVA_HOME environment variable and make sure to add it to the PATH variable (same process as MAVEN_HOME).


4) Install the Spark distribution that corresponds to your Glue version.


  • Create the SPARK_HOME environment variable and add it to the PATH variable (same process as MAVEN_HOME).


5) Download the Hadoop Binaries

NOTE: Make sure that the “winutils.exe” file is within the “bin” folder of the Hadoop directory
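Before continuing, it can help to confirm from Python that the variables created in the previous steps are actually visible. Here is a small sanity-check sketch; note that the HADOOP_HOME name is my assumption (the post only asks you to download the binaries), but it is the variable Spark on Windows normally looks for alongside winutils.exe:

import os

# HADOOP_HOME is assumed; the other names come from the previous steps
for var in ("MAVEN_HOME", "JAVA_HOME", "SPARK_HOME", "HADOOP_HOME"):
    print(var, "=>", os.environ.get(var, "NOT SET"))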

6) Install Python 3.7 in your Anaconda virtual environment

  • Open an Anaconda Prompt and execute the command conda install python=3.7


NOTE: This process will take ~30 min
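Once the installation finishes, you can confirm the interpreter version from a Python prompt or a notebook cell:

import sys

# Should report 3.7.x after the conda install completes
print(sys.version)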

7) Install “awsglue-local” in your Anaconda virtual environment

  • Open an Anaconda Prompt and run the command pip install awsglue-local


8) Download the Pre_Build_Glue_Jar dependencies (required for creating the Spark session instance)

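The post does not show where these jars end up being referenced, so purely as an assumption on my side: one common way is to point the Spark classpath at the folder where you extracted them when building the session, roughly like this:

from pyspark.sql import SparkSession

# Hypothetical path; replace it with the folder where you extracted the Glue jar dependencies
glue_jars = "C:\\glue-jars\\*"

spark = (
    SparkSession.builder
    .appName("glue-local-dev")
    .config("spark.driver.extraClassPath", glue_jars)
    .config("spark.executor.extraClassPath", glue_jars)
    .getOrCreate()
)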

9) Confirm that you have installed everything successfully

Open a new Anaconda Prompt and execute the following commands:

conda list awsglue-local

java -version

mvn -version

pyspark


10) Once everything is completed, open a Jupyter notebook.

  • Open a new Anaconda Prompt, run the command pip install findspark, and wait until it completes. Once it is done, close the Anaconda Prompt. In most cases this is only required the first time.

  • Re-open the Anaconda Prompt and run the command jupyter-lab to open a Jupyter notebook.


  • Create a Jupyter notebook and execute the following commands (this is needed only once):


import findspark
findspark.init()
import pyspark

You won’t need to execute this code again, since this is a typical step for the initial installation of Spark. The findspark library generates some references on the local machine that link the pyspark library with the Spark bin files.
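Once findspark is configured, a quick way to confirm that the whole local setup works end to end is to create a Spark context and a Glue context in a fresh notebook cell. This is only a sketch of what that first cell might look like:

import findspark
findspark.init()

from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext.getOrCreate()
glue_context = GlueContext(sc)

# If this prints a version without errors, the local Glue/Spark environment is ready
print("Spark version:", sc.version)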

I hope the content is useful for everyone. Thanks a lot!
Cheers!

Cristian Carballo
cristian.carballo3@gmail.com

LinkedIn Profile
