An ETL Job using AWS Glue Studio to inner join DynamoDB tables, Apply Queries and Store the result in S3


Nandini Rajaram

Posted on March 23, 2023


ETL Jobs

ETL stands for extract, transform, and load. It is the most common paradigm for combining data from multiple systems into a single database, data store, or warehouse for long-term storage or analytics.

Extraction
Extraction is the process of retrieving data from one or more sources. After retrieval, the ETL job loads the data into a staging area and prepares it for the next phase.
Transformation
Transformation is the process of mapping, reformatting and other operations to prepare data for consumption.
Loading
Loading involves successfully inserting the transformed data into the target database, data store or data warehouse.

All of this work is processed in what business intelligence developers call an ETL job.
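As a mental model, here is a minimal, self-contained sketch of an ETL job in plain Python (the file names, column names, and SQLite target are hypothetical, chosen only to illustrate the three phases):

import csv
import sqlite3

def extract(path):
    # Extract: retrieve raw rows from a CSV source
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    # Transform: reformat and clean the rows for consumption
    return [{"name": r["name"].strip().title(), "price": float(r["price"])}
            for r in rows]

def load(rows, db_path):
    # Load: insert the transformed rows into the target store
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS rings (name TEXT, price REAL)")
    con.executemany("INSERT INTO rings VALUES (:name, :price)", rows)
    con.commit()
    con.close()

load(transform(extract("rings.csv")), "warehouse.db")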


AWS Glue

AWS Glue is the key service that integrates data across data lakes, data warehouses, and purpose-built data stores, simplifying data movement from multiple sources.

AWS Glue Studio

AWS Glue Studio is an easy-to-use graphical interface for creating, running, and monitoring AWS Glue extract, transform, and load (ETL) jobs.



Olive wants to purchase a finger ring with her birthstone.

The details of the birthstones and ring prices are stored in two DynamoDB tables.

Now, we are going to create an ETL job using AWS Glue Studio that inner joins the two DynamoDB tables, Birthstone and RingsTable, on the condition RingsTable.RingStoneName = Birthstone.StoneName.

After that, we will apply a query on the result table to find the ring stone and its price matching Olive's birth month, and then save the result to the S3 bucket named "gemstonejobresult".

We use Git for source code management, so finally we will create a remote Git repository on AWS CodeCommit and push the job to it.

Prerequisites

  • Two tables created using the AWS DynamoDB service: Birthstone, RingsTable

(Recipe: https://dev.to/aws-builders/cook-a-recipe-with-aws-dynamo-db-table-1kn1)

  • An S3 bucket to store the result - gemstonejobresult
  • A Git repository on AWS CodeCommit - gemstone_job_repository
  • An IAM role named AWSGlueServiceRoledemo with permissions to your Amazon S3 sources, targets, temporary directory, scripts, and any libraries used by the job.

Let's set up the prerequisites first.

1: DynamoDB Tables

  • Here, I have created two tables named "Birthstone" and "RingsTable" and added some items to them

  • Make sure to enable PITR (point-in-time recovery) on both tables
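If you prefer to enable PITR from code rather than the console, a minimal boto3 sketch (assuming default credentials and region) looks like this:

import boto3

dynamodb = boto3.client("dynamodb")

# Enable point-in-time recovery on both source tables; Glue Studio's
# table dropdown only lists DynamoDB tables that have PITR enabled
for table in ("Birthstone", "RingsTable"):
    dynamodb.update_continuous_backups(
        TableName=table,
        PointInTimeRecoverySpecification={"PointInTimeRecoveryEnabled": True},
    )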

The table schema details are provided below.


Table Data



2: S3 Bucket for Storing the Query Result

  • Navigate to the S3 console


  • Click the Create bucket button

  • You will be taken to the bucket creation page

  • Specify a unique name for the bucket and leave the rest of the fields at their default values


  • Scroll down and click the Create bucket button


  • You will be returned to the bucket list page on the S3 console, and the newly created bucket will be listed there
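The same bucket can be created with boto3; a minimal sketch (bucket names are globally unique, so substitute your own, and note that regions other than us-east-1 also require a CreateBucketConfiguration with a LocationConstraint):

import boto3

s3 = boto3.client("s3")

# Create the bucket that will hold the ETL job's query result
s3.create_bucket(Bucket="gemstonejobresult")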

3: A Git repository on AWS CodeCommit

AWS CodeCommit

  • AWS CodeCommit is a fully managed source control service that makes it easy for companies to host secure and highly scalable private Git repositories.
  • CodeCommit eliminates the need to operate your own source control system or worry about scaling its infrastructure.

Repository

  • In CodeCommit, the fundamental version control object is a repository.
  • It's where you keep your project's code and files safe. It also keeps track of your project's history, from the first commit to the most recent changes.
  • You can collaborate on a project by sharing your repository with other users.

Before you can push changes to a CodeCommit repository, you must configure an IAM user in your AWS account, or set up federated access or temporary credentials.
DO NOT USE THE ROOT USER

  • Navigate to the CodeCommit console


  • Click the Create repository button to create a Git repository for storing our code


  • Specify the repository name; I am naming it "gemstone_job_repository". Then click the Create repository button


  • Let's create a ReadMe.md file and commit it to the repository. Click the Create file button

  • Add some content to the file and save it as ReadMe.md

  • The file will be saved to the main branch

  • Specify the author name, email address, and commit message (though optional, it is good practice to provide a commit message)

  • Click the Commit changes button


Branches in Git
Branches can be used to develop new features, save a specific version of your project from a specific commit, and much more.

When you create your first commit in the AWS CodeCommit console, a default branch named main is created for you.

Now we have created a repository on the AWS console and made a first commit to it by saving a ReadMe.md file to the main branch.
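For reference, here is a boto3 sketch of the same repository setup (the description, file content, author name, and email are placeholders):

import boto3

codecommit = boto3.client("codecommit")

# Create the repository
codecommit.create_repository(
    repositoryName="gemstone_job_repository",
    repositoryDescription="AWS Glue Studio gemstone ETL job",
)

# Make the first commit; on an empty repository this also creates the main branch
codecommit.put_file(
    repositoryName="gemstone_job_repository",
    branchName="main",
    filePath="ReadMe.md",
    fileContent=b"Gemstone ETL job",
    commitMessage="Add ReadMe",
    name="Your Name",
    email="you@example.com",
)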

4: IAM role

  • Navigate to the IAM console


  • Click the Roles link on the left pane


  • Click the Create role button to create a role


  • Select AWS service as the trusted entity type
  • Select Glue from the Use cases for other AWS services dropdown


  • Add the permissions below to the role and click the Next button


  • Specify the role name "AWSGlueServiceRoledemo" and click the Create role button
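If you would rather script the role, a boto3 sketch looks like this (the AWS managed AWSGlueServiceRole policy is attached here; add S3 permissions for your own bucket, temporary directory, and scripts as needed):

import json
import boto3

iam = boto3.client("iam")

# Trust policy that lets the AWS Glue service assume this role
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "glue.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

iam.create_role(
    RoleName="AWSGlueServiceRoledemo",
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)

# Attach the AWS managed policy for the Glue service role
iam.attach_role_policy(
    RoleName="AWSGlueServiceRoledemo",
    PolicyArn="arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole",
)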

Create an ETL job using AWS Glue Studio to inner join DynamoDB Tables

  • Navigate to AWS Glue Studio from the AWS console


  • Click Jobs on the left pane


  • You will be taken to the Jobs page, where you can create ETL jobs
  • By default, the Visual with a source and target option will be selected

  • Select DynamoDB as the Source and Amazon S3 as the Target in Create job

  • Click the Create button


  • An untitled job with DynamoDB as the source and Amazon S3 as the target will be created for you, and you will be taken to the job page where you can edit the configurations

Visual Tab

  • First, you will land on the Visual tab

AWS Glue offers an API that lets customers create data integration jobs from a JSON object describing a visual, step-by-step workflow.
Clients can then work with these jobs using the visual editor in AWS Glue Studio.

  • Now we are going to create a job that reads data from the two DynamoDB tables, Birthstone and RingsTable.

  • We are going to inner join these two tables on the stone name attribute
    and apply a query to find the birthstone suitable for Olive and its price

  • First we will add the required source blocks, action blocks, and target blocks

  • Then store the query result in the S3 bucket

    • Nodes required:
      • Two Data source - DynamoDB nodes (RingsTable, Birthstone)
      • Transform - Join (for applying the inner join)
      • Transform - SQL Query (for querying the result table)
      • Data target - S3 bucket
  • Click on the Source and add one more DynamoDB table

  • Remove the ApplyMapping block if it is already there, since we don't need it for this job

    (Click on the ApplyMapping block and click Remove)

  • From Action, select Join and add it

  • From Action, select SQL Query and add it

  • The S3 bucket will already be on the flow as the target; leave it as it is


Now, let's configure the flow.

Source Nodes

  • Click on one of the Data source - DynamoDB nodes
  • On the right pane, configure the node properties as shown in the picture below


  • Select the Choose the DynamoDB table directly radio button for the DynamoDB source

  • Select the Birthstone table from the dropdown. Make sure PITR is enabled for the Birthstone table; otherwise it will not be populated in the dropdown. (If you do not see the table, go to the DynamoDB service, enable PITR for the Birthstone and RingsTable tables, then refresh this page to see them in the dropdown.)

  • Click on the second DynamoDB data source node

  • On the right pane, configure the node properties as shown in the picture below


  • Now we have configured the source nodes successfully

Action Nodes


  • Click on the Transform Join node
  • On the right pane, click on the Node properties tab
  • Select the two DynamoDB tables as node parents


  • Click on the Transform tab, make sure Inner join is selected as the join type, and click the Add condition button


  • Specify the inner join condition as shown in the image below


  • Click on the Transform SQL Query node
  • On the right pane, click on the Node properties tab
  • Select the Transform checkbox from the Node parents dropdown list; you will see the Join node getting selected


  • Click on the Transform tab to specify the query on the inner joined table

  • As you can see here, the inner joined table is aliased as myDataSource

  • Now let's write the SQL query to get the birthstone name and the price of the stone matching Olive's birth month


  • Olive's birth month is December since she was born on Christmas day. Let's create a query to get the birthstone gem and its price for that month:

select RingStoneName, Price from myDataSource where BirthMonth = 'December' limit 1


Now we have configured the transform nodes successfully.

Target Node

  • We need to store the query result in the S3 bucket


  • Click on the Data target - S3 bucket node

  • On the right pane, click on the Node properties tab and tick the SQL Query checkbox


  • Click on the Data target properties - S3 tab
  • Configure it as shown in the image below


  • As you can see, the format should be Parquet
  • The bucket location should be selected from S3 so that a Parquet file with the result data is saved to the S3 bucket once the job execution completes successfully

  • After all the configuration, the Visual tab will look like the image below



Script Tab

  • Based on the job configuration, a Python script is generated on the Script tab
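The exact script Glue Studio generates will differ in details, but its shape is roughly the following hand-written sketch (the attribute names StoneName, RingStoneName, BirthMonth, and Price are assumptions based on the tables above, and the classic DynamoDB connector is used for brevity):

import sys
from awsglue.transforms import Join
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.dynamicframe import DynamicFrame

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Source nodes: read both DynamoDB tables
birthstone = glueContext.create_dynamic_frame.from_options(
    connection_type="dynamodb",
    connection_options={"dynamodb.input.tableName": "Birthstone"},
)
rings = glueContext.create_dynamic_frame.from_options(
    connection_type="dynamodb",
    connection_options={"dynamodb.input.tableName": "RingsTable"},
)

# Join node: inner join the two tables on the stone name
joined = Join.apply(rings, birthstone, "RingStoneName", "StoneName")

# SQL Query node: the joined table is exposed under the alias myDataSource
joined.toDF().createOrReplaceTempView("myDataSource")
result_df = spark.sql(
    "select RingStoneName, Price from myDataSource "
    "where BirthMonth = 'December' limit 1"
)
result = DynamicFrame.fromDF(result_df, glueContext, "result")

# Target node: write the result to the S3 bucket as Parquet
glueContext.write_dynamic_frame.from_options(
    frame=result,
    connection_type="s3",
    connection_options={"path": "s3://gemstonejobresult/"},
    format="parquet",
)

job.commit()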


Job details

Now, navigate to the Job details tab to specify the job details.

By default, the job name is Untitled; specify a job name in the Name field.
Specify a description as well.

Select the AWSGlueServiceRoledemo role which we already created as part of the prerequisites.

Leave the remaining settings at their defaults.

Click the Save button to save the job.



Version Control Tab

Navigate to the Version control tab.

Here, you need to do the Git configuration.
Select the AWS CodeCommit option for the Git service.


In the Repository configuration section, choose the Git repository that we created in AWS CodeCommit (refer to the prerequisites).

Choose main as the branch.


Click the Save button on the top right.

Click the Actions button and choose the Push to repository option.

The job will be pushed to the repository, and a JSON file will be saved to the code repository.


Navigate to AWS CodeCommit and verify that the files are saved to the repository.
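A quick way to check the repository contents from code is the CodeCommit get_folder API; a minimal boto3 sketch:

import boto3

codecommit = boto3.client("codecommit")

# List the files now at the root of the repository's default branch
folder = codecommit.get_folder(
    repositoryName="gemstone_job_repository",
    folderPath="/",
)
for f in folder["files"]:
    print(f["absolutePath"])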



Executing the ETL Job and Verifying the Result in the S3 Bucket

Once you have saved the job, it's time to run it.
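The steps below use the console; if you prefer to start and watch the run programmatically, a minimal boto3 sketch (the job name gemstone_job is a placeholder for whatever you entered on the Job details tab) looks like this:

import time
import boto3

glue = boto3.client("glue")

# Start a run of the saved job
run_id = glue.start_job_run(JobName="gemstone_job")["JobRunId"]

# Poll until the run reaches a terminal state
while True:
    run = glue.get_job_run(JobName="gemstone_job", RunId=run_id)
    state = run["JobRun"]["JobRunState"]
    print(state)
    if state in ("SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT"):
        break
    time.sleep(30)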

  • Click the Run button on the top right pane


  • You will see the following message


  • Navigate to the Runs tab to see the job status


  • Once the run status becomes Succeeded, navigate to the S3 console and click on the gemstonejobresult bucket

  • You will be able to see the result saved as a Parquet file


  • Download the Parquet file and save it locally by clicking the Download button


  • Now you will be able to view the result
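One way to inspect the downloaded file locally is pandas (this assumes pyarrow or fastparquet is installed; the file name below is a placeholder for the part file Glue wrote):

import pandas as pd

# Load the downloaded Parquet part file and show the query result
df = pd.read_parquet("run-part-file.parquet")
print(df)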


Reading the Parquet File Object Using S3 Select

Alternatively, you can query the Parquet file with S3 Select.

  • Navigate to the S3 console and click on the gemstonejobresult bucket

  • Click on the name link of the Parquet file where the result is saved


  • Now, click the Object actions button and select Query with S3 Select from the list

  • You will be taken to the Query with S3 Select page. By default, Apache Parquet is chosen in the input settings; keep it as it is

  • You can choose the format in the output settings.

S3 Select allows you to specify how the query results are serialized in the output. You can choose formats like JSON or CSV. This determines how the data will be structured in the output files.

  • Let's choose JSON here so that the query result is displayed in JSON format


  • We are going to read the Parquet file using the select query below, which is displayed in the query editor by default.

SELECT * FROM s3object s LIMIT 5

  • Now click the Run SQL query button to execute the query


  • Once the query executes successfully, you will see the result in JSON format (the format chosen in the output settings)


  • You can download the result by clicking the Download results button
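The same S3 Select query can also be run from code with boto3 (the object key below is a placeholder for the Parquet part file in your bucket):

import boto3

s3 = boto3.client("s3")

response = s3.select_object_content(
    Bucket="gemstonejobresult",
    Key="run-part-file.parquet",
    ExpressionType="SQL",
    Expression="SELECT * FROM s3object s LIMIT 5",
    InputSerialization={"Parquet": {}},
    OutputSerialization={"JSON": {}},
)

# The response payload is an event stream; print the record chunks
for event in response["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode())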

Yay, we found it! Olive needs to purchase a Turquoise ring, and its price is $612.


And that was Olive’s story.


Challenge Task

  • Set up an ETL job by following the above instructions

  • Find out the month corresponding to the birthstone that is priced highest

  • Post a screenshot of the answer in the comments section
