AWS Cloud9 for Data Engineers

Gaurav Thalpati

Posted on March 17, 2023

This article was originally posted on my Substack. I'm sharing it here with fellow community builders.

I do many quick PoCs for my day-to-day analysis and R&D work, and these often require installing various software, applications, databases, and tools. Until recently, I ran them as Docker containers using Docker Desktop on my Windows laptop. That laptop has 8GB of RAM, which is not ideal for this kind of work, so I've shifted to AWS Cloud9. It's an AWS service that can help you get your PoC work done quickly.

Here is a quick guide on using Cloud9 for Data Engineering PoCs.

What is Cloud9?

AWS Cloud9 is a cloud-based IDE for development work. It runs on an EC2 instance, and you can choose the instance size based on the workload you want to execute.

It provides an IDE to write, run, and debug code, and it supports Python, JavaScript, and many other languages. The best thing is that it integrates with AWS services like S3, so you can easily download files from and upload files to S3 within Cloud9. It also supports collaborative development and has a chat facility for working with other developers.

AWS Cloud9 is not a "data"-specific service and is not much discussed within the data community. But it is one of the best services for making data engineering (DE) work easier and quicker.

Why should you use Cloud9 for DE PoCs?

  • A single interface for activities like writing code, running bash commands, transferring files to S3, running AWS CLI commands, and pushing code to git.
  • Easy to install new tools using Docker.
  • Provision an EC2 instance sized to your needs, with no need for a powerful laptop with 16GB+ RAM. (I generally use m5.xlarge with 16GB RAM.)
  • Start and stop without losing your installed software. Pay only while you are using it.
  • All the good features of EC2, plus the simplicity of doing everything in one place.

Below are some of the DE activities that Cloud9 can be used for.

Use Case #1 | Editing S3 files quickly

Scenario: You want to create and upload some dummy data to S3.

You can easily create a new file in Cloud9 and upload it in just a couple of clicks to your S3 bucket.

If you want to add more columns to this file or add more records, you can download the file, make changes and upload it back - without leaving your Cloud9 terminal.
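The same download, edit, and upload round trip can also be done from the Cloud9 terminal with the AWS CLI. Here is a minimal sketch, assuming a hypothetical bucket `my-poc-bucket` and key `dummy/customers.csv` (both names are made up for illustration):

```shell
# Hypothetical names for illustration; replace with your own bucket and key.
BUCKET="my-poc-bucket"
KEY="dummy/customers.csv"

# Download the object from S3 into the current directory.
s3_download() {
    aws s3 cp "s3://$BUCKET/$KEY" "./$(basename "$KEY")"
}

# Upload the local copy back to the same S3 location.
s3_upload() {
    aws s3 cp "./$(basename "$KEY")" "s3://$BUCKET/$KEY"
}

# Typical flow: run s3_download, edit the file in the IDE, then run s3_upload.
```

Wrapping the two copies in small functions keeps the bucket and key in one place, which helps when you repeat the round trip several times during a PoC.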

Supports multiple AWS Services along with S3

Browse the folder where you want to upload the file

Upload the file from Cloud9 to S3

You can also execute simple shell commands to make changes to files. If you love running sed or awk one-liners, you can definitely try it out!

awk command - my all-time favorite!
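As a small illustration of the one-liner approach, the sketch below creates a dummy CSV file and uses awk to append a column. The file names, columns, and values are made up for this example.

```shell
# Create a small dummy CSV file (hypothetical sample data).
cat > customers.csv <<'EOF'
id,name
1,Asha
2,Ben
EOF

# Append a "country" column: the header row gets the column name,
# every other row gets a default value.
awk 'BEGIN { FS = OFS = "," }
     NR == 1 { print $0, "country"; next }
             { print $0, "IN" }' customers.csv > customers_v2.csv

# Show the result.
cat customers_v2.csv
```

Writing to a second file rather than editing in place keeps the original around, so you can compare both versions before uploading.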

Use Case #2 | Running AWS CLI commands

You can execute AWS CLI commands directly from the Cloud9 terminal without configuring any credentials.

The AWS CLI comes preinstalled on the Amazon Linux 2 instance.

Validate AWS CLI version

Scenario: You want to check the IAM users in your account.

You can execute the AWS CLI commands for the IAM service.

List users in this account
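For reference, one way to list the users from the terminal is sketched below. The function name `list_iam_users` is just a hypothetical wrapper, and the `--query` filter is an optional addition that prints only the user names.

```shell
# Print only the user names of the IAM users in the current account.
# --query takes a JMESPath expression; --output text drops the JSON wrapper.
list_iam_users() {
    aws iam list-users --query 'Users[].UserName' --output text
}

# Usage: list_iam_users
```

Without `--query` and `--output`, the same command returns the full JSON description of each user.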

Use Case #3 | Creating Python Scripts

If you want to create quick Python scripts for your DE work, you don’t need to open PyCharm or other editors. You can simply do it in Cloud9 itself.

Change “Text” to “Python” to switch the editor to Python mode.

Save the file with a .py extension and execute it in the console.

Use Case #4 | Running Docker containers

Scenario: You want to run Spark quickly and try out some simple commands for learning purposes. There are many options, such as Glue, EMR, and Databricks, but one of the easiest is to run Spark in a Docker container on Cloud9.

Docker is preinstalled if you selected Amazon Linux 2 as the platform while creating the Cloud9 instance.

To confirm that Docker is installed, execute the command below.

Validate that docker is installed.
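A quick check along these lines can be sketched as follows. The helper `check_tool` is a hypothetical name introduced here, not something Cloud9 provides.

```shell
# Report whether a command is available on the PATH.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: installed"
    else
        echo "$1: not found"
    fi
}

check_tool docker
check_tool aws
```

Once a tool is reported as installed, `docker --version` or `aws --version` prints the exact version.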

Now you can pull the Spark (Python) image from Docker Hub and start the shell using the commands below.

docker pull apache/spark-py
docker run -it apache/spark-py /opt/spark/bin/pyspark

Run the container, and you are ready with the Spark shell.

You can follow the same approach for running other tools like Kafka, MySQL, and many others.

This Cloud9 instance does not come with docker-compose, which might be required for other software.

To install docker-compose, execute the commands below:

sudo curl -L "https://github.com/docker/compose/releases/latest/download/docker-compose-$(uname -s | tr '[:upper:]' '[:lower:]')-$(uname -m)" -o /usr/local/bin/docker-compose
sudo chmod +x /usr/local/bin/docker-compose
docker-compose version  # Validate that it is working

Note: On Amazon Linux 2, the release file to install is docker-compose-linux-x86_64, which is why the command above lowercases the output of uname -s.

Use Case #5 | Uploading code to git

Finally, when your work is done and you want to save it for future reference, you can easily upload it to git. Cloud9 integrates with git, so you can quickly pull code from and push code to your git repos.

Scenario: You want to push the Python code you created earlier to your git repo.

Configure git from the left-hand pane using the “Source Control” option. The first time, clone the repo by providing the repo link. Cloud9 will then identify your changes and mark them accordingly. You can also use manual commands like add, commit, and push.

Configure the git repo in the source control.

Commit the changes, and add the appropriate message.

Push the change using manual commands or from the UI.

Note: You will have to provide your git user name and personal access token when pushing new changes to your git repo.
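If you prefer the manual route, the add, commit, and push steps can be combined as in the sketch below. It assumes the repo was cloned, so the remote is named `origin`. The function name `commit_and_push` and the file name in the usage line are hypothetical.

```shell
# Stage the given files, commit them, and push the current branch.
# Usage: commit_and_push "commit message" file1 [file2 ...]
commit_and_push() {
    msg=$1
    shift
    git add -- "$@" &&
    git commit -m "$msg" &&
    git push origin HEAD
}

# Example (hypothetical file name):
# commit_and_push "Add sample Python script" hello_de.py
```

Chaining the commands with `&&` stops the sequence at the first failure, so nothing is pushed if the commit did not succeed.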

Validate the changes in your git repo.

Confirm the new files are added to your repo.

Note: Once you finish your work, close the Cloud9 window; otherwise, the instance will keep running. You can also go to the EC2 service in the console and stop the Cloud9 instance directly to save on cost.
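Stopping the instance can also be done with the AWS CLI from any terminal that has permission to manage EC2. Here is a minimal sketch, where the function name and the instance ID in the usage line are hypothetical placeholders:

```shell
# Stop the EC2 instance that backs a Cloud9 environment.
# Usage: stop_cloud9_instance i-0123456789abcdef0   (hypothetical ID)
stop_cloud9_instance() {
    aws ec2 stop-instances --instance-ids "$1"
}
```

Keep in mind that running this from the Cloud9 terminal itself stops the machine you are working on.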

These are just a few use cases of Cloud9 for DE work. You can explore and leverage other features for your day-to-day R&D work, PoCs, learning, and training activities.

Cloud9 is not just for PoCs or educational work. You can also use it in your actual projects, where features like collaborative coding and chat with fellow developers come in handy.

You can try these out, and if you have any comments/suggestions/questions, please let me know.
