Installing Python Packages in AWS Glue using AWS CodeArtifact

klescosia

Kyle Escosia

Posted on November 10, 2023


Background of the Problem

I spent quite some time figuring out how to install Python packages in AWS Glue inside a VPC without internet access, and I managed to get it working after some tinkering. To recap, AWS introduced support for installing Python packages via the --additional-python-modules option. While this is a lifesaver for those of us who started working with Glue 1.0, it only works if your Glue Job can connect to the internet.

Given the emphasis on security, a number of customers choose to restrict egress traffic from their VPC to the public internet and need a way to manage the packages used by their data pipelines.

This article focuses on that challenge. It is a step-by-step guide on how to set up your Glue Job to connect to a PyPI mirror via AWS CodeArtifact, allowing you to install packages in a Private Subnet. For this tutorial, a working knowledge of AWS basics (e.g. networking, services) is recommended, but I'll try my best to explain each part.

Let's get started!

Solution Overview

kyle-escosia-aws-codeartifact-aws-glue-integration
Fig. 1. Architecture for the AWS CodeArtifact and AWS Glue Integration

The core of the solution is AWS CodeArtifact, which lets you securely store, publish, and share packages (in this case, PyPI packages) across your private network without connecting directly to the public PyPI repository. This is made possible by VPC Endpoints through PrivateLink connections.

You do need to create endpoints for S3 and CodeArtifact for this to work; otherwise, you'll get errors like Connection timed out.

Here are some resources to help you out with that:

Gateway endpoints for Amazon S3

Create VPC endpoints for CodeArtifact - if via console, kindly follow the same steps as with the S3 Endpoint.
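
If you prefer the AWS CLI over the console, the endpoints described above could be created roughly as follows. This is a minimal sketch, not the exact setup used in this article: the function name create_endpoints is made up for illustration, and every ID in the example call is a placeholder.

```shell
# Sketch only: the function name and all IDs are placeholders for illustration.
# Creates the S3 gateway endpoint plus the two CodeArtifact interface endpoints.
# Arguments: region, VPC ID, route table ID, subnet ID, security group ID.
create_endpoints() {
  region="$1"; vpc_id="$2"; route_table_id="$3"; subnet_id="$4"; sg_id="$5"

  # S3 is reached through a gateway endpoint attached to a route table.
  aws ec2 create-vpc-endpoint --region "$region" \
    --vpc-id "$vpc_id" \
    --vpc-endpoint-type Gateway \
    --service-name "com.amazonaws.${region}.s3" \
    --route-table-ids "$route_table_id" || return 1

  # CodeArtifact needs two interface endpoints: one for the API,
  # one for the repositories that pip actually downloads from.
  for service in codeartifact.api codeartifact.repositories; do
    aws ec2 create-vpc-endpoint --region "$region" \
      --vpc-id "$vpc_id" \
      --vpc-endpoint-type Interface \
      --service-name "com.amazonaws.${region}.${service}" \
      --subnet-ids "$subnet_id" \
      --security-group-ids "$sg_id" \
      --private-dns-enabled || return 1
  done
}

# Example call (placeholder IDs):
# create_endpoints us-west-2 vpc-0123456789abcdef0 rtb-0123456789abcdef0 \
#   subnet-0123456789abcdef0 sg-0123456789abcdef0
```

The security group attached to the interface endpoints should allow inbound HTTPS (port 443) from the subnet where the Glue Job runs.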

What you will need

An AWS account, of course 

Note: Test this on your dev environment first

  1. AWS Glue
  2. AWS CodeArtifact
  3. Docker
  4. AWS Access Keys (with permissions on AWS CodeArtifact)

I won't go over these tools one by one, as I believe ChatGPT can give you those definitions and their uses better than I can.
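
For the access keys in item 4, here is a sketch of a read-only IAM policy that should be enough to log in to CodeArtifact and install packages through it. The function name write_codeartifact_policy and the output file name are made up for illustration, and the wildcard Resource is an assumption you will likely want to narrow down to your own domain and repository. Creating the repository itself needs additional permissions that are not shown here.

```shell
# Sketch only: writes a minimal read-only CodeArtifact policy to the given file.
# "Resource": "*" is a simplification; scope it to your domain/repository ARNs.
write_codeartifact_policy() {
  cat > "$1" <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "codeartifact:GetAuthorizationToken",
        "codeartifact:GetRepositoryEndpoint",
        "codeartifact:ReadFromRepository"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": "sts:GetServiceBearerToken",
      "Resource": "*",
      "Condition": {
        "StringEquals": { "sts:AWSServiceName": "codeartifact.amazonaws.com" }
      }
    }
  ]
}
EOF
}

# Example call (hypothetical file name):
# write_codeartifact_policy codeartifact-read-policy.json
```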

The Solution

In this section, I'll go over the step-by-step solution for each process.

Let's start by setting up our CodeArtifact Repository.

Setting up AWS CodeArtifact

Create a CodeArtifact Repository

kyle-escosia-codeartifact-home

kyle-escosia-codeartifact-home-creation

Fill in the details

  • Repository Name
  • Repository Details (Optional)
  • Public upstream repositories - I chose PyPi

Select the domain

kyle-escosia-codeartifact-domain

Specify your domain name

kyle-escosia-codeartifact-repo-list

You should have the following repositories after creation:

  • <your-repo>
  • pypi-store

Now that's done, you can inspect the created repositories. The pypi-store was automatically created. The <your-repo> is the one that we're interested in since this will contain our Python Packages.
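
The console steps above can also be approximated with the AWS CLI. The sketch below is my reading of what the console does behind the scenes (a pypi-store repository connected to public PyPI, and your own repository with pypi-store as its upstream); the function name create_pypi_mirror and the example names are placeholders, not values from this article.

```shell
# Sketch only: function name and example names are placeholders.
# Arguments: domain name, name of your own repository.
create_pypi_mirror() {
  domain="$1"; repo="$2"

  # The domain groups the repositories together.
  aws codeartifact create-domain --domain "$domain" || return 1

  # pypi-store is the repository that talks to public PyPI.
  aws codeartifact create-repository \
    --domain "$domain" --repository pypi-store || return 1
  aws codeartifact associate-external-connection \
    --domain "$domain" --repository pypi-store \
    --external-connection public:pypi || return 1

  # Your own repository pulls from pypi-store as its upstream.
  aws codeartifact create-repository \
    --domain "$domain" --repository "$repo" \
    --upstreams repositoryName=pypi-store || return 1
}

# Example call (placeholder names):
# create_pypi_mirror my-domain my-repo
```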

With that, let's proceed with configuring your local environment.

Setting up your local environment

Step 1: Install Docker

Install here:
https://docs.docker.com/get-docker/

Step 2: Pull the Amazon Linux 2 Image

$ docker pull amazonlinux:2

Step 3: Run the container

Run the container and interact with its command line using -it. Replace image_name with the image you pulled in Step 2.

$ docker run -it --rm -v /path/on/host:/path/in/container image_name /bin/bash

Some notes:

  • -v /path/on/host:/path/in/container: This is the volume mount option. It mounts a directory from your host (/path/on/host) into the container (/path/in/container). Any changes made in the mounted directory inside the container will be reflected on the host directory and vice versa.

  • --rm: This tells Docker to automatically remove the container when it exits. This means that once you're done with the bash session and exit, the container will be cleaned up, and no container filesystem will be left on your host system. Feel free to remove this option if you do not want your container to behave like that.

Step 4: Install Python 3.10

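# Build dependencies; package names assume Amazon Linux 2. You are root inside the container, so no sudo is needed.
$ yum install -y gcc make wget tar gzip zlib-devel bzip2-devel libffi-devel openssl11-devel
$ wget https://www.python.org/ftp/python/3.10.0/Python-3.10.0.tgz
$ tar -xf Python-3.10.0.tgz
$ cd Python-3.10.0
$ ./configure --enable-optimizations
$ make altinstall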

Note that AWS Glue 4.0 runs Python 3.10. For other Glue versions, kindly refer to the documentation.

Step 5: Install AWS CLI

Using pip (make altinstall installs it as pip3.10)

$ pip3.10 install awscli

Using yum
https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html#getting-started-install-instructions

Step 6: Configure AWS Credentials

Refer to this for creating your access keys:
https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-files.html

After getting the values for the access keys, configure your AWS CLI:

$ aws configure
AWS Access Key ID [None]: AKIAIOSFODNN7EXAMPLE
AWS Secret Access Key [None]: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
Default region name [None]: us-west-2
Default output format [None]: json

Step 7: Connect to Repository

Go back to the AWS Console and click on your created repository.

kyle-escosia-codeartifact-my-code-repository

Click View connection instructions

kyle-escosia-codeartifact-connection-instructions

Copy and run the command in Step 3 of the Connection instructions

$ aws codeartifact login \
--tool pip \
--repository <your-repo-name> \
--domain <your-domain-name> \
--domain-owner <your-account-id> \
--region <your-region>

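Once successfully logged in, kindly note that pip now resolves packages through this repository instead of public PyPI. Every package you pip install is fetched through the repository's upstream and cached in CodeArtifact, in addition to being installed in the Python environment on the Docker container.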

Step 8: Install Python Packages

Install your packages!

kyle-escosia-codeartifact-packages-installed
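
To double-check from the command line that the packages you installed ended up in the repository, something like the following sketch can help. The function name list_cached_packages and the example names are placeholders.

```shell
# Sketch only: function name and example names are placeholders.
# Prints the names of the PyPI packages stored in a repository.
# Arguments: domain name, repository name.
list_cached_packages() {
  aws codeartifact list-packages \
    --domain "$1" --repository "$2" \
    --format pypi \
    --query 'packages[].package' --output text
}

# Example call (placeholder names):
# list_cached_packages my-domain my-repo
```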

Now that the repository is ready, we can install packages from AWS Glue using the PyPI mirror that we created!

AWS CodeArtifact and AWS Glue Integration

This section discusses how you can point the installation of Python packages in AWS Glue to AWS CodeArtifact.

Step 1: Get the Authorization Token

We need to generate an authorization token from AWS CodeArtifact. This is done using this command:

$ aws codeartifact get-authorization-token \
--domain my_domain \
--domain-owner 111122223333 \
--query authorizationToken \
--output text

Note that the maximum duration of this token is 12 hours. And yes, you do need to generate this every day if you are planning to run your jobs daily.

Store the token in a .txt file.
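
Since the token has to be refreshed regularly, it may be handy to script this step. The sketch below fetches a token with the 12-hour maximum duration and assembles the index URL in the same format used for the job parameter in Step 2. The function names fetch_token and build_index_url and the output file name are made up for illustration.

```shell
# Sketch only: function names and the output file name are placeholders.

# Fetch a token valid for the 12-hour maximum (43200 seconds).
# Arguments: domain name, domain owner (account ID).
fetch_token() {
  aws codeartifact get-authorization-token \
    --domain "$1" --domain-owner "$2" \
    --duration-seconds 43200 \
    --query authorizationToken --output text
}

# Build the pip index URL from its parts.
# Arguments: token, domain name, account ID, region, repository name.
build_index_url() {
  printf 'https://aws:%s@%s-%s.d.codeartifact.%s.amazonaws.com/pypi/%s/simple/\n' \
    "$1" "$2" "$3" "$4" "$5"
}

# Example (placeholder values):
# TOKEN=$(fetch_token my_domain 111122223333)
# build_index_url "$TOKEN" my_domain 111122223333 us-west-2 pypi-store > index-url.txt
```

Keep in mind that the resulting URL contains the token, so treat the file like any other credential.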

Step 2: Configure Job Details in Glue Job

Navigate to your Glue Job

I'm assuming you have already configured the Data Connections. If not, kindly configure them before proceeding with this step. The idea is that the Glue Job will run inside the Private Subnet of the VPC.

See screenshot below

kyle-escosia-step-glue-configure-connection

Under Job Parameters, add the following key-value pairs:

Parameter 1


Key - "--additional-python-modules" // without double quotes

Value - "<your-python-package>==<version>"

Parameter 2

Key - "--python-modules-installer-option"

Value - "--no-cache-dir --verbose --index-url https://aws:<CODEARTIFACT-AUTH-TOKEN>@<DOMAIN-NAME>-<ACCOUNT-ID>.d.codeartifact.<REGION-NAME>.amazonaws.com/pypi/pypi-store/simple/"

Change the following values:

  • CODEARTIFACT-AUTH-TOKEN - refer to Step 1
  • DOMAIN-NAME
  • ACCOUNT-ID
  • REGION-NAME

Step 3: Run your Glue Job

After configuring all of that, run your Glue Job and check the CloudWatch Logs to confirm if it's being installed correctly. You should see some text there that says:

Looking in indexes: https://aws:****@test-mirror-1234561234.d.codeartifact.ap-southeast-1.amazonaws.com/pypi/pypi-store/simple/

Kindly make sure that the IAM role you are using for the Glue Job has permission to write to CloudWatch Logs; engineers often forget this. Also tick the Enable logs in CloudWatch option on the Glue Job.
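
If you would rather search the logs from the command line, a sketch like the one below looks for the "Looking in indexes" line. The function name find_pip_index_lines is made up, and the default log group /aws-glue/jobs/output is an assumption on my part; pass the log group your job actually writes to if it differs.

```shell
# Sketch only: the function name is a placeholder, and the default log group
# is an assumption. Pass your job's log group as the first argument if needed.
find_pip_index_lines() {
  log_group="${1:-/aws-glue/jobs/output}"
  aws logs filter-log-events \
    --log-group-name "$log_group" \
    --filter-pattern '"Looking in indexes"' \
    --query 'events[].message' --output text
}

# Example calls:
# find_pip_index_lines
# find_pip_index_lines /aws-glue/jobs/error
```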

Wrap up

That's it! In this article, we demonstrated how we can leverage CodeArtifact for managing Python packages and modules for AWS Glue jobs that run inside a Private Subnet that has no internet access.

Do let me know if you have any questions on this, happy to answer any queries you might have.

Happy Coding, builders!


This blog is authored solely by me and reflects my personal opinions, not those of my employer. All references to products, including names, logos, and trademarks, belong to their respective owners and are used for identification purposes only.
