Build Your Movie Recommendation System Using Amazon Personalize, MongoDB Atlas, and AWS Glue

Authors:
Siddharth Joshi (Technical Account Manager at AWS)
Sornavel Perumal (Technical Account Manager at AWS)

Contributor:
Babu Srinivasan (Senior Partner Solutions Architect at MongoDB)

In today's data-driven world, personalized recommendations have become an integral part of enhancing user experiences. With the power of cloud computing and advanced database solutions, building your own personalized movie recommendation system is now more achievable than ever. In this article, we'll explore the integration of MongoDB Atlas, AWS Glue, and Amazon Personalize to create a robust and scalable recommendation engine.

Understanding the components

Before diving into the integration process, let's briefly understand the key components involved in our movie recommendation system:

MongoDB Atlas is a fully managed, cloud-based database service that enables seamless deployment, scaling, and maintenance of MongoDB databases.

AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it easy to prepare and load data for analysis. It helps bridge the gap between our MongoDB Atlas data and the services we'll use for recommendation.

Amazon Personalize is a machine learning service that makes it easy to build, train, and deploy personalized recommendation models. It will analyze the data from MongoDB Atlas and generate personalized movie recommendations for users.

Reference architecture

This architecture ingests data from MongoDB Atlas to power personalized recommendations. An AWS Glue Spark job extracts the data, transforms it (filtering, cleaning, joining), and loads it into Amazon S3. The prepared data then feeds your chosen AI/ML service (such as Amazon SageMaker or Amazon Personalize), which generates the recommendations.

Prerequisites

To follow this tutorial, you should be familiar with MongoDB Atlas and with the Amazon Web Services (AWS) services used in the reference architecture above.

Setting up MongoDB Atlas for movie data

Begin by creating a MongoDB Atlas database to store information about movies, genres, and user interactions. Populate the database with relevant data, ensuring it is well-structured for the recommendation model.
For this article, we will be using the MovieLens dataset.

a. If you do not already have one, you can sign up for a MongoDB Atlas account.

b. Create a database named movielens.

c. Note down the connection URI for your MongoDB Atlas cluster.

d. Download the MovieLens dataset.

e. Unzip the file locally and run the Python script to upload the data to MongoDB Atlas. (In the script, replace the connection string placeholder and path_to_extracted_files with your own values.)
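
The upload script mentioned in step e is not reproduced in this article. The following is a minimal sketch of what such a script could look like, not the authors' original. It assumes the pymongo driver is installed, that the extracted folder contains movies.csv, ratings.csv, and tags.csv with the standard MovieLens column names, and that each file is loaded into a collection of the same name. The function names, FIELD_TYPES, and batch_size are illustrative.

```python
import csv
import os

# Assumed MovieLens column types; every other column (title, genres, tag) stays a string.
FIELD_TYPES = {"userId": int, "movieId": int, "rating": float, "timestamp": int}


def read_csv_documents(path):
    """Read a CSV file and return one dict per row, with numeric columns converted."""
    with open(path, newline="", encoding="utf-8") as handle:
        return [
            {key: FIELD_TYPES.get(key, str)(value) for key, value in row.items()}
            for row in csv.DictReader(handle)
        ]


def upload_movielens(uri, data_dir, db_name="movielens",
                     collections=("movies", "ratings", "tags"), batch_size=10_000):
    """Load <collection>.csv from data_dir into the matching Atlas collection."""
    # Imported here so that read_csv_documents can be used without the driver installed.
    from pymongo import MongoClient

    client = MongoClient(uri)
    database = client[db_name]
    for name in collections:
        documents = read_csv_documents(os.path.join(data_dir, f"{name}.csv"))
        # Insert in batches to keep each request to Atlas reasonably small.
        for start in range(0, len(documents), batch_size):
            database[name].insert_many(documents[start:start + batch_size])
        print(f"Inserted {len(documents)} documents into {db_name}.{name}")
    client.close()
```

To run it, call upload_movielens with the connection URI from step c and the folder from step e, for example upload_movielens(my_connection_uri, "path_to_extracted_files"). The larger MovieLens releases contain millions of ratings, so reading a whole file into memory as done here may need to be replaced with streaming for those.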

Using AWS Glue for data preparation

AWS Glue comes into play for ETL operations. Create a Glue job to extract data from MongoDB, transform it into a suitable format for training the recommendation model, and load it into an Amazon S3 bucket.

a. Create an S3 bucket to store the processed file from Glue.

b. Store your MongoDB connection properties and credentials in AWS Secrets Manager.

c. Create a new AWS Glue Studio job with the Spark script editor option.

From the AWS Glue Studio console, select Jobs from the menu and select “Script editor.”

Select the Spark option from the dropdown menu and click Create script.

d. Create the ETL job by replacing the default script in the editor with your Python ETL script.

e. Specify input arguments.

Key : Value
--BUCKET_NAME : <bucket_name>
--OUTPUT_FILENAME1 : ratings
--OUTPUT_FILENAME2 : items
--COLLECTION_NAME3 : movies
--COLLECTION_NAME2 : tags
--COLLECTION_NAME1 : ratings
--SECRET_NAME : <name_of_secret>

f. Run the job.
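
The Glue script itself is not shown in this article. To illustrate only the transform step, here is a plain-Python sketch of the reshaping the job has to perform so that the output files match what Amazon Personalize expects for a Video on demand dataset group. It is not the Glue job: in the real job the same mapping would be expressed with Spark operations. The sketch assumes MovieLens field names (userId, movieId, timestamp, genres), treats every rating as a "Watch" event, and uses a placeholder CREATION_TIMESTAMP because MovieLens does not provide one. All function names are hypothetical.

```python
import csv
import io

INTERACTION_FIELDS = ["USER_ID", "ITEM_ID", "TIMESTAMP", "EVENT_TYPE"]
ITEM_FIELDS = ["ITEM_ID", "GENRES", "CREATION_TIMESTAMP"]


def to_interactions(ratings, event_type="Watch"):
    """Map MovieLens rating documents to Personalize interaction rows."""
    return [
        {
            "USER_ID": str(rating["userId"]),
            "ITEM_ID": str(rating["movieId"]),
            "TIMESTAMP": int(rating["timestamp"]),  # Unix epoch seconds
            "EVENT_TYPE": event_type,  # assumption: every rating counts as a watch
        }
        for rating in ratings
    ]


def to_items(movies, creation_timestamp=0):
    """Map MovieLens movie documents to Personalize item rows."""
    return [
        {
            "ITEM_ID": str(movie["movieId"]),
            "GENRES": movie["genres"],  # MovieLens already separates genres with "|"
            "CREATION_TIMESTAMP": creation_timestamp,  # placeholder value
        }
        for movie in movies
    ]


def to_csv(rows, fieldnames):
    """Serialize rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

With the input arguments above, the interactions output corresponds to ratings.csv and the items output to items.csv in the S3 bucket.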

Create a dataset group and an interactions dataset in Amazon Personalize

a. Go to Amazon Personalize in your AWS console.
b. In the left navigation pane, click on Dataset groups.
c. Click the Create dataset group button. Enter movie-datasetgroup as the name for your dataset group. Select Video on demand as the Domain.

In Amazon Personalize, click Create dataset group, provide the name of your dataset group, and select the “Video on demand” option. Click Create group.

d. After creating the dataset group, you need to add datasets to it. Click on the dataset group you just created.
e. Click on the Create dataset button, and select Item interactions dataset.

Select the “Item interactions dataset” from the dropdown menu.

f. Select Import data directly into Amazon Personalize datasets as the Import method.

Choose the “Import data directly into Amazon Personalize datasets.”

g. Provide movie-interactions as Dataset name and Schema name.

Give a name to the Dataset and select “Create a new domain schema by modifying the existing default schema for your domain.”

h. To configure your dataset import job, select Import data from S3 and provide a name for the import job. Specify the path of ratings.csv in your S3 bucket as the data location, and specify an IAM role that has access to the S3 bucket.

Select “Import data from S3.”

Select the S3 data location.
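
The console steps above can also be scripted. The following is a rough sketch, assuming a boto3 "personalize" client is passed in, that the dataset group already exists and is active, and that the interactions file uses the four columns shown in the schema. The import job name and the function name are hypothetical, and the schema is an assumed minimal Video on demand interactions schema rather than the one the console generates.

```python
import json

# Assumed minimal interactions schema for a Video on demand dataset group.
INTERACTIONS_SCHEMA = {
    "type": "record",
    "name": "Interactions",
    "namespace": "com.amazonaws.personalize.schema",
    "fields": [
        {"name": "USER_ID", "type": "string"},
        {"name": "ITEM_ID", "type": "string"},
        {"name": "TIMESTAMP", "type": "long"},
        {"name": "EVENT_TYPE", "type": "string"},
    ],
    "version": "1.0",
}


def create_interactions_dataset(personalize, dataset_group_arn, s3_path, role_arn):
    """Create the schema, the dataset, and the import job for ratings.csv.

    `personalize` is a boto3 "personalize" client (or any object with the same methods).
    """
    schema_arn = personalize.create_schema(
        name="movie-interactions",
        schema=json.dumps(INTERACTIONS_SCHEMA),
        domain="VIDEO_ON_DEMAND",
    )["schemaArn"]

    dataset_arn = personalize.create_dataset(
        name="movie-interactions",
        schemaArn=schema_arn,
        datasetGroupArn=dataset_group_arn,
        datasetType="Interactions",
    )["datasetArn"]

    # In practice, wait for the dataset to become ACTIVE before starting the import.
    import_job_arn = personalize.create_dataset_import_job(
        jobName="movie-interactions-import",  # hypothetical job name
        datasetArn=dataset_arn,
        dataSource={"dataLocation": s3_path},
        roleArn=role_arn,
    )["datasetImportJobArn"]

    return dataset_arn, import_job_arn
```

Passing the client in as a parameter keeps the function easy to exercise with a stub before pointing it at a real account.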

Create the User dataset

The User dataset is the dataset for all the users listed in the system. MovieLens does not provide a user dataset so we will be using one that has been created for this post. In the real world, this dataset would be coming from your application.

a. Upload the users.csv file to the S3 bucket you created earlier.
b. Follow the steps above to create the User dataset.

Select “Users dataset” from the dropdown menu.

Choose “Import data directly into Amazon Personalize datasets.”

Provide a name for the Dataset and choose “Create a new domain schema by modifying the existing default schema for your domain.”

c. Ensure that the Schema definition looks like the following:

{
  "type": "record",
  "name": "Users",
  "namespace": "com.amazonaws.personalize.schema",
  "fields": [
    { "name": "USER_ID", "type": "string" },
    { "name": "SUBSCRIPTION_MODEL", "type": "string", "categorical": true }
  ],
  "version": "1.0"
}

d. Provide S3 as the data import source and the S3 path for users.csv as the data location.

Select the “Import data from S3” option and provide the name to “Dataset import job name.”

Select the S3 bucket location and provide the IAM role.

Create the Items dataset

The Items dataset refers to a list of all the movies available in our application. Our Glue ETL job has converted the MongoDB collection “movies” into a .csv file in a format usable with Amazon Personalize.
Follow the same steps as above to create the Items dataset. For the data location, provide the path of items.csv in your S3 bucket.

a. Select the “Items dataset” from the dropdown menu.

b. Select the “Import data directly into Amazon Personalize datasets” option.

c. Provide a name — “movie-item” — to the Dataset name and select the “Create a new domain schema by modifying the existing default schema for your domain” option.

d. Select the “Import data from S3” option and provide the Dataset import job name — “movie-ds-item.”

Select the S3 location for the Data location and provide the IAM Role.

e. Before proceeding to the next step, you should wait until all three datasets become active.

Ensure you can see “3/3 datasets active” in green.
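
If you script the setup instead of watching the console, the wait can be expressed as a polling loop. This is a minimal sketch, assuming a boto3 "personalize" client and the ARNs of the three dataset import jobs. It polls the import jobs rather than the dataset resources, on the assumption that what you need to confirm is that the data has finished importing. The function name and the timing parameters are illustrative.

```python
import time


def wait_for_import_jobs(personalize, import_job_arns, poll_seconds=30,
                         timeout_seconds=3600, sleep=time.sleep):
    """Block until every dataset import job is ACTIVE.

    Raises RuntimeError if a job fails and TimeoutError if the wait takes too long.
    """
    pending = set(import_job_arns)
    waited = 0
    while pending:
        # sorted() makes a copy, so removing finished jobs inside the loop is safe.
        for arn in sorted(pending):
            job = personalize.describe_dataset_import_job(datasetImportJobArn=arn)
            status = job["datasetImportJob"]["status"]
            if status == "ACTIVE":
                pending.discard(arn)
            elif status.endswith("FAILED"):
                raise RuntimeError(f"Import job {arn} ended with status {status}")
        if pending:
            if waited >= timeout_seconds:
                raise TimeoutError(f"Still waiting on: {sorted(pending)}")
            sleep(poll_seconds)
            waited += poll_seconds
```

The sleep function is a parameter only so the loop can be tested without real delays; in normal use the default is fine.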

Run data analysis

Next, have Amazon Personalize analyze the imported data, that is, the Users, Item interactions, and Items datasets.
Start the data analysis by clicking on Run data analysis.

Ensure the data analysis run has been completed successfully.

Create recommenders

Create the recommenders after the Domain dataset group is created successfully. A recommender is a Domain dataset group resource that generates recommendations. Use a recommender in the application to get real-time recommendations with the GetRecommendations operation.

a. Select Use Video on demand recommenders, matching the domain chosen for the dataset group.

b. For the use case, select Because you watched X and provide a name to the recommender.

c. You can leave Advanced configuration as the default.

d. Review the configuration and click on Create recommenders.

e. Before proceeding to the next step, please wait until the recommender becomes active.

Ensure the status is “Active” for the movie-recommender.
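
For reference, the same recommender can be created through the API. The sketch below assumes a boto3 "personalize" client and the ARN of the dataset group. The recipe ARN is the one I understand to correspond to the "Because you watched X" use case, so verify it against the Amazon Personalize documentation before relying on it. The function names are hypothetical.

```python
# Assumed recipe ARN for the Video on demand "Because you watched X" use case.
BECAUSE_YOU_WATCHED_X = "arn:aws:personalize:::recipe/aws-vod-because-you-watched-x"


def create_movie_recommender(personalize, dataset_group_arn, name="movie-recommender"):
    """Create the recommender with default configuration and return its ARN."""
    response = personalize.create_recommender(
        name=name,
        datasetGroupArn=dataset_group_arn,
        recipeArn=BECAUSE_YOU_WATCHED_X,
    )
    return response["recommenderArn"]


def recommender_status(personalize, recommender_arn):
    """Return the current status string, for example ACTIVE once training finishes."""
    description = personalize.describe_recommender(recommenderArn=recommender_arn)
    return description["recommender"]["status"]
```

Calling recommender_status periodically until it returns ACTIVE is the scripted equivalent of step e.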

Test recommender

Now that we have created a recommender, we are ready to get recommendations. In a real-world scenario, our application would be sending requests to Amazon Personalize and getting recommendations. For the post, we will test it using the Amazon Personalize console.
a. Go to the Amazon Personalize console.
b. Click on Recommenders under movie-datasetgroup and select movie-recommender.

c. Click on Test.

d. Enter a valid user ID and movie ID (Item ID), and click on Get recommendations.

e. The recommender returns its recommendations as a list of movie IDs.

In a real-world scenario, your application will map these movie IDs to movie names and will show them as recommendations to users.
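
As a sketch of what that application code might look like, the function below calls the GetRecommendations operation and maps the returned IDs to titles. It assumes a boto3 "personalize-runtime" client and a dictionary of movie ID to title that your application has already loaded (for example, from the movies collection in MongoDB Atlas). The function name and the fallback label are hypothetical.

```python
def recommend_titles(runtime, recommender_arn, user_id, item_id, titles, num_results=10):
    """Return (movie ID, title) pairs recommended for a user who watched item_id.

    `runtime` is a boto3 "personalize-runtime" client; `titles` maps movie ID
    strings to movie names.
    """
    response = runtime.get_recommendations(
        recommenderArn=recommender_arn,
        userId=str(user_id),
        itemId=str(item_id),
        numResults=num_results,
    )
    return [
        (item["itemId"], titles.get(item["itemId"], "<unknown title>"))
        for item in response["itemList"]
    ]
```

Item IDs come back as strings, so the title lookup should be keyed by string IDs as well.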

Conclusion

In this post, we explored the integration of MongoDB, AWS Glue, and Amazon Personalize to build a personalized movie recommendation system. This powerful combination allows you to leverage the flexibility of MongoDB, the data preparation capabilities of AWS Glue, and the machine learning prowess of Amazon Personalize to deliver a tailored and engaging user experience. As you embark on your journey to enhance user engagement, this integration offers a scalable and efficient solution for building recommendation systems in various domains.

Refer to the following links for further reading:
Writing an AWS Glue for Spark script
MongoDB Atlas database
MongoDB Community Forum
