Movie recommender system using PostgreSQL and Apache AGE
Muhammad Awais Bin Adil
Posted on March 19, 2023
In this post, we will build a rudimentary movie recommender system on top of PostgreSQL and Apache AGE.
Data Preparation:
First, we need to prepare the data that we will use for building the movie recommendation system. We will use the MovieLens dataset, which is a popular dataset for building recommendation systems. The dataset contains ratings for movies given by users.
Here's the code to create a movies table and a ratings table in PostgreSQL to store the data:
CREATE TABLE movies (
    id INT PRIMARY KEY,
    title VARCHAR(255) NOT NULL,
    genres VARCHAR(255) NOT NULL
);
CREATE TABLE ratings (
    user_id INT NOT NULL,
    movie_id INT NOT NULL,
    rating FLOAT NOT NULL,
    timestamp BIGINT NOT NULL
);
We then load the MovieLens CSV files into these tables, for example with PostgreSQL's COPY command. The movies table holds each movie's title and genres, while the ratings table holds one row for every rating that a user gave a movie.
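As a rough illustration of the load step, the sketch below parses MovieLens-style CSV text into tuples whose order matches the columns of the movies and ratings tables. It assumes the header names used by the MovieLens ml-latest-small files (movieId,title,genres and userId,movieId,rating,timestamp). The sample rows and the helper names parse_movies and parse_ratings are made up for the example, and the actual INSERT or COPY into PostgreSQL is left out.

```python
import csv
import io

# Illustrative sample rows in the MovieLens CSV layout (not real data).
MOVIES_CSV = '''movieId,title,genres
1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
11,"American President, The (1995)",Comedy|Drama|Romance
'''

RATINGS_CSV = '''userId,movieId,rating,timestamp
1,1,4.0,964982703
1,11,3.5,964982931
'''


def parse_movies(text):
    """Return (id, title, genres) tuples in the column order of the movies table."""
    reader = csv.DictReader(io.StringIO(text))
    return [(int(row["movieId"]), row["title"], row["genres"]) for row in reader]


def parse_ratings(text):
    """Return (user_id, movie_id, rating, timestamp) tuples for the ratings table."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        (int(row["userId"]), int(row["movieId"]), float(row["rating"]), int(row["timestamp"]))
        for row in reader
    ]


movies = parse_movies(MOVIES_CSV)
ratings = parse_ratings(RATINGS_CSV)
```

Using the csv module rather than splitting on commas matters here because some titles contain a comma and are quoted in the file.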
Graph Creation:
Next, we need to create a graph using Apache AGE that will be used for building the recommendation system. We will use the movies and ratings tables to create the graph. Each movie and each user will be represented as a vertex, and each rating will be represented as an edge connecting a user vertex to a movie vertex.
Here's the code to create the graph in Apache AGE:
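-- Load the AGE extension and put its catalog on the search path
CREATE EXTENSION IF NOT EXISTS age;
LOAD 'age';
SET search_path = ag_catalog, "$user", public;
-- Create the graph and its vertex and edge labels
SELECT create_graph('movie_recommendations');
SELECT create_vlabel('movie_recommendations', 'movie');
SELECT create_vlabel('movie_recommendations', 'user');
SELECT create_elabel('movie_recommendations', 'rating');
-- Load the vertices, then the rating edges (the file paths are placeholders)
SELECT load_labels_from_file('movie_recommendations', 'movie', '/path/to/movies.csv');
SELECT load_labels_from_file('movie_recommendations', 'user', '/path/to/users.csv');
SELECT load_edges_from_file('movie_recommendations', 'rating', '/path/to/rating_edges.csv');
Apache AGE has no LOAD INTO statement, so this version uses the functions that AGE provides. create_graph creates the graph, create_vlabel and create_elabel create the vertex and edge labels, and load_labels_from_file and load_edges_from_file load vertices and edges from CSV files. The CSV files are assumed to have been exported from the movies and ratings tables (for example with COPY ... TO) in the layout AGE expects: an id column followed by the properties for vertices, and start_id, start_vertex_type, end_id, end_vertex_type followed by the properties for edges. Check the AGE documentation for the exact layout in your version.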
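To make the shape of the graph concrete, here is a minimal in-memory sketch of the same structure in Python, independent of Apache AGE. It assumes ratings arrive as (user_id, movie_id, rating) tuples; the function name build_graph and the sample ratings are hypothetical.

```python
from collections import defaultdict


def build_graph(ratings):
    """Build an undirected bipartite graph from (user_id, movie_id, rating) tuples.

    Vertices are keyed as ("user", id) or ("movie", id) so that a user and a
    movie that share the same numeric id remain distinct vertices.
    """
    adjacency = defaultdict(dict)
    for user_id, movie_id, rating in ratings:
        user = ("user", user_id)
        movie = ("movie", movie_id)
        # Store the rating on both directions of the edge.
        adjacency[user][movie] = rating
        adjacency[movie][user] = rating
    return dict(adjacency)


# Illustrative ratings: two users, two movies.
sample_ratings = [(1, 1, 4.0), (1, 2, 3.0), (2, 1, 5.0)]
graph = build_graph(sample_ratings)
```

Keeping the label in the vertex key mirrors what the graph labels do: user 1 and movie 1 are different vertices even though their ids collide.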
Feature Extraction:
Once we have created the graph, we can extract features from it that will be used for building the recommendation system. We will use the PageRank algorithm to calculate the importance of each movie in the graph.
Here's the code to calculate the PageRank for the movie vertices:
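-- Pseudocode: pagerank() and vertices() are assumed helper functions, not built into Apache AGE
SELECT id, pagerank(movie_recommendations, id) AS rank
FROM vertices(movie_recommendations)
WHERE label = 'movie'
ORDER BY rank DESC;
The pagerank function calculates the PageRank of each movie vertex, and the result is ordered by rank in descending order. Apache AGE does not document a built-in pagerank function or a vertices table function, so treat this query as pseudocode: the scores have to come from a user-defined function or from code that runs outside the database.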
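For readers who want to see what a PageRank computation involves, the following is a small power-iteration sketch in Python. It is not the post's pagerank function; it assumes the graph is given as a dictionary mapping each node to a list of neighbours, treats every rating edge as bidirectional, and uses the conventional damping factor of 0.85. The node names are hypothetical.

```python
def pagerank(adjacency, damping=0.85, iterations=50):
    """Power-iteration PageRank over {node: [neighbours]}; returns {node: score}.

    Every neighbour must itself be a key of `adjacency`.
    """
    nodes = list(adjacency)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing edges is spread over all nodes.
        dangling = sum(rank[node] for node in nodes if not adjacency[node])
        new_rank = {node: (1.0 - damping) / n + damping * dangling / n for node in nodes}
        for node in nodes:
            neighbours = adjacency[node]
            if neighbours:
                share = damping * rank[node] / len(neighbours)
                for neighbour in neighbours:
                    new_rank[neighbour] += share
        rank = new_rank
    return rank


# Illustrative graph: movie m1 was rated by three users, movie m2 by one.
example = {
    "u1": ["m1"],
    "u2": ["m1"],
    "u3": ["m1", "m2"],
    "m1": ["u1", "u2", "u3"],
    "m2": ["u3"],
}
scores = pagerank(example)
ranked_movies = sorted(["m1", "m2"], key=lambda m: scores[m], reverse=True)
```

On a user-movie graph like this one, the score mostly reflects how many users rated a movie, so it behaves like a popularity signal rather than a personalised one.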
Machine Learning:
Next, we will train a machine learning model that will be used for recommending movies to users. We will use a collaborative filtering algorithm, trained on the ratings data.
Here's the code to train the model:
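-- Tag each rating once so that every row lands in exactly one set
CREATE TABLE ratings_split AS
SELECT *, random() < 0.8 AS is_train FROM ratings;
-- Split the ratings data into training and testing sets
SELECT * INTO train_set FROM ratings_split WHERE is_train;
SELECT * INTO test_set FROM ratings_split WHERE NOT is_train;
-- Train the collaborative filtering model
-- (pseudocode: PostgreSQL and Apache AGE have no built-in CREATE MODEL statement)
CREATE MODEL cf_model
TRAIN('SELECT * FROM train_set',
'user_id',
'movie_id',
'rating');
The SELECT INTO command is used to split the ratings data into training and testing sets. The random draw is stored once per row in the is_train column; calling random() separately in two queries would let some rows fall into both sets and others into neither. The CREATE MODEL command stands for training the collaborative filtering model on the training set. It is not part of PostgreSQL or Apache AGE, so it must be provided by a machine learning extension or replaced with training code outside the database.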
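Since the post does not say which collaborative filtering algorithm the model uses, the sketch below swaps in one simple variant, item-based filtering with cosine similarity, to show what the split and the training step could look like. All function names, the seed, and the sample ratings are assumptions made for the example.

```python
import random
from collections import defaultdict
from math import sqrt


def split_ratings(ratings, train_fraction=0.8, seed=42):
    """Assign each row to exactly one of (train, test) using a single random draw."""
    rng = random.Random(seed)
    train, test = [], []
    for row in ratings:
        (train if rng.random() < train_fraction else test).append(row)
    return train, test


def train_item_similarity(ratings):
    """Return {(movie_a, movie_b): cosine similarity} from (user, movie, rating) rows."""
    by_movie = defaultdict(dict)  # movie_id -> {user_id: rating}
    for user_id, movie_id, rating in ratings:
        by_movie[movie_id][user_id] = rating
    norms = {m: sqrt(sum(r * r for r in users.values())) for m, users in by_movie.items()}
    sims = {}
    movie_ids = sorted(by_movie)
    for i, a in enumerate(movie_ids):
        for b in movie_ids[i + 1:]:
            common = by_movie[a].keys() & by_movie[b].keys()
            if not common:
                continue  # no user rated both movies
            dot = sum(by_movie[a][u] * by_movie[b][u] for u in common)
            sims[(a, b)] = sims[(b, a)] = dot / (norms[a] * norms[b])
    return sims


def predict(sims, user_ratings, movie_id):
    """Similarity-weighted average of the user's own ratings; None if nothing overlaps."""
    weighted = total = 0.0
    for other_movie, rating in user_ratings.items():
        s = sims.get((movie_id, other_movie), 0.0)
        weighted += s * rating
        total += s
    return weighted / total if total else None


# Illustrative (user_id, movie_id, rating) rows.
sample_ratings = [
    (1, 10, 5.0), (1, 20, 5.0),
    (2, 10, 4.0), (2, 20, 4.0), (2, 30, 1.0),
    (3, 10, 1.0), (3, 30, 5.0),
    (4, 10, 5.0),
]
train_rows, test_rows = split_ratings(sample_ratings)
similarities = train_item_similarity(train_rows)
```

The held-out test_rows can then be used to compare predicted and actual ratings, which is the reason for splitting in the first place.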
Recommendations and Real-time Monitoring:
Finally, we will use the trained model to make recommendations for a specific user. We will use the PREDICT function to predict the ratings for each movie for the user and return the top recommended movies.
Here's the code to generate movie recommendations for a user:
-- Get the top 10 recommended movies for user 1
-- Pseudocode: PREDICT stands for the scoring call of whatever trained cf_model
SELECT m.id, m.title,
       PREDICT(cf_model, user_id => 1, movie_id => m.id) AS prediction
FROM movies AS m
ORDER BY prediction DESC
LIMIT 10;
The PREDICT function predicts the rating that a specific user (here, the user with ID 1) would give each movie. The query reads from the movies table so that each prediction comes with the movie's title, then orders the result by prediction in descending order and keeps the top 10. Like CREATE MODEL, PREDICT is not a built-in PostgreSQL or Apache AGE function, so it must be supplied by a machine learning extension or replaced with application code.
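The ranking step itself is independent of how predictions are produced. As a hedged sketch, the Python function below takes any scoring callable and returns the top N movies for a user. The name top_n, the fake_predict stand-in, and its scores are hypothetical; skipping movies the user has already rated is an extra design choice that the SQL above does not make.

```python
import heapq


def top_n(predict, user_id, movie_ids, already_rated, n=10):
    """Return the n highest-scoring (movie_id, score) pairs for a user.

    Movies the user has already rated, and movies for which `predict`
    returns None, are skipped.
    """
    scored = []
    for movie_id in movie_ids:
        if movie_id in already_rated:
            continue
        score = predict(user_id, movie_id)
        if score is not None:
            scored.append((movie_id, score))
    return heapq.nlargest(n, scored, key=lambda pair: pair[1])


# Hypothetical stand-in for a trained model: a fixed table of predicted ratings.
fake_scores = {(1, 10): 4.5, (1, 20): 3.0, (1, 30): 4.9, (1, 40): 2.0}


def fake_predict(user_id, movie_id):
    return fake_scores.get((user_id, movie_id))


recommendations = top_n(fake_predict, 1, [10, 20, 30, 40, 50], already_rated={30}, n=2)
```

heapq.nlargest avoids sorting the full candidate list, which helps when the catalogue is much larger than the number of recommendations requested.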
Real-time monitoring involves constantly updating the recommendations as new ratings are added to the database. This can be achieved by setting up a pipeline that periodically re-trains the machine learning model with the new ratings data and updates the recommendations. Apache Airflow can be used to create such a pipeline.
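One run of such a pipeline can be sketched as a plain function that a scheduler calls periodically. The version below is only an illustration under stated assumptions: it uses the timestamp column as a watermark to decide whether new ratings have arrived, and it uses a per-movie mean rating as a stand-in trainer rather than collaborative filtering. The names run_refresh and mean_rating_model are hypothetical, and no Airflow API is used.

```python
from collections import defaultdict


def mean_rating_model(ratings):
    """Stand-in trainer: average rating per movie (not collaborative filtering)."""
    totals = defaultdict(lambda: [0.0, 0])
    for _user_id, movie_id, rating, _timestamp in ratings:
        totals[movie_id][0] += rating
        totals[movie_id][1] += 1
    return {movie_id: total / count for movie_id, (total, count) in totals.items()}


def run_refresh(all_ratings, last_seen_ts, train_fn):
    """One scheduled pipeline run.

    Retrains only when a rating newer than `last_seen_ts` exists.
    Returns (model or None, new watermark).
    """
    newest = max((ts for _, _, _, ts in all_ratings), default=last_seen_ts)
    if newest <= last_seen_ts:
        return None, last_seen_ts  # nothing new, keep the current model
    return train_fn(all_ratings), newest


# Illustrative (user_id, movie_id, rating, timestamp) rows.
ratings = [(1, 10, 4.0, 100), (2, 10, 2.0, 200)]
model, watermark = run_refresh(ratings, last_seen_ts=0, train_fn=mean_rating_model)
# A second run with no new ratings leaves the model untouched.
unchanged, watermark_again = run_refresh(ratings, watermark, mean_rating_model)
```

In a real deployment the scheduler would persist the watermark between runs and regenerate the stored recommendations whenever a new model is returned.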
We can also use other techniques, such as matrix factorization, deep learning, or hybrid recommender systems, to improve the accuracy of the recommendations, and use user feedback to refine them further.