Content-based Recommender System with Python
seniordatascientist
Posted on January 4, 2022
Recommender systems are methods that predict the interests of users and generate relevant recommendations of products or services for them. These can range from songs to play on Apple Music and movies to watch on a streaming service to articles to read in a news journal or products on Amazon.
Recommender systems are differentiated mainly by the type of data in use.
Whereas content-based recommenders rely on features of users and/or items, collaborative filtering uses information on the interactions between users and items, as recorded in the user-item matrix.
Recommender systems are generally divided into 3 main approaches:
- content-based recommendation engines
- collaborative filtering recommendation engines
- and hybrid recommendation systems
What are content-based recommender systems?
Content-based recommenders produce recommendations using the features or attributes of items and/or users.
User attributes can include age, sex, job and other personal information. Item attributes, in contrast, are descriptive properties that distinguish items from each other.
Example features for movies would be title, cast, description, genre and others.
Content-based methods, through their reliance on features, are similar to traditional machine learning models, which are often feature-based.
One of the inherent advantages of content-based recommenders is that they have a certain degree of user independence. To generate recommendations for a user, they do not need information about other users, as CF (collaborative filtering) methods do.
The content-based approach is thus easier to scale. Explainability of AI models has become very important in recent years, and a whole field, known as explainable AI (XAI), has grown out of efforts in this area.
There are many nice libraries available to help explain AI predictions; personally, I like SHAP and LIME.
Content-based methods are better with respect to explainability, as their recommendations are easier to explain than those of collaborative filtering.
CF methods also offer some explainability, though. For example, the CF library https://github.com/benfred/implicit, which I used a lot in past projects, provides the method model.explain for this.
Returning to the content-based approach, it also has its drawbacks. One of them is that it can over-specialize: if the user is only interested in specific categories, the recommender will have difficulty recommending items outside of them. This can keep the user confined to items similar to those they already know.
I will now build an example of a content-based recommender in Python, using the MovieLens data.
Content-based recommender system for recommendation of movies
Our recommender system will be able to recommend movies to us.
First, we import the libraries:
import pandas as pd
import ast
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
Next, we get our data set from https://www.kaggle.com/rounakbanik/the-movies-dataset and https://grouplens.org/datasets/movielens/latest/:
df_data = pd.read_csv('movies_metadata.csv', low_memory=False)
As part of pre-processing, we remove movies that have a low number of votes:
df_data = df_data[df_data['vote_count'].notna()]
plt.figure(figsize=(20,5))
# distplot is deprecated in recent seaborn releases; histplot draws the histogram
sns.histplot(df_data['vote_count'])
plt.title("Histogram of vote counts")
# determine the minimum number of votes that the movie must have to be included
min_votes = np.percentile(df_data['vote_count'].values, 85)
# exclude movies that do not have minimum number of votes
df = df_data.copy(deep=True).loc[df_data['vote_count'] > min_votes]
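To make the effect of the percentile cut-off more concrete, here is a minimal sketch on ten made-up vote counts (the numbers are invented for illustration and are not taken from movies_metadata.csv). It applies the same 85th-percentile rule and shows that only the most-voted entries survive.

```python
import numpy as np

# Hypothetical vote counts for ten movies (illustrative values only)
vote_counts = np.array([1, 2, 3, 5, 8, 13, 21, 34, 55, 89])

# 85th percentile of the vote counts (NumPy interpolates linearly by default)
threshold = np.percentile(vote_counts, 85)

# keep only the movies with strictly more votes than the threshold
kept_positions = np.flatnonzero(vote_counts > threshold).tolist()

print(threshold)
print(kept_positions)
```

Because the comparison is strict, a movie whose vote count equals the threshold exactly is dropped as well.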
Our content-based recommender will have the goal of recommending movies that have a similar plot to a selected movie.
We will use the “overview” feature from our dataset:
# removing rows with missing overview
df = df[df['overview'].notna()]
df.reset_index(inplace=True)
# processing of overviews
def process_text(text):
    # replace multiple spaces with one
    text = ' '.join(text.split())
    # lowercase
    text = text.lower()
    return text
df['overview'] = df['overview'].apply(process_text)
To compare movie plots, we first need to compute their vector representations. Various methods are available, from bag of words and word embeddings to TF-IDF; we will use TF-IDF.
TF-IDF approach
The TF-IDF of a word in a document that is part of a larger corpus is a combination of two values. One is the term frequency (TF), which measures how frequently the word occurs in the document.
However, some words, such as “the” and “is”, occur frequently in all documents, and we want to reduce their importance. This is done by multiplying the term frequency by the inverse document frequency (IDF), which is lower the more documents contain the word.
In this way, a word is considered relevant for a document only if it is frequent in that document but rare in the rest of the corpus.
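The idea can be sketched in a few lines of plain Python. The snippet below is an illustration under stated assumptions, not the formula used later in this article: it uses the textbook variant, relative term frequency times log(N / document frequency), on three invented one-line "plots". scikit-learn's TfidfVectorizer applies a smoothed IDF and normalizes each row by default, so its numbers will differ.

```python
import math
from collections import Counter

# Toy corpus: three invented one-line "plots"
corpus = [
    "the hero saves the city",
    "the villain attacks the city",
    "the detective solves the mystery",
]
docs = [doc.split() for doc in corpus]
n_docs = len(docs)

# document frequency: in how many documents each word appears
doc_freq = Counter(word for doc in docs for word in set(doc))

def tf_idf(word, doc):
    """Textbook TF-IDF: relative term frequency times log(N / df)."""
    tf = doc.count(word) / len(doc)
    idf = math.log(n_docs / doc_freq[word])
    return tf * idf

# score every distinct word of the first document
scores = {word: tf_idf(word, docs[0]) for word in set(docs[0])}
for word, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{word:6s} {score:.3f}")
```

In this toy corpus, "the" appears in every document, so its IDF is log(1) = 0 and it contributes nothing, while "hero", which appears in only one document, scores highest.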
To build the TF-IDF representation of the movie plots, we will use the TfidfVectorizer from scikit-learn. We fit the TfidfVectorizer on our data set of movie plot descriptions and transform the plots into their TF-IDF numerical representation in a single step:
tf_idf = TfidfVectorizer(stop_words='english')
tf_idf_matrix = tf_idf.fit_transform(df['overview'])
We can now compute similarity of movies by calculating their pair-wise cosine similarities and storing them in cosine similarity matrix:
# calculating cosine similarity between movies
cosine_similarity_matrix = cosine_similarity(tf_idf_matrix, tf_idf_matrix)
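For readers who want to see what a single entry of that matrix means, here is a small hand-rolled sketch of cosine similarity. The three vectors are hypothetical weights over a three-word vocabulary, chosen only to illustrate the two extremes; the helper name cosine is not part of scikit-learn.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical TF-IDF-like weights over a 3-word vocabulary
movie_a = [1.0, 2.0, 0.0]
movie_b = [2.0, 4.0, 0.0]  # same direction as movie_a, twice as long
movie_c = [0.0, 0.0, 3.0]  # shares no words with movie_a

sim_ab = cosine(movie_a, movie_b)  # close to 1: same mix of words
sim_ac = cosine(movie_a, movie_c)  # 0: no words in common
print(sim_ab, sim_ac)
```

Cosine similarity depends only on the direction of the vectors, not on their length, so a long overview and a short one that use the same words in the same proportions still count as very similar.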
With cosine similarity matrix computed, we can define the function “recommendations” that will return top recommendations for a given movie:
# function that returns the index of the movie from its title
def index_from_title(df, title):
    return df[df['original_title'] == title].index.values[0]
# function that returns the title of the movie from its index
def title_from_index(df, index):
    return df[df.index == index].original_title.values[0]
# generating recommendations for given title
def recommendations(original_title, df, cosine_similarity_matrix, number_of_recommendations):
    index = index_from_title(df, original_title)
    similarity_scores = list(enumerate(cosine_similarity_matrix[index]))
    similarity_scores_sorted = sorted(similarity_scores, key=lambda x: x[1], reverse=True)
    # skip the first entry, which is the movie itself
    recommendations_indices = [t[0] for t in similarity_scores_sorted[1:(number_of_recommendations+1)]]
    return df['original_title'].iloc[recommendations_indices]
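The ranking step can also be written with NumPy's argsort instead of sorting a Python list of pairs. The following is a sketch of one possible alternative, not the code used in this article: the helper top_k_similar and the 4x4 matrix are made up for illustration. It removes the query movie explicitly rather than assuming it is the first entry after sorting.

```python
import numpy as np

def top_k_similar(similarity_matrix, index, k):
    """Return positions of the k rows most similar to `index`, excluding itself."""
    scores = similarity_matrix[index]
    # positions sorted by descending score; a stable sort keeps ties in original order
    order = np.argsort(-scores, kind="stable")
    # drop the query item explicitly instead of relying on it being first
    order = order[order != index]
    return order[:k].tolist()

# Hypothetical symmetric similarity matrix for four movies
toy_similarity = np.array([
    [1.0, 0.8, 0.1, 0.5],
    [0.8, 1.0, 0.2, 0.4],
    [0.1, 0.2, 1.0, 0.3],
    [0.5, 0.4, 0.3, 1.0],
])

print(top_k_similar(toy_similarity, 0, 2))
```

The explicit exclusion matters when two movies have identical overviews: both then have a similarity of 1 to the query, and the query is no longer guaranteed to come first in the sorted list.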
We can now produce our recommendation for a given film, e.g. ‘Batman’:
recommendations('Batman', df, cosine_similarity_matrix, 10)
3693 Batman Beyond: Return of the Joker
5962 The Dark Knight Rises
7379 Batman vs Dracula
5476 Batman: Under the Red Hood
6654 Batman: Mystery of the Batwoman
3911 Batman Begins
6334 Batman: The Dark Knight Returns, Part
1770 Batman & Robin
4725 The Dark Knight
709 Batman Returns
In the second article, we will build another content-based recommender.
Use cases of content-based recommenders
Content-based recommenders can be used for many different purposes, and we have used them on several platforms in the past. At online-stores.ai we built a content-based recommender that suggests stores similar to a given online store, using the product names as the relevant feature. Product names were transformed into vector representations using sentence embeddings. We were surprised by how well the recommender worked with this approach.
For another platform, trending-products.io, we built a content-based recommender that predicts, for a given trending product, which other trending products might also interest you. The key part here was using a product categorization API to classify the trending products into categories according to the Google Taxonomy. Classifying them manually would have taken far too much time, as more than 0.5 million trending products are covered.
These are only a few content-based recommender use cases; there are many others. What they share is the vectorization of features, followed by a search for the most similar vectors, usually with a library built for this purpose. We can recommend Spotify's annoy library, https://github.com/spotify/annoy, when you are dealing with millions of vectors with 100+ dimensions.