Why you should explore your data before feeding Amazon Personalize


Christian Bonzelet

Posted on October 20, 2021


Alexa...set a timer for 15 minutes. ⏳


In my previous blog post, I showed you how to automate the provisioning of SageMaker notebook instances. Let us now use this notebook instance for data exploration and data analysis as part of the Amazon Personalize Kickstart project.

The goal of this project is to give you a kickstart for your personalization journey when building a recommendation engine based on Amazon Personalize. It serves as a reference implementation from which you can learn both the concepts and the integration aspects of Amazon Personalize.

🕵️ Data exploration is an essential part of your machine learning development process

Before you just import your historical data, it is recommended to gather knowledge about both your data and your business domain. Every recommendation engine project is unique in terms of the data we have to process and the way the business works. During a proof-of-concept phase, the very first step is all about finding answers to:

  • What data can we use?
  • What data do we need?
  • Is our data quality sufficient?
  • How do we access the required data?
  • How do we identify the users, interactions, and items we want to recommend?

Collaborative sessions with subject matter experts help us build an optimal solution within the given circumstances. Making decisions is easy; making the right decision is the challenge. In my opinion, data exploration is one of the most important parts of your machine learning development process.

To put it a bit more drastically: without data analysis and exploration, you can only do the right thing by accident.

🏁 What do we want to achieve?

We want to build a recommendation engine covering all features of Amazon Personalize. The dataset we will use is the publicly available MovieLens dataset.

GroupLens Research has collected and made available rating data sets from the MovieLens web site (https://movielens.org). The data sets were collected over various periods of time, depending on the size of the set. Before using these data sets, please review their README files for the usage licenses and other details.

Source: https://grouplens.org/datasets/movielens/

The MovieLens dataset contains 25 million movie ratings and a rich set of movie metadata. We will use this data to provide an initial version of our recommendation engine based on historical data.

My goal is not to reinvent the wheel, but to bring the relevant analyses together in one place so we can judge whether our data is fit to be used for a recommendation engine based on Amazon Personalize.

These analyses are inspired both by my personal experience and by a lot of great work from the open source community.

📊 From data to answers

Before you start with your analysis, it is recommended to define some key questions you would like to answer. You can then use the insights and knowledge you gain to discuss them with subject matter experts.

Well, in our kickstart project there are unfortunately no subject matter experts available right now. But let us start with what we have: 🤖 data and a 📖 README!

By analyzing the MovieLens datasets, we want to answer some very specific questions about our movie business:

  • What are the top 10 most rated movies?
  • Are ratings in general more positive or negative?
  • Is there a correlation between genres?

So let us get started and dive into our datasets. 🤿

🗺 Data exploration samples

For a complete overview of all analysis results, please check the complete Jupyter notebook on GitHub.

Before we start, let us do some basic setup: import libraries, download the sample data, and load it into dataframes.

from datetime import datetime
import pandas as pd

# Download and extract the MovieLens "latest-small" dataset
data_dir = "movielens"
dataset_dir = data_dir + "/ml-latest-small/"
!mkdir -p $data_dir

!cd $data_dir && wget http://files.grouplens.org/datasets/movielens/ml-latest-small.zip
!cd $data_dir && unzip ml-latest-small.zip

# Load ratings and movie metadata, then join them on the movie id
raw_ratings = pd.read_csv(dataset_dir + '/ratings.csv')
raw_movies = pd.read_csv(dataset_dir + '/movies.csv')
movie_rating = pd.merge(raw_ratings, raw_movies, how="left", on="movieId")
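
To get a first feeling for the merged data, a quick sanity check like the following helps before starting any analysis:

# Inspect the merged dataframe: columns, dtypes, and a few sample rows
movie_rating.info()
movie_rating.head()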

What are the top 10 most rated movies?

We want to know which movies are rated most often in our system. We use the merged dataframe of movies and ratings, group it by title, and sort by the number of ratings per movie to get the top 10.

# Count the ratings per title and plot the 10 most rated movies
top_ten_movies = movie_rating.groupby("title").size().sort_values(ascending=False)[:10]
top_ten_movies.plot(kind="barh")

Most rated movies
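
Note that this chart ranks movies by the number of ratings, not by their average score. If you are interested in the best rated movies instead, a small sketch like the following helps (the minimum of 100 ratings is an arbitrary threshold to filter out movies only a handful of users rated):

# Best rated movies by mean rating, requiring at least 100 ratings
# so a single enthusiastic user cannot push a movie to the top
rating_stats = movie_rating.groupby("title")["rating"].agg(["mean", "count"])
best_rated = rating_stats[rating_stats["count"] >= 100].sort_values("mean", ascending=False)[:10]
print(best_rated)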

If we build our recommender system based on ratings, we have to check whether we have some bias in our data. It could happen that frequently rated movies end up being recommended more often than rarely rated ones. This is something to discuss with subject matter experts to set clear expectations.
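
To make that gut feeling measurable, a small sketch like this computes how strongly the ratings concentrate on the most rated titles:

# Share of all ratings that fall on the 10 most rated titles --
# a rough indicator for popularity bias in the data
ratings_per_movie = movie_rating.groupby("title").size().sort_values(ascending=False)
top_ten_share = ratings_per_movie[:10].sum() / ratings_per_movie.sum()
print(f"The 10 most rated movies account for {top_ten_share:.1%} of all ratings")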

Are ratings in general more positive or negative?

We want to know more about the distribution of ratings. Our hypothesis is that recommending low-rated movies might not be a good user experience. On the other hand, we should not be too aggressive, as ignoring those low-rated movies entirely can lead to biased recommendations. Maybe there are users who are still interested in low-rated movies because they fit their favorite genre. Who knows?

Let us first visualize the distribution of all ratings. In a next step, we will categorize ratings of 3.0 and below as negative and all higher ratings as positive.

# Plot how often each rating value occurs
raw_ratings['rating'].value_counts().sort_index().plot(kind='barh')

Distribution of ratings

We now map each rating greater than 3.0 to a positive sentiment and all other ratings to a negative sentiment.

# Label each rating as positive (> 3.0) or negative (<= 3.0) and plot the split
rating_sentiment = raw_ratings.copy()
rating_sentiment["sentiment"] = rating_sentiment["rating"].map(lambda x: "positive" if x > 3.0 else "negative")
rating_sentiment['sentiment'].value_counts().plot(kind='barh')

Rating sentiment

This gives us an idea that the majority of ratings are positive.
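
If you want the exact split instead of eyeballing the bar chart, value_counts with normalize=True gives the shares directly:

# Relative share of positive vs. negative ratings
print(rating_sentiment["sentiment"].value_counts(normalize=True))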

How many movies are released per year?

The release year is encoded in each movie title (e.g. "Toy Story (1995)"), so we extract it with a regular expression before counting the movies per year.

# Extract the release year from the title, e.g. "Toy Story (1995)" -> 1995
movies = raw_movies.copy()
movies['release_year'] = movies['title'].str.extract(r'\((\d{4})\)', expand=False)
movies = movies.dropna(axis=0)
movies['release_year'] = movies['release_year'].astype('int64')
# Strip the year suffix from the title
movies['title'] = movies['title'].str.extract(r'(.*?)\s*\(', expand=False)

# Count movies per release year and plot the trend
movie_year = pd.DataFrame(movies['title'].groupby(movies['release_year']).count())
movie_year.reset_index(inplace=True)
movie_year.plot(x="release_year", y="title", legend=False, xlabel="Release year", ylabel="Number of movies", figsize=(12, 6));

Released movies per year

Release dates range from 1902 to 2018. From around 1980, the number of released movies seems to increase more strongly. There is an interesting drop in releases around 2012. In 2018, nearly the same number of movies was released as at the end of the 70s.

If there were subject matter experts in place, this analysis might result in some very interesting questions to better understand the drivers of both the increase in the 80s and the drop after 2010.
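
One way to sanity-check these observations is to aggregate the releases per decade, a small sketch on top of the movies dataframe from above:

# Count released movies per decade to make the long-term trend easier to read
movies["decade"] = (movies["release_year"] // 10) * 10
print(movies.groupby("decade")["title"].count())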

💡 Conclusions

Data exploration and the learnings you gain from your data put you in an excellent position. Ideally, you formulated the business problem you want to solve upfront and defined some relevant KPIs you want to improve. Based on your learnings, you can now dive deeper into what is possible in your situation. Challenge your KPI definitions or define additional hypotheses that will guide you on your journey.
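
Once your exploration convinces you that the data fits, the next step is to bring it into the shape Amazon Personalize expects. As a small preview (the actual import is covered by the kickstart project, and the output file name here is just an assumption), an interactions dataset needs at least USER_ID, ITEM_ID, and TIMESTAMP columns, which maps naturally onto the MovieLens ratings:

# Minimal sketch: rename the MovieLens rating columns to the column names
# Amazon Personalize requires for an interactions dataset. The MovieLens
# timestamp is already in Unix epoch seconds, as Personalize expects.
interactions = raw_ratings.rename(columns={
    "userId": "USER_ID",
    "movieId": "ITEM_ID",
    "timestamp": "TIMESTAMP",
})[["USER_ID", "ITEM_ID", "TIMESTAMP"]]
interactions.to_csv("interactions.csv", index=False)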


Alexa says time is up...see you next time.

I am not a data scientist and would never claim deep knowledge in this context. But I have to admit that I am getting a bit obsessed with all these things around data science, data exploration, analysis, and data-driven decisions. I observe a lot and try to be the sponge that soaks up everything in this area.

Hence I am really interested in your feedback, experience and thoughts in the comments. 👋


Cover Image by Andrew Neel - https://unsplash.com/photos/z55CR_d0ayg
