How I built a Song Recommendation System with Python, Scikit-Learn & Pandas

Kuvam Bhardwaj

Posted on March 5, 2022

Introduction

You must've seen Spotify recommending songs to you.
Or other platforms like YouTube & TikTok recommending videos based on your previous viewing history.

While building an indie project of mine, MeTime🎶 (an ad-free music streaming platform ...coming soon 😉), I wanted to suggest new songs to users & help them discover new sounds. So I decided to build a recommendation system that takes in their liked songs & gives out similar tracks as suggestions.

Here, I'll be talking about how I built it using Python and some ML magic with scikit-learn & pandas.
Let's get started!

The Dataset

Okay, so now that we've taken on the quest, we need a dataset that contains Spotify tracks along with their audio features (like danceability, acousticness, energy & positivity). I believe Kaggle is the best place to hunt for data.
A few keyword searches and we can already find plenty of datasets full of Spotify tracks:

[Image: Kaggle search results for "Spotify tracks"]

Going through the results, I found this dataset to be of use:

[Image: Kaggle dataset of Spotify tracks]

The best thing about this dataset is that it contains every attribute we care about: popularity, energy, danceability, valence (positivity) & instrumentalness. We'll be using these features to split the dataset into several clusters & determine which cluster most of a user's favorite tracks lie in.

NOTE: You can choose basically any dataset, big or small, as long as it has all the necessary audio features (popularity, acousticness, energy, valence, danceability & song id). That said, go for as big a dataset as you can: more data points give us a better basis for recommending songs & training our ML models.

Let's now open up a Jupyter notebook and import the downloaded dataset with pandas.

[Image: Jupyter notebook in VS Code, importing the dataset]
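In case that screenshot is hard to read, the gist of that cell is just a couple of lines (the CSV filename here is a placeholder for whatever file you downloaded from Kaggle):

import pandas as pd

# read the downloaded dataset (placeholder filename) into a data frame
tracks_df = pd.read_csv('./spotify_tracks.csv')

# peek at the first few rows
tracks_df.head()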

"What just happened?"

Here, we used pandas (a Python library to read, write, manipulate, visualize & query data) to read the dataset from the CSV file and load it into a pandas data frame.

Data frames in pandas give us the right functions or methods to perform the appropriate operations on the given data.

Clusterizing the Dataset

Now the question arises: "How are we going to cluster the dataset based on the tracks' audio features?" Here's where scikit-learn comes into play. sklearn has many import-and-use mathematical models which make things a lot easier when handling data & performing statistical operations on it. But before we even use sklearn (or scikit-learn), a newbie will naturally ask: "What the heck is that?"

Scikit-learn is a Python library that contains various readily usable modules that help in analyzing, predicting & visualizing data.

sklearn has a module called cluster which contains the KMeans model, which helps us do exactly the thing we require... divide data into a certain number of groups. (You can read more about how the KMeans algorithm works here.)

Generally speaking, before performing any statistical operation on data, it's good practice to first look at the correlations among its features & plot graphs between them, so we can spot any natural patterns developing within the data. For that, we use our friend pandas to plot every feature against every other feature.

[Image: code for plotting the features]

In the 3rd code block of the notebook, all I did was reduce this big ol' dataset to 1000 data points, or "slice" it (similar to slicing a list), pick out only the columns or features I wanted to plot (you can pick specific columns of a pandas data frame with this syntax) and pass it to the function:



pd.plotting.scatter_matrix(frame, figsize=(width, height))



The reason I reduced the dataset is that performing the scatter_matrix operation on the whole thing would have taken 5-7 minutes to process the hundreds of thousands of tracks in the data frame & plot their features graphically.
Also, I used the figsize param to scale the graph, providing the width & height in a Python tuple (so that I could get the plot in 1 screenshot 😅).
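Roughly, that cell looks like this (a sketch: the column names follow the features mentioned above and may differ slightly from the dataset's actual column names, and the figsize values are just placeholders):

# keep only the audio features we care about & the first 1000 rows
frame = tracks_df[['popularity', 'acousticness', 'danceability', 'energy', 'valence', 'instrumentalness']][:1000]

# plot every feature against every other feature
pd.plotting.scatter_matrix(frame, figsize=(15, 15))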

[Image: scatter matrix of the feature correlations, plotted by pandas with matplotlib]

Having looked at the plots, we can see quite good correlations between features like danceability, valence & energy. To confirm this, let's build a numerical table to see the correlations in numbers, using pandas' built-in methods:

[Image: correlation table from the data frame's built-in method]
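That table is just the data frame's built-in .corr() method, called here on the same frame variable as above:

# pairwise correlation coefficients between the selected features
frame.corr()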

Now, coming to our specific case where we want to create groups in the data: looking at the correlations visually & numerically, no obvious natural groups are forming. Thus, we will have to test & tinker a little with how many clusters to create; too few and we end up lumping multiple genres into 1 cluster, too many and we split tracks of the same genre across clusters.
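One common way to guide that tinkering (not the only way, just a handy one) is to plot the KMeans inertia for a range of cluster counts and look for an "elbow" where adding more clusters stops helping much. A minimal sketch, reusing the frame variable from above:

from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

inertias = []
for k in range(2, 11):
    model = KMeans(n_clusters=k, random_state=42).fit(frame)
    inertias.append(model.inertia_)

# the "elbow" of this curve is a reasonable cluster count
plt.plot(range(2, 11), inertias, marker='o')
plt.xlabel('number of clusters')
plt.ylabel('inertia')
plt.show()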

Let's see how we can do this in Python syntax...
[Image: Python code to cluster the tracks]

Here, I picked out the main features that heavily impact what type of song a track is (and hence what cluster it belongs to) & stored the new data frame in the tracks variable.

Then I initialized the KMeans model & stored it in the corresponding variable. After that, I "fitted" or "trained" the model using the .fit() method, passing in the data we want the model to "train" on.

By "training" or "fitting" I mean that the KMeans model will determine in what cluster each track with its particular audio features SHOULD lie in & assign cluster numbers to it

After fitting on the dataset, the model stores the cluster numbers (a list of integers ranging from 0 to 4 in this case) in an attribute called labels_, arranged so that each entry corresponds to a track in the data frame.
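In other words, that block boils down to something like this (a sketch: the exact feature columns are my assumption based on the attributes listed earlier, the kmeans variable name is just what I'll reuse below, and 5 clusters matches the 0-to-4 labels):

from sklearn.cluster import KMeans

# the main audio features that decide what "type" of song a track is
tracks = tracks_df[['popularity', 'acousticness', 'danceability', 'energy', 'valence', 'instrumentalness']]

# 5 clusters -> cluster numbers 0 to 4
kmeans = KMeans(n_clusters=5)
kmeans.fit(tracks)

# one cluster number per track, in the same order as the data frame
print(kmeans.labels_)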

Giving out recommendations

Now, coming to the last act of our quest. What we're going to do is: save the labels_ generated above as a new column, one value per track, write the dataset out to a separate CSV file, and then find out which cluster number occurs most often in a user's playlist. That tells us what type of songs the user likes the most, so we can give them tracks with the same cluster number, or you could say of the same "type".

In pandas, we add a column to a data frame with the following syntax:



dataframe[column_name] = an_array_with_same_number_of_rows



[Image: pandas data frame with the new "type" column]
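In our case, that's one line (assuming the kmeans variable from the sketch above):

# store each track's cluster number in a new "type" column
tracks_df['type'] = kmeans.labels_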

After adding it, it's time to save the newly altered data frame to a CSV file using the .to_csv() method of our convenient pandas data frame:



tracks_df.to_csv('../result.csv')



Let's now create a Python program that takes in the IDs of a user's favorite tracks and suggests 'em some nice songs!



# CONSTRUCT YOUR OWN LOGIC!
import pandas as pd

tracks = pd.read_csv('./result.csv')

ids = input('Enter comma-separated ids of your favorite songs\n> ').strip().split(',')
# sample input: 1xK1Gg9SxG8fy2Ya373oqb,1xQ6trAsedVPCdbtDAmk0c,7ytR5pFWmSjzHJIeQkgog4,079Ey5uxL04AKPQgVQwx5h,0lizgQ7Qw35od7CYaoMBZb,7r9ZhitdQBONTFOiJW5mr8,3ee8Jmje8o58CHK66QrVC2,3ZG8N7aWw2meb6UrI5ZmnZ,5cpJFiNwYyWwFLH0V6B3N8,26w9NTiE9NGjW1ZvIOd1So,7BIy3EGQhg98CsRdKYHnJC,2374M0fQpWi3dLnB54qaLX,2IVsRhKrx8hlQBOWy4qebo,40riOy7x9W7GXjyGp4pjAv,4evmHXcjt3bTUHD1cvny97,0MF5QHFzTUM2dYm6J7Vngt,0TrPqhAMoaKUFLR7iYDokf,07KXEDMj78x68D884wgVEm,6gxKUmycQX7uyMwJcweFjp

# search the specified ids in this dataset and get the tracks
favorites = tracks[tracks.id.isin(ids)]

# count how many times each cluster number occurs among the user's favorite tracks
cluster_numbers = list(favorites['type'])
clusters = {}
for num in cluster_numbers:
  clusters[num] = cluster_numbers.count(num)

# sort the cluster numbers by how often they occur and pick the most frequent one
user_favorite_cluster = sorted(clusters.items(), key=lambda item: item[1], reverse=True)[0][0]

print('\nFavorite cluster:', user_favorite_cluster, '\n')

# finally get the tracks of that cluster
suggestions = tracks[tracks.type == user_favorite_cluster]

# now print the first 5 rows of the data frame having that cluster number as their type
print(suggestions.head())



You'll want to make some improvements to the above code, and I knowingly left room for a few. But there's one big problem with this mechanism: it will also recommend non-hits from the 1950s or 1960s, and I think no one would be willing to hear those over a popular song like "Despacito" (still love that song). So it would be a good move to filter tracks by popularity, say > 60 or 70, the moment we read the dataset. Something like this:



...
tracks = tracks[tracks.popularity > 70]
...



And... that was it! This was the bare minimum code to suggest similar tracks to users. You can now host a full-fledged Flask server that provides recommendations by taking in the user's liked track IDs, or make a CLI tool for devs to discover new songs from Spotify.
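For example, a tiny Flask server around this logic could look roughly like the sketch below (the /recommendations endpoint and request shape are made up for illustration, not MeTime's actual API):

from flask import Flask, request, jsonify
import pandas as pd

app = Flask(__name__)

# load the clustered tracks once at startup & keep only popular songs
tracks = pd.read_csv('./result.csv')
tracks = tracks[tracks.popularity > 70]

@app.route('/recommendations', methods=['POST'])
def recommendations():
    ids = request.json['ids']  # list of the user's favorite track ids
    favorites = tracks[tracks.id.isin(ids)]

    # most common cluster number ("type") among the user's favorites
    favorite_cluster = favorites['type'].mode()[0]

    # return the ids of the first 5 tracks from that cluster
    suggestions = tracks[tracks.type == favorite_cluster]
    return jsonify(list(suggestions.head(5).id))

if __name__ == '__main__':
    app.run()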

Integrate Spotify API & away you go!


Meet you another day with another post ;)


I hope you liked this post, Any feedback is much appreciated!
If you like my posts, consider following :)

Twitter -> @BhardwajKuvam

Github -> @kuvamdazeus

LinkedIn -> @kuvambhardwaj

Portfolio -> kuvambhardwaj.vercel.app
