Getting valuable insights from TMDb movies dataset.
Charles Maina Mwangi
Posted on June 8, 2022
Over the past month I have had the opportunity to take part in a Udacity Nanodegree, and I want to share the journey. I am an aspiring data analyst, and in this post I share my first project: investigating the TMDb movie dataset.
Project: Investigate the Movies Dataset
Introduction
This is the Movies dataset, downloaded from kaggle.com. It contains information about more than 10,000 movies collected from The Movie Database (TMDb), including the movie title, director, and cast, among other fields. Some columns, such as cast, keywords, and genres, contain multiple values separated by pipe (|) characters.
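To make the pipe-separated format concrete, here is a minimal sketch on a toy table. The titles and genres are made up and are not rows from the dataset. It uses pandas' `Series.str.get_dummies` to turn each pipe-separated string into one indicator column per value, which is one possible way to work with such columns.

```python
import pandas as pd

# Toy rows that mimic the pipe-separated 'genres' column (made-up values).
toy_df = pd.DataFrame({
    "original_title": ["Movie A", "Movie B", "Movie C"],
    "genres": ["Action|Adventure", "Drama", "Drama|Action"],
})

# One 0/1 indicator column per distinct genre found between the pipes.
genre_flags = toy_df["genres"].str.get_dummies(sep="|")

# Summing each indicator column gives how many movies carry that genre.
genre_totals = genre_flags.sum()
print(genre_totals)
```

The analysis in this post takes a different route (joining and splitting strings, then counting), but both approaches rest on the same idea of treating each value between the pipes as its own item.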
The following are the questions I ask myself for this analysis.
- Research Question 1 (Which movie genre is the most popular?)
- Research Question 2 (Who is the most common director for movies?)
- Research Question 3 (Which are the most common keywords?)
- Research Question 4 (Which production companies produce the most movies?)
- Research Question 5 (Which is the most common tagline?)
# Import the required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from collections import Counter
%matplotlib inline
Data Wrangling
This step involves loading the dataset, inspecting it, and cleaning it.
General Properties
# Load the dataset and preview the first five rows
movies_df = pd.read_csv("tmdb-movies.csv")
movies_df.head(5)
# Check number of rows and columns
movies_df.shape
(10866, 21)
# Describe the data
movies_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10866 entries, 0 to 10865
Data columns (total 21 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 10866 non-null int64
1 imdb_id 10856 non-null object
2 popularity 10866 non-null float64
3 budget 10866 non-null int64
4 revenue 10866 non-null int64
5 original_title 10866 non-null object
6 cast 10790 non-null object
7 homepage 2936 non-null object
8 director 10822 non-null object
9 tagline 8042 non-null object
10 keywords 9373 non-null object
11 overview 10862 non-null object
12 runtime 10866 non-null int64
13 genres 10843 non-null object
14 production_companies 9836 non-null object
15 release_date 10866 non-null object
16 vote_count 10866 non-null int64
17 vote_average 10866 non-null float64
18 release_year 10866 non-null int64
19 budget_adj 10866 non-null float64
20 revenue_adj 10866 non-null float64
dtypes: float64(4), int64(6), object(11)
memory usage: 1.7+ MB
# Check dataset datatypes
movies_df.dtypes
id int64
imdb_id object
popularity float64
budget int64
revenue int64
original_title object
cast object
homepage object
director object
tagline object
keywords object
overview object
runtime int64
genres object
production_companies object
release_date object
vote_count int64
vote_average float64
release_year int64
budget_adj float64
revenue_adj float64
dtype: object
Cleaning the data
Here, we first drop all duplicate rows.
Next, we count the null values in each column, drop every row that has a null in any of those columns, and then check again to confirm that no null or missing values remain. The homepage column alone is missing for 7,929 movies, so at most 2,936 rows can remain after this step.
# Drop all duplicate values.
movies_df.drop_duplicates(inplace=True)
# Check all null and missing values
movies_df.isnull().sum()
id 0
imdb_id 10
popularity 0
budget 0
revenue 0
original_title 0
cast 76
homepage 7929
director 44
tagline 2824
keywords 1493
overview 4
runtime 0
genres 23
production_companies 1030
release_date 0
vote_count 0
vote_average 0
release_year 0
budget_adj 0
revenue_adj 0
dtype: int64
# Drop every row that has a null in any of the columns with missing values
movies_df = movies_df.dropna(subset=['imdb_id','cast','homepage','director','tagline','keywords','overview','genres','production_companies'])
# Check again to confirm all the missing and null values have been dropped.
movies_df.isnull().sum()
id 0
imdb_id 0
popularity 0
budget 0
revenue 0
original_title 0
cast 0
homepage 0
director 0
tagline 0
keywords 0
overview 0
runtime 0
genres 0
production_companies 0
release_date 0
vote_count 0
vote_average 0
release_year 0
budget_adj 0
revenue_adj 0
dtype: int64
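One side effect of dropping nulls across all nine columns at once is that a movie is discarded even when its only missing field, for example homepage, is not needed for the question being asked. A possible alternative, sketched here on made-up rows, is to drop nulls only in the column a given question uses. The helper name `non_null_column` is hypothetical and is not part of the original analysis.

```python
import pandas as pd

# Made-up rows: only one movie has every field filled in.
toy_df = pd.DataFrame({
    "original_title": ["Movie A", "Movie B", "Movie C", "Movie D"],
    "genres": ["Drama", "Comedy|Drama", None, "Action"],
    "homepage": [None, "http://example.com", None, None],
    "tagline": ["A tagline", "Another tagline", None, None],
})


def non_null_column(df, column):
    """Return one column as a Series with its missing values removed."""
    return df[column].dropna()


# Dropping on every listed column keeps only fully populated rows.
strict = toy_df.dropna(subset=["genres", "homepage", "tagline"])

# Dropping per question keeps every row that has a genre.
genres_only = non_null_column(toy_df, "genres")

print(len(strict), len(genres_only))
```

The trade-off is that each question is then answered on a slightly different subset of movies, so the subset used should be stated alongside each result.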
Exploratory Data Analysis
Research Question 1 (Which movie genre is the most popular?)
# Function to count the top 15 most common items in the dataset
def counting(param):
    counter = Counter()
    for i in param:
        counter[i] += 1
    top15_most_common = counter.most_common(15)
    return top15_most_common
# Split the pipe-separated genres, count them, and plot the 15 most common
genres = "|".join(movies_df['genres']).split("|")
top15_most_common_genres = counting(genres)
genres_df = pd.DataFrame (top15_most_common_genres, columns = ['genres','Genres Count'])
genres_df.sort_values(ascending=True, by='Genres Count', inplace=True)
ax = genres_df.plot.barh(x='genres',color=["#FF69B4","#F0F8FF"],width=0.7,legend=False, figsize=(10, 10), title="A bar plot to find the most common genre of the movies")
ax.set_ylabel("Genres")
ax.set_xlabel("Genres counts")
Text(0.5, 0, 'Genres counts')
This bar chart shows the 15 most common genres in the dataset and how many movies fall into each. Drama is clearly the most common genre, followed by Comedy, Thriller, Action, and Adventure. Note that these counts show which genres are produced most often, which is not necessarily the same as which are watched most.
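The `counting` helper above builds its tally one item at a time. `collections.Counter` can also be built directly from an iterable, so the same kind of top-N list can be produced more compactly. Below is a minimal sketch; the function name `top_n` and the sample strings are made up for illustration.

```python
from collections import Counter


def top_n(values, n=15):
    """Return the n most frequent items as (item, count) pairs."""
    return Counter(values).most_common(n)


# Made-up genre strings in the same pipe-separated form as the dataset.
sample_rows = ["Drama|Romance", "Comedy", "Drama", "Comedy|Drama"]

# Same join-then-split step as in the genre analysis.
sample_genres = "|".join(sample_rows).split("|")

top_two = top_n(sample_genres, n=2)
print(top_two)
```

Making `n` a parameter also means the same helper could serve a top 10 or top 5 view without being edited.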
Research Question 2 (Who is the most common director for movies?)
director = "|".join(movies_df['director']).split("|")
top15_most_common_directors = counting(director)
director_df = pd.DataFrame (top15_most_common_directors, columns = ['director','Directors'])
director_df.sort_values(ascending=False, by='Directors', inplace=True)
ax = director_df.plot.bar(x='director',color=["#746AB0"],width=0.7,legend=False, figsize=(10, 10), title="A bar plot to find the director who directed the most movies.")
ax.set(xlabel="Directors", ylabel="Directors counts")
[Text(0.5, 0, 'Directors'), Text(0, 0.5, 'Directors counts')]
For the second research question, I investigated which director had directed the most movies in this dataset by counting how often each director appears. I plotted the top 15, as shown above. John Carpenter was the most common director, meaning he directed the most movies in the dataset.
Research Question 3 (Which are the most common keywords?)
keywords = "|".join(movies_df['keywords']).split("|")
top15_most_common_keywords = counting(keywords)
keywords_df = pd.DataFrame (top15_most_common_keywords, columns = ['keywords','Keywords'])
keywords_df.sort_values(ascending=True, by='Keywords', inplace=True)
keywords_df.plot.barh(x='keywords',color=["#FFCE30"],width=0.7,legend=False, figsize=(10, 10),xlabel="Keywords", ylabel="keyword counts", title="A bar plot of the most common movie keywords")
<AxesSubplot:title={'center':'A bar plot of the most common movie keywords'}, ylabel='Keywords'>
Next, I looked at which keyword is attached to the most movies by counting the 15 most common keywords in the dataset. I found that "based on the novel" was the most common keyword.
Research Question 4 (Which production companies produce the most movies?)
movies_df['production_companies'].value_counts()[:15].plot(kind='pie',figsize=(10,10))
<AxesSubplot:ylabel='production_companies'>
I investigated which production company produces the most movies by plotting a pie chart of the top 15 production companies by count. I found that Walt Disney Pictures produced the most movies.
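Note that `value_counts()` on the raw column treats each full string, such as Walt Disney Pictures|Pixar Animation Studios, as its own category, so a studio's co-productions are not added to its solo total. If per-company totals are wanted, one option is to split the strings and give each company its own row before counting. The sketch below uses made-up studio names rather than values from the dataset.

```python
import pandas as pd

# Made-up rows in the same pipe-separated form as 'production_companies'.
toy_companies = pd.Series([
    "Studio A",
    "Studio A|Studio B",
    "Studio B|Studio C",
    "Studio A",
])

# Counting raw strings: each combination is a separate category.
raw_counts = toy_companies.value_counts()

# Split on the pipe, give each company its own row, then count.
per_company = toy_companies.str.split("|").explode().value_counts()

print(raw_counts["Studio A"], per_company["Studio A"])
```

Which version is appropriate depends on the question: raw strings answer "which exact production line-up is most common", while the exploded version answers "which company is involved in the most movies".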
Research Question 5 (Which is the most common tagline?)
movies_df['tagline'].value_counts()[:10].plot(kind='pie',figsize=(10,10))
<AxesSubplot:ylabel='tagline'>
# Budget distribution for long vs. short movies (100-minute cutoff)
plt.hist(movies_df.budget[movies_df.runtime >= 100], bins=25, facecolor="b", alpha=1, label="long movies")
plt.hist(movies_df.budget[movies_df.runtime < 100], bins=25, facecolor="g", alpha=0.75, label="short movies")
plt.legend()
plt.xlabel('Budget')
plt.ylabel('Number of movies')
plt.title('Budget distribution of long and short movies')
plt.grid(True)
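A histogram shows the shape of each distribution, but a numeric summary can make the comparison easier to read. The sketch below groups made-up rows into long and short movies with a 100-minute cutoff and compares their median budgets. Treating a budget of 0 as "not reported" is my assumption here and should be checked against the data before relying on it.

```python
import pandas as pd

# Made-up rows with the same column names as the dataset.
toy_df = pd.DataFrame({
    "runtime": [90, 95, 120, 130, 150, 80],
    "budget": [1_000_000, 0, 50_000_000, 30_000_000, 0, 3_000_000],
})

# Assumption: a budget of 0 means "not reported", so exclude those rows.
reported = toy_df[toy_df["budget"] > 0].copy()

# Label each movie as long or short using a 100-minute cutoff.
reported["length"] = reported["runtime"].apply(
    lambda minutes: "long" if minutes >= 100 else "short"
)

# The median is less sensitive to a few very large budgets than the mean.
median_budget = reported.groupby("length")["budget"].median()
print(median_budget)
```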
Conclusions
In the first section, I analyzed the most common movie genres. From the analysis, I can conclude that the top 5 genres were Drama, Comedy, Thriller, Action, and Adventure. In the second section, the director who directed the most movies was John Carpenter, followed by Steven Spielberg, Steven Soderbergh, Ridley Scott, and Robert Zemeckis. These were the top 5 most common directors. In the third section, I wanted to determine which keywords are most often attached to movies. The top 5 keywords were woman director, duringcreditsstinger, independent film, biography, and christmas. Further, it was important to determine which companies produce the most movies. The top five production companies are Universal Pictures, DreamWorks Animation, Marvel Studios, Walt Disney Pictures|Pixar Animation Studios, and Walt Disney Pictures. Finally, I was able to determine that the most common tagline in this analysis is Two films One love.
Limitations
In the first section, there were too many genres to analyze them all meaningfully, so I restricted the analysis to the 15 most common genres. The same applies to the directors and keywords sections, where I analyzed only the top 15 directors and keywords, respectively. The pie plot of production companies also had many categories, with slices of nearly the same size, so I plotted only the first 15 production companies to show the relationship.
References.
https://www.kaggle.com/code/danielleho/exploratory-analysis-netflix-shows-and-movies