Investigating the TMDB movie dataset
nkpremices
Posted on March 11, 2020
Lately, I've been going through Udacity's Data Analyst Nanodegree program. I worked on some projects there, and I will be writing blog posts about them in the coming weeks.
Note:
This blog post is the first part of a series in which I walk through the analysis of a complete dataset. The aim is to showcase how simple data analysis can be.
Introduction
About the dataset
The dataset is called TMDB movie data. Downloaded from this page, its original version was removed by Kaggle and replaced with a similar set of movies and data fields from The Movie Database (TMDb). It contains basic information on more than 5000 movies, including user ratings and revenue data.
The success of a movie is evaluated by its popularity, vote average score (ratings) and revenue. Several fields can affect that success, for example the Budget, Cast, Director, Tagline, Keywords, Runtime, Genres, Production Companies, Release Date, Vote Average, etc.
Looking at how the data is laid out in the dataset, various questions can be asked. For example:
- How has the popularity of movies changed over the years?
- Considering the five most recent years, how is revenue distributed across the different score rating levels?
- How is revenue distributed across the different popularity levels?
- What kinds of properties are associated with movies that have high popularity?
- What kinds of properties are associated with movies that have a high voting score?
- How many movies are released year by year?
- What are the keyword trends by generation?
In this series of blog posts, we are going to answer the questions above using the TMDB movie data, NumPy, pandas, and Matplotlib.
For this blog post, we will focus on general comments about the data.
First of all, let's import the needed packages:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter
%matplotlib inline
Data Wrangling
General Properties
Let's load the dataset and look at its info:
df = pd.read_csv('tmdb-movies.csv')
df.info()
Judging from the info above, the dataset has 10866 entries and 21 columns. The types used are int, float and string. Comparing the total number of entries with the number of non-null entries per column, a lot of columns have null values. Let's check the exact number of null records per column.
list(df.isnull().sum().items())
Looking at the result above, we see that the columns that have null values are cast, homepage, director, tagline, keywords, overview, genres and production_companies. We also see that homepage, tagline, keywords and production_companies have a lot of null records. I decided to get rid of tagline and keywords since they have a lot of null values.
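The raw list of null counts is easier to scan once it is filtered, sorted and expressed as a share of rows. Here is a minimal sketch of that idea on a tiny made-up frame; the `toy` data and its values are hypothetical and only stand in for the real dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for tmdb-movies.csv (values are made up).
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "tagline": [np.nan, "A tagline", np.nan, np.nan],
    "director": ["A", None, "C", "D"],
})

# Count nulls per column, keep only columns that have any, sort descending.
null_counts = toy.isnull().sum()
null_counts = null_counts[null_counts > 0].sort_values(ascending=False)

# Share of rows that are null in each of those columns, as a percentage.
null_share = (null_counts / len(toy) * 100).round(1)

print(pd.DataFrame({"nulls": null_counts, "percent": null_share}))
```

On the real data, the same three lines can be run on `df` instead of `toy` to see which columns are mostly empty before deciding what to drop.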
Let's try to get more descriptive information from the dataset
df.describe()
If we look at the popularity column, we can spot some outliers. Since it has no upper bound, it is better to just retain the original data. We can also see that there are a lot of zero values in the budget, revenue and runtime columns. The first guess might be that these movies were not released, but if we look at the release_year column, we notice that the minimum value (1996) is a valid year and that there are no null values. Therefore those movies were released. Maybe the zeros mean the absence of data. However, before deciding on that, let's look closely at those records.
First for the budget
df_budget_zero = df.query('budget == 0')
df_budget_zero.head(3)
Then for the revenue
df_revenue_zero = df.query('revenue == 0')
df_revenue_zero.head(3)
After looking up Afonso Poyart's film Solace on Wikipedia (the Production section of the film's article), I noticed that the film was actually a success. This means there was a successful release, which in turn means there was a budget. Therefore, the zero values are missing data. Based on that, I would drop these records, since they might skew the statistics of my analysis.
Next, let's count the zero values to decide whether they should be set to null or dropped completely.
First for the budget zero values
df_budget_0count = df.groupby('budget').count()['id']
df_budget_0count.head(2)
As the results suggest, there are a lot more zero values than non-zero values. Dropping them would distort the results, so it is better to set them to null instead.
Then for the revenue zero values
df_revenue_0count = df.groupby('revenue').count()['id']
df_revenue_0count.head(2)
The situation is the same, so these are set to null as well.
Finally, for the runtime:
The number of zeros is negligible, so those rows can be dropped.
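The zero counts can also be read off directly with a boolean comparison, which covers budget, revenue and runtime in one pass. This is a minimal sketch under the assumption that a zero means "not recorded"; the `toy` frame and its numbers are hypothetical, not values from the real dataset:

```python
import pandas as pd

# Hypothetical toy frame; the real columns come from tmdb-movies.csv.
toy = pd.DataFrame({
    "budget": [0, 0, 0, 150, 30],
    "revenue": [0, 0, 10, 0, 90],
    "runtime": [0, 95, 120, 101, 88],
})

cols = ["budget", "revenue", "runtime"]

# True where a value is exactly zero; summing counts the zeros per column.
is_zero = toy[cols] == 0
zero_counts = is_zero.sum()

# The mean of a boolean column is the fraction of rows that are zero.
zero_share = is_zero.mean()

print(pd.DataFrame({"zeros": zero_counts, "share": zero_share}))
```

A large share argues for keeping the rows and marking the zeros as null, while a negligible share argues for simply dropping the rows.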
Summary
Remove columns that have a lot of null values or are unnecessary for answering the questions: homepage, tagline, imdb_id, overview, budget_adj, revenue_adj.
Remove duplicated rows.
Remove rows with null values in the columns that have them.
Replace zero values with null values in the budget and revenue columns.
Drop the rows with runtime == 0.
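The steps above could be sketched as a single function. This is only an illustration under stated assumptions: `clean_movies` is a hypothetical helper name, the `toy` frame is made up, and the steps are applied in the order listed (so the nulls created from zero budgets and revenues are kept rather than dropped):

```python
import numpy as np
import pandas as pd

# Columns named in the summary as not needed for the questions.
DROP_COLS = ["homepage", "tagline", "imdb_id", "overview",
             "budget_adj", "revenue_adj"]

def clean_movies(df):
    """Hypothetical helper applying the summary steps in order."""
    # 1. Drop unneeded columns (ignore any that are not present).
    out = df.drop(columns=DROP_COLS, errors="ignore")
    # 2. Remove duplicated rows.
    out = out.drop_duplicates()
    # 3. Remove rows that still contain null values.
    out = out.dropna().copy()
    # 4. Mark zero budgets and revenues as missing.
    out[["budget", "revenue"]] = out[["budget", "revenue"]].replace(0, np.nan)
    # 5. Drop rows whose runtime is zero.
    out = out[out["runtime"] != 0]
    return out.reset_index(drop=True)

# Made-up data: one duplicate, one null director, one zero runtime.
toy = pd.DataFrame({
    "id": [1, 1, 2, 3, 4],
    "homepage": [None, None, "http://example.com", None, None],
    "director": ["A", "A", "B", None, "D"],
    "budget": [0, 0, 100, 50, 20],
    "revenue": [10, 10, 0, 70, 40],
    "runtime": [90, 90, 100, 110, 0],
})

cleaned = clean_movies(toy)
print(cleaned)
```

Because the original frame is never modified in place, the raw `df` stays available for comparison after cleaning.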
The first part ends here. If you enjoyed reading this, kindly check out the second part, which is about data cleaning.
Thank you for reading