Investigating the TMDB movie dataset, part 2
nkpremices
Posted on March 12, 2020
This blog post is the second part of a series. I recommend reading the first part before this one.
In this post, I am going to talk about data cleaning. We are going to take the results from the first part and build from there.
Data Cleaning
Step 1. Remove some columns with a lot of null values.
df.drop(['imdb_id', 'homepage', 'tagline', 'overview', 'budget_adj', 'revenue_adj'], axis=1, inplace=True)
df.head(1)
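Before dropping columns, it can help to measure how much of each column is actually missing. The snippet below is a minimal sketch on a made-up toy frame (the values and the 50% threshold are assumptions for illustration, not taken from the TMDB data):

```python
import pandas as pd

# Toy frame standing in for the real dataset; all values are invented.
toy = pd.DataFrame({
    "original_title": ["A", "B", "C", "D"],
    "homepage": [None, None, "http://example.com", None],
    "tagline": [None, "x", None, None],
    "budget": [100, 0, 250, 300],
})

# Share of missing values per column, highest first.
null_share = toy.isnull().mean().sort_values(ascending=False)

# Columns with more than half of their rows missing (arbitrary threshold).
mostly_null = null_share[null_share > 0.5].index.tolist()

# Drop them without modifying the original frame.
trimmed = toy.drop(columns=mostly_null)

print(null_share)
print(trimmed.columns.tolist())
```

On the real data, the same `isnull().mean()` call on `df` gives a quick way to decide which columns are worth keeping.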
Step 2. Remove duplicate rows.
df.drop_duplicates(inplace=True)
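If you want to know how many rows are about to disappear, you can count the duplicates first. Here is a small sketch on an invented toy frame (the column names and values are assumptions for illustration):

```python
import pandas as pd

# Toy frame with one exact duplicate row; values are invented.
toy = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "original_title": ["A", "B", "B", "C"],
})

# duplicated() flags every row that repeats an earlier row.
n_dupes = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each row by default.
deduped = toy.drop_duplicates()

print(n_dupes, len(toy), len(deduped))
```

Running `df.duplicated().sum()` before and after the drop is an easy way to confirm the step did what you expected.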
Step 3. Drop the rows that have null values in the cast, director, or genres columns.
df.dropna(subset = ['cast', 'director', 'genres'], how='any', inplace=True)
Let's check if there are still null values
df.isnull().sum()
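To see what `subset` and `how='any'` do, here is a minimal sketch on a toy frame (all values are invented). A row is dropped as soon as one of the listed columns is null, while nulls in other columns are left alone:

```python
import numpy as np
import pandas as pd

# Toy frame; values are invented for illustration.
toy = pd.DataFrame({
    "cast": ["x", None, "z", "w"],
    "director": ["d1", "d2", None, "d4"],
    "genres": ["Drama", "Comedy", "Action", "Drama"],
    "budget": [np.nan, 10.0, 20.0, 30.0],
})

# Drop a row if ANY of the three listed columns is null.
kept = toy.dropna(subset=["cast", "director", "genres"], how="any")

# Nulls per column after the drop; budget is not in subset, so its null survives.
remaining_nulls = kept.isnull().sum()

print(kept)
print(remaining_nulls)
```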
Step 4. Replace zero values with null values in the budget and revenue columns.
# np.nan is used here because the np.NaN alias was removed in NumPy 2.0
df['budget'] = df['budget'].replace(0, np.nan)
df['revenue'] = df['revenue'].replace(0, np.nan)
df.info()
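Why bother with this replacement? A zero budget or revenue most likely means "unknown", and pandas skips null values when computing statistics, whereas zeros drag them down. A small sketch with invented numbers shows the effect:

```python
import numpy as np
import pandas as pd

# Toy budget column where 0 stands for "unknown"; values are invented.
budget = pd.Series([0, 0, 100, 200], name="budget")

# With the zeros in place, they count as real observations.
mean_with_zeros = budget.mean()

# After the replacement, mean() and min() skip the missing values.
cleaned = budget.replace(0, np.nan)
mean_without_zeros = cleaned.mean()
min_without_zeros = cleaned.min()

print(mean_with_zeros, mean_without_zeros, min_without_zeros)
```

The mean goes from 75.0 to 150.0 and the minimum from 0 to 100.0, simply because the placeholder zeros are no longer treated as data.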
Step 5. Drop the rows where the runtime is zero, then check that none are left.
df.query('runtime != 0', inplace=True)
df.query('runtime == 0')
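If the `query` syntax is new to you, it is equivalent to filtering with a boolean mask. The sketch below uses an invented toy frame to show both forms side by side:

```python
import pandas as pd

# Toy frame with one zero runtime; values are invented.
toy = pd.DataFrame({
    "original_title": ["A", "B", "C"],
    "runtime": [0, 95, 120],
})

# Keep only the rows with a non-zero runtime, using query().
with_query = toy.query("runtime != 0")

# The same filter written as a boolean mask.
with_mask = toy[toy["runtime"] != 0]

# Sanity check: no zero-runtime rows should remain.
leftover = with_query.query("runtime == 0")

print(with_query.equals(with_mask), leftover.empty)
```

An empty result from the second query is what tells you the cleanup worked.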
df.info()
df.describe()
From the table above, we can see that replacing the zeros with null values made the budget and revenue distributions look more reasonable. The minimum values also make more sense now, since they are no longer zero.
This is the end of the second part. If you enjoyed reading it, stay tuned. I will post the third part soon.
Thank you for reading.