Introduction to Data Preprocessing

Manav Modi

Posted on January 21, 2022

What is Data Preprocessing?

Data preprocessing comes right after you have cleaned up your data and done some exploratory data analysis. It is the step where you prepare the data for modeling. Most modeling libraries in Python expect numerical input, so a large part of this step is getting the data into that form.

Refreshing Pandas Skills

You can skip this section if you know the basics.

Before we proceed with the series, it helps to know a few commands for getting to know your dataset.

import pandas as pd
hiking = pd.read_json("datasets/hiking.json")
print(hiking.head())

print(hiking.columns)

print(hiking.dtypes)
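
If you do not have the hiking file at hand, the same kind of inspection can be tried on any small DataFrame. The sketch below uses a made-up table (the `trails` name, its columns and its values are assumptions for illustration, not the real hiking data) and adds `shape`, `info()` and `describe()`, which are handy alongside the commands above.

```python
import pandas as pd

# Hypothetical stand-in for the hiking data; names and values are made up.
trails = pd.DataFrame({
    "name": ["Ridge Loop", "Lake Trail", "Summit Path"],
    "length_km": [4.2, 7.5, 12.0],
    "difficulty": ["easy", "moderate", "hard"],
})

print(trails.shape)       # (number of rows, number of columns)
trails.info()             # column names, non-null counts and dtypes in one summary
print(trails.describe())  # summary statistics for the numeric columns only
```

`info()` is often the quickest first look, since it shows missing values and datatypes together.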

Removing Missing Data

Sample Data

[Image: the sample DataFrame df used in the examples below]
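
To follow along without the original data, you could build a small DataFrame of your own. This is only a sketch: the values below are invented, and the only things taken from the examples in this post are the name `df` and the column names "A", "B" and "C".

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the values in the original post are not known.
df = pd.DataFrame({
    "A": [1.0, np.nan, 7.0, 4.0],
    "B": [np.nan, 3.0, np.nan, 7.0],
    "C": [2.0, 5.0, 8.0, 9.0],
})

print(df)
print(df.isnull().sum())  # number of missing values in each column
```

Counting the nulls per column first gives you an idea of how much data each of the following commands would remove.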

Dropping rows with null values

print(df.dropna())
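
By default `dropna()` removes every row that has at least one null, which can be more aggressive than you want. The sketch below, on invented data, shows a few of its parameters (`how`, `subset`, `thresh`) that narrow this down. Note that `dropna()` returns a new DataFrame and leaves `df` itself unchanged.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the last row is entirely null.
df = pd.DataFrame({
    "A": [1.0, np.nan, 7.0, np.nan],
    "B": [np.nan, 3.0, 2.0, np.nan],
    "C": [4.0, 5.0, 6.0, np.nan],
})

any_missing = df.dropna()            # drop rows with at least one null (how="any" is the default)
all_missing = df.dropna(how="all")   # drop only rows where every value is null
subset_b = df.dropna(subset=["B"])   # only look for nulls in column "B"
at_least_two = df.dropna(thresh=2)   # keep rows with at least 2 non-null values

print(len(df), len(any_missing), len(all_missing), len(subset_b), len(at_least_two))
```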

Dropping specific rows by passing a list of index labels

print(df.drop([1,2,3]))

Dropping a specific column (axis=1 tells drop to look for the label among the columns instead of the rows)

print(df.drop("A", axis=1))
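
If you find `axis` hard to remember, `drop` also accepts the `index` and `columns` keywords, which say the same thing more explicitly. A minimal sketch on invented data:

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({"A": [1, 2, 3, 4, 5], "B": [10, 20, 30, 40, 50]})

rows_dropped = df.drop(index=[1, 2, 3])  # same as df.drop([1, 2, 3])
col_dropped = df.drop(columns=["A"])     # same as df.drop("A", axis=1)

print(rows_dropped.index.tolist())   # index labels that remain
print(col_dropped.columns.tolist())  # columns that remain
print(df.shape)                      # the original DataFrame is unchanged
```

Like `dropna`, `drop` returns a new DataFrame, so assign the result if you want to keep it.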

Keeping only the rows where a specific column is not null

print(df[df["B"].notnull()])
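
It can help to see the two steps in that one-liner separately: `notnull()` builds a boolean mask, and indexing with the mask keeps the rows where it is True. The sketch below uses invented data and also checks that the result matches `dropna(subset=["B"])`.

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({"A": [1.0, 2.0, np.nan], "B": [np.nan, 5.0, 6.0]})

mask = df["B"].notnull()  # boolean Series: True where "B" has a value
print(mask.tolist())

filtered = df[mask]       # keep only the rows where the mask is True
print(filtered)

# Nulls in other columns (here "A") are left alone, just like dropna(subset=["B"]).
print(filtered.equals(df.dropna(subset=["B"])))
```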

Working on DataTypes

While preprocessing, you will often find that a column's datatype is not the one you want. Use astype to convert the column.

Remember: Always choose a datatype that fits every value in the column. If even one value cannot be converted, astype raises an error.

This code sample will help you convert column "C" to the float datatype.

df["C"] = df["C"].astype("float")
print(df.dtypes)
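
To see what "fits all of the data" means in practice, here is a small sketch on invented data. Column "D" and its values are assumptions added for the example. It also uses `pd.to_numeric` with `errors="coerce"`, which is one way (not the only one) to find the values that would not convert.

```python
import pandas as pd

# Hypothetical data: "C" holds only numeric strings, "D" has one bad value.
df = pd.DataFrame({"C": ["1.5", "2", "3.25"], "D": ["4", "five", "6"]})

df["C"] = df["C"].astype("float")  # every value parses as a number, so this works
print(df.dtypes)

try:
    df["D"].astype("float")        # "five" cannot be parsed as a float
except ValueError as err:
    print("conversion failed:", err)

# errors="coerce" turns unparseable values into NaN instead of raising,
# which makes it easy to count how many values do not fit.
coerced = pd.to_numeric(df["D"], errors="coerce")
print(coerced.isnull().sum())
```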

Stratified Sampling

A train/test split divides the dataset into one part for training the model and another for testing it.
Say the original dataset is 80% class 1 and 20% class 2. You would want a similar distribution in both the train and test sets so that each one represents the full dataset well. Stratified sampling does exactly that.

# Count the rows in each class of "labels"
print(y["labels"].value_counts())

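from sklearn.model_selection import train_test_split

# stratify=y keeps the class proportions of y in both splits
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y)
print(y_train["labels"].value_counts())
print(y_test["labels"].value_counts())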
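
Here is a self-contained sketch of the 80%/20% scenario described above, using synthetic data. The `feature` column, `test_size=0.25` and `random_state=42` are assumptions chosen for the example. It passes the label column itself to `stratify`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic data for illustration: 80 rows of class 1 and 20 rows of class 2.
X = pd.DataFrame({"feature": range(100)})
y = pd.DataFrame({"labels": [1] * 80 + [2] * 20})

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y["labels"], random_state=42
)

# normalize=True reports proportions instead of raw counts
print(y_train["labels"].value_counts(normalize=True))
print(y_test["labels"].value_counts(normalize=True))
```

Both splits should show the same 80/20 proportions as the full dataset. Without `stratify`, the proportions in a small test set can drift noticeably from the original ones.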

Check out the exercises linked to this here

Interested in Machine Learning content? Follow me on Twitter.
