7 Essential Techniques for Data Preprocessing Using Python: A Guide for Data Scientists

Data preprocessing is an important step in data science. It means cleaning, transforming and preparing raw data for analysis. Python is widely used for this purpose because it has a rich set of libraries and tools for data science and machine learning. In this blog we will walk through some common data preprocessing steps.

Below are some steps that anyone, even a beginner, can understand and use in their practice or learning.

1. Importing Libraries :

The first step in any project is importing the necessary libraries, which will be used throughout the code. Below are some common libraries.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

2. Loading Data :

The next step is to load the data to be processed. In Python, the pandas library is commonly used for this purpose. Pandas is a very powerful library for loading and processing data. To load data from a CSV file, pandas provides a function named read_csv.

df = pd.read_csv("data.csv")
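
After loading, it usually helps to take a quick look at what was read. Below is a minimal sketch of that check. It assumes a small made-up CSV (the columns age, city and salary are hypothetical) read from an in-memory string, so it runs without a data.csv file on disk.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for data.csv
csv_text = """age,city,salary
25,Pune,30000
32,Mumbai,
41,Delhi,52000
"""

# read_csv accepts a file path or any file-like object
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)   # (number of rows, number of columns)
print(df.head())  # first few rows
print(df.dtypes)  # data type pandas inferred for each column
```

The empty salary field in the second row is read as a missing value (NaN).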

3. Handling Missing Values :

When we work with real-world data, some values are often missing. Many algorithms cannot work on data that contains missing values, so this is one of the most important steps in data preprocessing, and missing values should be handled before performing any analysis. pandas includes various functions/methods for this. The isnull method returns a DataFrame of True/False values that marks where data is missing.

missing_values = df.isnull()
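
The True/False table returned by isnull is easier to read once it is summarized per column. Below is a minimal sketch, assuming a small made-up DataFrame (the column names are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical data with a few missing entries
df = pd.DataFrame({
    "age": [25, np.nan, 41, 38],
    "city": ["Pune", "Mumbai", None, "Delhi"],
    "salary": [30000.0, 45000.0, np.nan, np.nan],
})

# True counts as 1, so sum() gives the number of missing values per column
missing_per_column = df.isnull().sum()

# mean() of True/False values gives the fraction missing; times 100 for percent
missing_percent = df.isnull().mean() * 100

print(missing_per_column)
print(missing_percent)
```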

To fill in a value in place of the missing values, the fillna method can be used.

df.fillna(0,inplace=True)

Here inplace=True means the values are overwritten directly in the DataFrame, instead of a new DataFrame being returned.
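
Filling everything with 0 is not always a good fit, since 0 may be a misleading value for a column such as age. Below is a sketch of two common alternatives, assuming a made-up DataFrame with hypothetical column names: the median for a numeric column and the most frequent value for a text column.

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing entries
df = pd.DataFrame({
    "age": [25, np.nan, 41, 38],
    "city": ["Pune", "Mumbai", None, "Pune"],
})

# Numeric column: fill with the median of the known values
df["age"] = df["age"].fillna(df["age"].median())

# Text column: fill with the most frequent value (the mode)
df["city"] = df["city"].fillna(df["city"].mode()[0])

print(df)
```

Which fill value is appropriate depends on the data. Rows with missing values can also be dropped with df.dropna() when only a few rows are affected.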

4. Handling Outliers :

A dataset may contain some data points that are very different from the rest. These data points are known as outliers, and finding them is like finding the odd one out. The seaborn library has a function named boxplot, which can be used to visualize the distribution of the data and identify outliers.

sns.boxplot(x=df['column_name'])
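
By default, the points a boxplot draws beyond its whiskers are those lying more than 1.5 times the interquartile range (IQR) away from the quartiles. Below is a minimal sketch of the same rule in code, assuming a small made-up series of values, so the outliers can be listed and not only seen in the plot.

```python
import pandas as pd

# Hypothetical values; 95 is clearly different from the rest
values = pd.Series([10, 12, 11, 13, 12, 11, 95], name="column_name")

q1 = values.quantile(0.25)  # first quartile
q3 = values.quantile(0.75)  # third quartile
iqr = q3 - q1               # interquartile range

# Anything outside these bounds is flagged as an outlier
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = values[(values < lower) | (values > upper)]
print(outliers)
```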

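The z-score can be used to identify and remove outliers. The z-score measures how many standard deviations a data point lies from the mean. The zscore function is available in the scipy library (scipy.stats). The code below keeps only the rows whose absolute z-scores are all below 3.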

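from scipy import stats
# zscore needs numeric data, so use only the numeric columns
numeric = df.select_dtypes(include='number')
df = df[(np.abs(stats.zscore(numeric)) < 3).all(axis=1)]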
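
To see what the z-score filter does, it can help to compute the z-score by hand. Below is a sketch assuming a made-up DataFrame with hypothetical columns height and label. It uses the formula z = (value - mean) / standard deviation with the population standard deviation (ddof=0), which is also the default in scipy's zscore.

```python
import pandas as pd

# Hypothetical data: 20 ordinary heights and one extreme value
df = pd.DataFrame({
    "height": list(range(160, 180)) + [300],
    "label": ["ok"] * 20 + ["suspect"],
})

# z-scores only make sense for numeric columns
numeric = df.select_dtypes(include="number")

# z = (value - mean) / standard deviation, computed per column
z_scores = (numeric - numeric.mean()) / numeric.std(ddof=0)

# Keep rows where every numeric column is within 3 standard deviations
mask = (z_scores.abs() < 3).all(axis=1)
cleaned = df[mask]

print(len(df), len(cleaned))
```

The threshold of 3 is a common convention, not a fixed rule, and it may need adjusting for small or skewed datasets.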

5. Encoding the Categorical Variables :

Categorical variables are data that represent categories or group membership. This data is mostly in the form of strings or characters. But most machine learning models work only on numerical data, which is why we have to encode these variables in a numerical format.

There are several methods for encoding categorical variables.
We will discuss the get_dummies function from the pandas library.
This function creates dummy variables for categorical data. It adds one 0/1 column for each category found in the given column.

df = pd.get_dummies(df,columns=['column_name'],dtype=float)
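
Below is a small sketch of what get_dummies produces, assuming a made-up DataFrame with a hypothetical color column:

```python
import pandas as pd

# Hypothetical data with one categorical column
df = pd.DataFrame({
    "color": ["red", "green", "blue", "green"],
    "price": [10, 12, 9, 11],
})

# The color column is replaced by one 0/1 column per category
encoded = pd.get_dummies(df, columns=["color"], dtype=float)

print(encoded)
```

The result keeps price and adds the columns color_blue, color_green and color_red. Each row has 1.0 in the column matching its original color and 0.0 in the others.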

6. Feature Scaling :

Many models work better when the features are on a comparable scale. This is where feature scaling comes in. Using feature scaling, data points can be brought into a fixed range, such as 0 to 1. The sklearn library has a class named MinMaxScaler, which scales each feature to the range 0 to 1.

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df[['column_name']] = scaler.fit_transform(df[['column_name']])

The above code scales the values of the given column to the range 0 to 1.
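
Min-max scaling follows the formula (x - min) / (max - min). Below is a minimal sketch, assuming a made-up column of four values, that applies the formula by hand and compares it with MinMaxScaler.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical column of values
df = pd.DataFrame({"column_name": [10.0, 20.0, 30.0, 50.0]})

# Manual min-max scaling: (x - min) / (max - min)
col = df["column_name"]
manual = (col - col.min()) / (col.max() - col.min())

# The same scaling with sklearn
scaler = MinMaxScaler()
scaled = scaler.fit_transform(df[["column_name"]]).ravel()

print(manual.tolist())
print(scaled.tolist())
```

The smallest value maps to 0, the largest to 1, and the values in between keep their relative spacing.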

7. Feature Selection :

Feature selection involves choosing the most relevant features from a dataset, aiming to enhance the accuracy and efficiency of machine learning models. The sklearn library provides a class named SelectKBest, which selects the top K features from the dataset using statistical tests. By keeping only the most relevant features, we can improve the performance of our models in terms of speed and accuracy.

from sklearn.feature_selection import SelectKBest, chi2
x = df.drop('target_column', axis=1)
y = df['target_column']

selection = SelectKBest(chi2, k=3)
x_new = selection.fit_transform(x,y)

Here chi2 is the chi-squared test, which scores how strongly each feature is related to the target. Features with higher chi2 scores are selected, and k=3 is the number of features to select. Note that chi2 requires non-negative feature values and is meant for classification targets.
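
fit_transform returns a plain array, so the column names are lost. Below is a sketch of how the selected feature names and their scores can be recovered. It assumes a tiny made-up dataset (all feature names are hypothetical) and k=2 instead of 3, to keep the example small.

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

# Hypothetical non-negative features and a binary target
df = pd.DataFrame({
    "feature_a": [1, 2, 1, 8, 9, 8],  # differs clearly between classes
    "feature_b": [5, 4, 5, 5, 4, 5],  # same pattern in both classes
    "feature_c": [0, 1, 0, 3, 4, 3],  # differs between classes
    "feature_d": [3, 3, 2, 2, 3, 3],  # same total in both classes
    "target_column": [0, 0, 0, 1, 1, 1],
})

x = df.drop("target_column", axis=1)
y = df["target_column"]

selection = SelectKBest(chi2, k=2)
x_new = selection.fit_transform(x, y)

# get_support() returns a True/False mask over the original columns
selected = list(x.columns[selection.get_support()])

# scores_ holds the chi2 score of every feature
scores = pd.Series(selection.scores_, index=x.columns)

print(selected)
print(scores)
```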

Conclusion :

In conclusion, data preprocessing plays an important role in data science and machine learning. In this post we explored some fundamental techniques for data preprocessing using Python. By applying these techniques, we can clean, transform and prepare raw data for further analysis and modeling.
