Deciphering Standardization and Normalization: Understanding Feature Scaling Techniques

esakik

Koki Esaki

Posted on February 3, 2024


Importance of Feature Scaling

Many machine learning algorithms, such as linear regression and neural networks, work better or converge faster when the features are on a similar scale. Feature scaling techniques such as standardization and normalization bring the features onto a similar scale.

For example, with features like age and income, a model may weight income more heavily than age simply because income values are orders of magnitude larger.
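To make the age and income example concrete, here is a minimal sketch. The three people and their values are made up for illustration (they are assumptions, not data from this post), and `age_share` is a hypothetical helper. It measures how much of the squared Euclidean distance between two people comes from the age column, before and after standardization.

```python
import numpy as np

# Hypothetical people: columns are [age in years, income in dollars].
people = np.array([
    [25.0, 50_000.0],   # person A
    [60.0, 51_000.0],   # person B: very different age, similar income
    [26.0, 58_000.0],   # person C: similar age, different income
])


def age_share(u, v):
    """Fraction of the squared Euclidean distance that comes from age (column 0)."""
    squared_diff = (u - v) ** 2
    return float(squared_diff[0] / squared_diff.sum())


# Standardize each column: subtract its mean, divide by its standard deviation.
scaled = (people - people.mean(axis=0)) / people.std(axis=0)

raw_share = age_share(people[0], people[1])
scaled_share = age_share(scaled[0], scaled[1])

print(f"Age share of A-B distance, raw:    {raw_share:.4f}")
print(f"Age share of A-B distance, scaled: {scaled_share:.4f}")
```

On the raw values, the 35-year age gap between A and B contributes well under 1% of the squared distance because the income column is so much larger. After standardization, age accounts for most of it.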

[Figure: Feature Engineering]

Standardization (Z-score normalization)

Standardization rescales the features of a dataset so that each has a mean of 0 and a standard deviation (SD) of 1. It is achieved by subtracting the feature's mean from each value and then dividing by the feature's standard deviation.

The formula for standardization is:

x'_i = \frac{x_i - \mathrm{mean}(x)}{\mathrm{SD}(x)}

Standardization is generally less affected by outliers than normalization, because min-max scaling depends entirely on the two most extreme values. This method is therefore often used when the maximum and minimum values are not fixed or when outliers exist.

from sklearn import preprocessing
import numpy as np


X_train = np.array([[ 1., -1.,  2.],
                    [ 2.,  0.,  0.],
                    [ 0.,  1., -1.]])

scaler = preprocessing.StandardScaler().fit(X_train)
X_scaled = scaler.transform(X_train)
print(X_scaled)
Output:
array([[ 0.  ..., -1.22...,  1.33...],
       [ 1.22...,  0.  ..., -0.26...],
       [-1.22...,  1.22..., -1.06...]])
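As a rough check that the scaler applies the formula above, the sketch below recomputes the result from the fitted scaler's stored statistics (`mean_` and `scale_`) and then reuses the same statistics on a new row. `X_new` is a hypothetical sample added for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1., -1., 2.],
                    [2., 0., 0.],
                    [0., 1., -1.]])

scaler = StandardScaler().fit(X_train)

# The fitted scaler stores each column's mean and standard deviation.
manual = (X_train - scaler.mean_) / scaler.scale_
matches = np.allclose(manual, scaler.transform(X_train))

# Unseen rows are scaled with the statistics learned from X_train,
# not with statistics recomputed from the new data.
X_new = np.array([[-1., 1., 0.]])  # hypothetical new sample
X_new_scaled = scaler.transform(X_new)

print(matches)
print(X_new_scaled)
```

Fitting once and calling `transform` on later data keeps training and test features on the same scale.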

Normalization (Min-Max scaling)

Normalization scales the features of a dataset to a specific range, typically between 0 and 1. This is achieved by subtracting the feature's minimum from each value and then dividing by the feature's range (maximum minus minimum).

The formula for normalization is:

x'_i = \frac{x_i - \min(x)}{\max(x) - \min(x)}

X_train = np.array([[ 1., -1.,  2.],
                    [ 2.,  0.,  0.],
                    [ 0.,  1., -1.]])

min_max_scaler = preprocessing.MinMaxScaler()
X_train_minmax = min_max_scaler.fit_transform(X_train)
print(X_train_minmax)
Output:
array([[0.5       , 0.        , 1.        ],
       [1.        , 0.5       , 0.33333333],
       [0.        , 1.        , 0.        ]])
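The claim that standardization is less affected by outliers than normalization can be explored with a small experiment. The sketch below uses a made-up feature (an evenly spaced grid plus one extreme value, both assumptions for illustration) and compares how much each method's denominator grows once the outlier is added: the range for min-max scaling, the standard deviation for standardization.

```python
import numpy as np

typical = np.linspace(0.0, 10.0, 101)      # hypothetical well-behaved feature
with_outlier = np.append(typical, 1000.0)  # same feature plus one extreme value


def value_range(x):
    """Denominator used by min-max scaling."""
    return float(np.max(x) - np.min(x))


# Growth factor of each denominator after the outlier is added.
range_growth = value_range(with_outlier) / value_range(typical)
sd_growth = float(np.std(with_outlier) / np.std(typical))

print(f"Range grows by a factor of {range_growth:.1f}")
print(f"SD grows by a factor of    {sd_growth:.1f}")
```

In this sketch both denominators grow, so neither method is immune, but the standard deviation grows noticeably less than the range. The gap depends on the sample size: with only a handful of points, a single outlier can distort the standard deviation almost as much as the range.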

Implementations from Scratch

First, install the pinned library versions, then import the libraries, load the Iris dataset, and keep two of its features, petal length and petal width, for the demonstration.

pip install numpy==1.23.5 pandas==1.5.3 scikit-learn==1.2.2 matplotlib==3.7.4

With the packages installed, the imports and data loading look like this:
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris


iris = load_iris()
data = pd.DataFrame(iris.data, columns=iris.feature_names)
X = data.iloc[:, 2:]

Standardization transforms each feature to have a mean of zero and a variance of one. The following code demonstrates how to standardize the dataset.

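def standardize(X):
    # Subtract each column's mean, then divide by its standard deviation.
    # np.std defaults to the population SD (ddof=0), the convention StandardScaler also uses.
    return (X - np.mean(X, axis=0)) / np.std(X, axis=0)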


X_std = standardize(X)
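One detail worth checking when writing this by hand is which standard deviation is used. `np.std` defaults to `ddof=0` (divide by n), while pandas' own `DataFrame.std()` defaults to `ddof=1` (divide by n - 1). The sketch below, which assumes the same two Iris columns as above, shows that the two differ by a constant factor.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names).iloc[:, 2:]

values = X.to_numpy()
pop_sd = values.std(axis=0)      # NumPy default, ddof=0: divide by n
sample_sd = X.std().to_numpy()   # pandas default, ddof=1: divide by n - 1

n = len(X)
ratio = sample_sd / pop_sd       # expected to equal sqrt(n / (n - 1)) per column

print(pop_sd)
print(sample_sd)
print(ratio, np.sqrt(n / (n - 1)))
```

For 150 rows the difference is small, but swapping `np.std(X, axis=0)` for `X.std()` would give values that no longer match `StandardScaler` exactly.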

Normalization is a 0-1 scaling method: after the transform, each feature's minimum is 0 and its maximum is 1. The following code shows how to normalize the dataset.

def normalize(X):
    # Shift each column so its minimum is 0, then divide by its range (max - min).
    return (X - np.min(X, axis=0)) / (np.max(X, axis=0) - np.min(X, axis=0))


X_norm = normalize(X)
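A quick way to gain confidence in the two from-scratch functions is to compare them with scikit-learn's scalers on the same data. This is a sketch of such a check rather than a full test suite. It redefines the functions so that it runs on its own, and it converts the data to a plain NumPy array (an assumption made here to keep pandas-specific behaviour out of the comparison).

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.preprocessing import MinMaxScaler, StandardScaler

iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names).iloc[:, 2:]
values = X.to_numpy()


def standardize(X):
    return (X - np.mean(X, axis=0)) / np.std(X, axis=0)


def normalize(X):
    return (X - np.min(X, axis=0)) / (np.max(X, axis=0) - np.min(X, axis=0))


# Compare each from-scratch result with the corresponding scikit-learn scaler.
std_matches = np.allclose(standardize(values), StandardScaler().fit_transform(values))
norm_matches = np.allclose(normalize(values), MinMaxScaler().fit_transform(values))

print(std_matches, norm_matches)
```

Both comparisons should agree to within floating-point tolerance, since `StandardScaler` uses the population standard deviation and `MinMaxScaler` defaults to the range 0 to 1.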

The preprocessing results can be visualized with scatter plots. The first plot shows the original dataset, the second shows the standardized dataset, and the third shows the normalized dataset. Scaling changes the axis ranges but not the relative positions of the points.

import matplotlib.pyplot as plt


fig = plt.figure(figsize=(16, 12))

ax = fig.add_subplot(2, 2, 1)
ax.scatter(X.iloc[:, 0], X.iloc[:, 1])
ax.set_title("Before Scaling")
ax.set_xlabel("petal length (cm)")
ax.set_ylabel("petal width (cm)")

ax = fig.add_subplot(2, 2, 3)
ax.scatter(X_std.iloc[:, 0], X_std.iloc[:, 1])
ax.set_title("After Standardization")
ax.set_xlabel("petal length (standardized)")
ax.set_ylabel("petal width (standardized)")

ax = fig.add_subplot(2, 2, 4)
ax.scatter(X_norm.iloc[:, 0], X_norm.iloc[:, 1])
ax.set_title("After Normalization")
ax.set_xlabel("petal length (normalized)")
ax.set_ylabel("petal width (normalized)")

plt.show()

[Figure: Feature Scaling]

