Implementing Simple PCA using NumPy
Sıddık AÇIL
Posted on November 12, 2019
I am open to job offers; feel free to contact me about any vacancies abroad.
In this article, I will implement the PCA algorithm from scratch using Python's NumPy. To test my results, I used the PCA implementation from scikit-learn.
from sklearn.decomposition import PCA
import numpy as np
k = 1 # target dimension(s)
pca = PCA(k) # Create a new PCA instance
data = np.array([[0.5, 1], [0, 0]]) # 2x2 data matrix
print("Data: ", data)
print("Reduced: ", pca.fit_transform(data)) # fit and transform
This results in:
[[-0.55901699]
[ 0.55901699]]
Centering Data Points
Move the origin to the centroid of our data, i.e. subtract the mean of each feature (column).
data = data - data.mean(axis=0) # Center data points
print("Centered Matrix: ", data)
Get Covariance Matrix
Get the covariance matrix of our features.
cov = np.cov(data.T) # Get covariance matrix (np.cov already normalizes, so no extra division is needed)
print("Covariance matrix: ", cov)
Perform Eigendecomposition on Covariance Matrix
Eigendecomposition extracts the eigenvalues and corresponding eigenvectors of a matrix.
v, w = np.linalg.eig(cov) # v: eigenvalues, w: eigenvectors (as columns)
Sort Eigenvectors According to Eigenvalues
Most numerical libraries return eigenvectors pre-sorted. However, this is not the case for NumPy, so we need to argsort the eigenvalue vector to get the sorting indices and then reorder the columns of the eigenvector matrix accordingly.
idx = v.argsort()[::-1] # Sort descending and get sorted indices
v = v[idx] # Reorder the eigenvalue vector
w = w[:,idx] # Reorder the eigenvector columns to match
print("Eigenvalues: ", v)
print("Eigenvectors: ", w)
Get First K Eigenvectors
Our aim in PCA is to construct a new feature space. Eigenvectors are the axes of this new feature space and eigenvalues denote the magnitude of variance along each axis. In other words, a higher eigenvalue means more variance on the corresponding principal axis. Therefore, the axes with the highest variances are the most important features in this new feature space, since they hold most of the information. By taking the first K columns of the eigenvector matrix, the ones with the K biggest eigenvalues, we form what is called a projection matrix. The dot product of our data matrix and the projection matrix, which sounds fancier than it is, gives the reduced feature space: the result of PCA.
print("Sonuc: ", data.dot(w[:, :k]))
# Get the dot product of w with the first K columns of eigenvector matrix
# (a.k.a) projection matrix
Which also results in:
[[-0.55901699]
[ 0.55901699]]
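To recap, here is the whole pipeline gathered into a single function. This is only a minimal sketch (the name pca_numpy is my own); note that eigenvectors are determined only up to sign, so the output may differ from scikit-learn's by a factor of -1.
import numpy as np

def pca_numpy(data, k):
    """Project data onto its first k principal components (minimal sketch)."""
    data = data - data.mean(axis=0)  # center the data points
    cov = np.cov(data.T)             # covariance matrix of the features
    v, w = np.linalg.eig(cov)        # eigenvalues and eigenvectors
    idx = v.argsort()[::-1]          # indices that sort eigenvalues descending
    w = w[:, idx]                    # reorder eigenvector columns accordingly
    return data.dot(w[:, :k])        # project onto the first k principal axes

print(pca_numpy(np.array([[0.5, 1], [0, 0]]), 1))  # matches the result above up to sign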
This is it. Any corrections or suggestions are welcome.
Extras
Square vs. Non-Square: Eigendecomposition and SVD
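The covariance matrix is square, which is what makes eigendecomposition possible. SVD, by contrast, can be applied directly to the centered data matrix whatever its shape, and its right singular vectors are the same principal axes. A minimal sketch for comparison (again, only up to sign):
import numpy as np

data = np.array([[0.5, 1], [0, 0]])
data = data - data.mean(axis=0)  # center, as before

# SVD of the centered data: data = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(data, full_matrices=False)

k = 1
print(data.dot(Vt[:k].T))  # rows of Vt are the principal axes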
Original Turkish article: CPP Eigen ve Python Numpy ile PCA Gerçeklemesi | by Sıddık Açıl | Medium
My go-to article on PCA