Implementing the K Nearest Neighbors algorithm from scratch in Python
Fortune Adekogbe
Posted on March 2, 2021
To many, machine learning is a black box: the inputs and outputs are known, but the process in between is not well understood. For most classical algorithms, however, this need not be the case. This piece walks you through implementing the K Nearest Neighbors algorithm from scratch in Python.
The principle behind nearest neighbor methods is to find a predefined number of training samples closest in distance to a new point and predict its label from them. The number of samples is a user-defined constant.
Despite their simplicity, nearest neighbor methods have been successful in a large number of classification and regression problems, including handwritten digit recognition and satellite image scene classification.
The optimal choice of the value k depends on the data: in general, a larger value suppresses the effects of noise but makes the classification boundaries less distinct. Enough of the talking, let's begin...
import numpy as np
from scipy.spatial.distance import euclidean
from sklearn.base import BaseEstimator
class KNNBase(BaseEstimator):
...
First, we import the relevant modules: numpy, euclidean and BaseEstimator. The euclidean function determines how the distance between two points is calculated, while BaseEstimator is the base class for all estimators in scikit-learn. The KNNBase class takes BaseEstimator as its parent and inherits its methods (such as get_params and set_params). It then serves as the base class for both the K Nearest Neighbors classifier and regressor.
class KNNBase(BaseEstimator):
    def __init__(self, n_neighbors=5, distance_func=euclidean):
        self.n_neighbors = n_neighbors
        self.distance_func = distance_func

    def fit(self, X, Y):
        self.X = X
        self.Y = Y
        return self
The KNNBase class is initialized in the __init__ method with two parameters: the number of neighbors and the distance function. The former is the number of neighboring training points used to determine the prediction for a given data point, while the latter determines how distance is calculated. The default values are 5 for n_neighbors and euclidean for distance_func (any distance function from scipy.spatial.distance will do).
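As a quick aside, because KNNBase inherits from BaseEstimator, it already gets scikit-learn conveniences like get_params for free, and we can swap in any other scipy distance. Here is a small illustrative sketch (the choice of cityblock, i.e. Manhattan distance, is just an example and not part of the original code):

from scipy.spatial.distance import cityblock

# Default configuration: 5 neighbors, Euclidean distance.
knn = KNNBase()
print(knn.get_params())   # {'distance_func': <function euclidean ...>, 'n_neighbors': 5}

# Swapping in another scipy distance function, e.g. Manhattan (cityblock) distance.
knn_manhattan = KNNBase(n_neighbors=3, distance_func=cityblock)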
The fit method takes in the training features X and targets Y and stores them as instance variables. It then returns the now updated instance via self.
    def _vote(self, neighbors_targets):
        raise NotImplementedError()

    def predict(self, X_test):
        Y_pred = np.empty(X_test.shape[0])
        for index, entry in enumerate(X_test):
            # Distance from this test point to every training example.
            distances = [self.distance_func(i, entry) for i in self.X]
            # Indices of the k closest training examples.
            neighbors = np.argsort(distances)[: self.n_neighbors]
            # Targets of those neighbors, combined by the subclass's _vote.
            neighbors_prediction = [self.Y[j] for j in neighbors]
            Y_pred[index] = self._vote(neighbors_prediction)
        return Y_pred
The _vote method is a helper that is not implemented in the KNNBase class, since it differs between the classifier and the regressor. Its job is to return a single prediction for a data point given the targets of its neighbors.
The predict method is the powerhouse of the algorithm. It takes in the data points X_test whose targets are to be predicted. The first thing we do is create an empty array with the same number of elements as there are data points in X_test.
Next, we loop through X_test and, for each data point, calculate its distance from every training example using the chosen distance function.
These distances are then sorted with np.argsort, which returns the indices that would sort the array rather than the sorted values themselves. This is useful because we need those indices to look up the targets of the nearest neighbors.
The first k entries are sliced out of the sorted index array, and the targets at those positions are stored in neighbors_prediction. A vote is made on these values by calling the self._vote method, and the result is stored by filling in the once-empty array Y_pred.
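To make the np.argsort step concrete, here is a small illustration (the distance values are made up):

import numpy as np

distances = [0.9, 0.1, 0.5, 0.3]
order = np.argsort(distances)   # array([1, 3, 2, 0]) -- indices, not values
nearest_three = order[:3]       # array([1, 3, 2]) -- indices of the 3 closest training points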
The complete KNNBase class is thus:
class KNNBase(BaseEstimator):
    def __init__(self, n_neighbors=5, distance_func=euclidean):
        self.n_neighbors = n_neighbors
        self.distance_func = distance_func

    def fit(self, X, Y):
        self.X = X
        self.Y = Y
        return self

    def _vote(self, neighbors_targets):
        raise NotImplementedError()

    def predict(self, X_test):
        """
        Predict the targets of the test data.
        """
        Y_pred = np.empty(X_test.shape[0])
        for index, entry in enumerate(X_test):
            distances = [self.distance_func(i, entry) for i in self.X]
            neighbors = np.argsort(distances)[: self.n_neighbors]
            neighbors_prediction = [self.Y[j] for j in neighbors]
            Y_pred[index] = self._vote(neighbors_prediction)
        return Y_pred
The Classifier
class KNNClassifier(KNNBase):
    def _vote(self, neighbors_targets):
        count_ = np.bincount(neighbors_targets)
        return np.argmax(count_)
This class inherits from the KNNBase class and implements the _vote method. In _vote, we take in the list of neighbor targets and use np.bincount to count how many times each class label occurs in it (note that np.bincount expects non-negative integer labels). We then return the label with the highest count using np.argmax. If there is a tie for the most common label among the neighbors, the predicted label is arbitrary.
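As a quick illustration of the voting logic (the labels here are made up):

import numpy as np

neighbors_targets = [2, 0, 2, 2, 1]       # labels of the 5 nearest neighbors
counts = np.bincount(neighbors_targets)   # array([1, 1, 3]) -- label 0 once, 1 once, 2 three times
print(np.argmax(counts))                  # 2, the majority label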
The Regressor
class KNNRegressor(KNNBase):
    def _vote(self, neighbors_targets):
        return np.mean(neighbors_targets)
This class also inherits from the KNNBase class and implements the _vote method. Here, however, _vote takes in the list of neighbor targets and uses np.mean to return the average of their values.
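For example, with three hypothetical neighbor targets (the numbers are made up), the regression vote is simply their mean:

import numpy as np

neighbors_targets = [2.0, 4.0, 9.0]   # targets of the 3 nearest neighbors
print(np.mean(neighbors_targets))     # 5.0, the value the regressor would predict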
To be sure that our algorithm works, we will compare its results on the Boston housing dataset with those of the scikit-learn implementation. (Note that load_boston is deprecated and has been removed in recent versions of scikit-learn, so you may need an older version or a different regression dataset to run this snippet.)
from sklearn.datasets import load_boston
from sklearn.neighbors import KNeighborsRegressor

X, y = load_boston(return_X_y=True)

model = KNNRegressor().fit(X, y)
pred = model.predict(X[:10])
print("For the first 10 data points,")
print(f"Our implementation gives: {pred}")

model1 = KNeighborsRegressor().fit(X, y)
pred1 = model1.predict(X[:10])
print(f"The Sklearn Implementation gives: {pred1}")
For the first 10 data points,
Our implementation gives: [21.78 22.9  25.36 26.06 27.1  27.1  20.88 19.1  18.4  19.48]
The Sklearn Implementation gives: [21.78 22.9  25.36 26.06 27.1  27.1  20.88 19.1  18.4  19.48]
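The comparison above only covers the regressor. As an optional extra check, not part of the original comparison, the classifier can be sanity-checked the same way against scikit-learn's KNeighborsClassifier, for instance on the iris dataset:

from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

ours = KNNClassifier().fit(X, y).predict(X[:10])
theirs = KNeighborsClassifier().fit(X, y).predict(X[:10])
print(ours)    # predictions from our implementation
print(theirs)  # predictions from scikit-learn; the two should agree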
We have now successfully implemented the K Nearest Neighbors algorithm. Feel free to test it on any other dataset.
I hope you enjoyed reading this. If you have any questions, you can contact me on Twitter @phortz.