Implementing the K Nearest Neighbors algorithm from scratch in Python

Fortune Adekogbe

Posted on March 2, 2021

To most people, machine learning is a black box technique: the inputs and outputs are known, but the process in between is not well understood. For most classical algorithms, however, this need not be the case. This piece aims to help you implement the K Nearest Neighbors algorithm in Python.

The principle behind nearest neighbor methods, in general, is to find a predefined number of training samples closest in distance to the new point, and predict the label from these. The number of samples, in this case, is a user-defined constant.

Despite their simplicity, nearest neighbor methods have been successful in a large number of classification and regression problems, including handwritten digit recognition and satellite image scene classification.

The optimal choice of the value k depends on the data, but in general a larger value suppresses the effects of noise while making the classification boundaries less distinct. Enough of the talking, let's begin...

import numpy as np
from scipy.spatial.distance import euclidean
from sklearn.base import BaseEstimator

class KNNBase(BaseEstimator):
    ...

First, we import the relevant names: numpy, euclidean, and BaseEstimator. The euclidean function determines how distance is calculated, while BaseEstimator is the base class for all estimators in sklearn.
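
As a quick sanity check, here is how euclidean behaves on its own (the points are made-up values for illustration):

from scipy.spatial.distance import euclidean

# Distance from (0, 0) to (3, 4) is sqrt(3**2 + 4**2) = 5.0
print(euclidean([0, 0], [3, 4]))  # 5.0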

The KNNBase class takes BaseEstimator as its parent and inherits its methods. It in turn serves as the base class for both the K Nearest Neighbors classifier and regressor.

class KNNBase(BaseEstimator):
    def __init__(self, n_neighbors=5, distance_func=euclidean):
        self.n_neighbors = n_neighbors
        self.distance_func = distance_func

    def fit(self, X, Y):
        self.X = X
        self.Y = Y
        return self

The KNNBase class is initialized in the __init__ method with 2 parameters: the number of neighbors and the distance function. The former specifies how many neighbors vote on the prediction for a given data point, while the latter determines how distance is calculated.

The defaults are 5 for n_neighbors and euclidean for distance_func (any distance function from scipy.spatial.distance will do).
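
For instance, here is a hypothetical sketch of swapping in the cityblock (Manhattan) distance from the same module once the class is complete:

from scipy.spatial.distance import cityblock

# Same estimator, different notion of distance
model = KNNBase(n_neighbors=3, distance_func=cityblock)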

The fit method takes in the training features X and targets Y and stores them as instance variables; no other computation happens at training time. It then returns the updated instance via self.
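
Because fit returns self, construction and fitting can be chained in one line, mirroring the sklearn idiom (X_train and Y_train below are placeholder arrays):

model = KNNBase(n_neighbors=3).fit(X_train, Y_train)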

    def _vote(self, neighbors_targets):
        raise NotImplementedError()

    def predict(self, X_test):
        Y_pred = np.empty(X_test.shape[0])
        for index, entry in enumerate(X_test):
            distances = [self.distance_func(i, entry) for i in self.X]
            neighbors = np.argsort(distances)[: self.n_neighbors]

            neighbors_prediction = [self.Y[j] for j in neighbors]

            Y_pred[index] = self._vote(neighbors_prediction)
        return Y_pred

The _vote method is a helper that is deliberately left unimplemented in the KNNBase class, since it differs for the classifier and regressor. Its job is to return a single prediction for a data point given its neighbors' targets.

The predict method is the powerhouse of the algorithm: it takes in the data points X_test whose targets are to be predicted. The first thing we do is create an empty array with the same number of elements as there are data points in X_test.

Next up, we loop through X_test, and for each data point the distances to every training example are calculated with the chosen distance function (euclidean by default).

These distances are then passed to the np.argsort function, which returns the indices that would sort the array rather than the sorted values themselves. This is useful because we need those indices to look up the targets of the corresponding nearest neighbors.
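
A quick illustration of np.argsort with made-up distances:

import numpy as np

distances = [0.9, 0.1, 0.5]
print(np.argsort(distances))  # [1 2 0]: index 1 holds the smallest distance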

The first k indices are then sliced out of the sorted array, and the targets at those positions are stored in neighbors_prediction. A vote is made on these values by calling the self._vote method, and the result fills the corresponding slot of the once-empty array Y_pred.

The complete KNNBase class is thus:

class KNNBase(BaseEstimator):
    def __init__(self, n_neighbors=5, distance_func=euclidean):
        self.n_neighbors = n_neighbors
        self.distance_func = distance_func

    def fit(self, X, Y):
        self.X = X
        self.Y = Y
        return self

    def _vote(self, neighbors_targets):
        raise NotImplementedError()

    def predict(self, X_test):
        """
        Predict the targets of the test data.
        """
        Y_pred = np.empty(X_test.shape[0])
        for index, entry in enumerate(X_test):
            # Distance from this test point to every training example
            distances = [self.distance_func(i, entry) for i in self.X]
            # Indices of the k closest training examples
            neighbors = np.argsort(distances)[: self.n_neighbors]

            neighbors_prediction = [self.Y[j] for j in neighbors]

            Y_pred[index] = self._vote(neighbors_prediction)
        return Y_pred

The Classifier

class KNNClassifier(KNNBase):

  def _vote(self, neighbors_target):
    count_ = np.bincount(neighbors_target)
    return np.argmax(count_)


This class inherits from the KNNBase class and implements the _vote method. In _vote, we take in the neighbors_target list and use np.bincount to count how often each class label occurs (np.bincount expects non-negative integer labels). We then return the label with the highest count using np.argmax. If there is a tie for the most common label among the neighbors, the predicted label is effectively arbitrary.
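
To see the vote in action, suppose the five nearest neighbors have the made-up labels below:

import numpy as np

neighbors_target = [2, 0, 2, 1, 2]
count_ = np.bincount(neighbors_target)  # array([1, 1, 3]): one 0, one 1, three 2s
print(np.argmax(count_))                # 2, the majority class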

The Regressor

class KNNRegressor(KNNBase):
  def _vote(self, neighbors_target):
    return np.mean(neighbors_target)

This class also inherits from the KNNBase class and implements the _vote method. Here, however, _vote takes in the neighbors_target list and uses the np.mean function to return the average of its values.
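
For example, with made-up neighbor targets:

import numpy as np

neighbors_target = [21.5, 23.0, 22.0, 24.5, 22.5]
print(np.mean(neighbors_target))  # 22.7
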
To be sure that our algorithm works, we will compare its results on the Boston housing dataset with those of the sklearn implementation. (Note that load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, so on recent versions you may need to substitute another regression dataset.)

from sklearn.datasets import load_boston
from sklearn.neighbors import KNeighborsRegressor

X, y = load_boston(return_X_y=True)

model = KNNRegressor().fit(X, y)
pred = model.predict(X[:10])
print("For the first 10 data points,")
print(f"Our implementation gives: {pred}")

model1 = KNeighborsRegressor().fit(X, y)
pred1 = model1.predict(X[:10])
print(f"The Sklearn Implementation gives: {pred1}")
For the first 10 data points,
Our implementation gives: [21.78 22.9  25.36 26.06 27.1  27.1  20.88 19.1  18.4  19.48]
The Sklearn Implementation gives: [21.78 22.9  25.36 26.06 27.1  27.1  20.88 19.1  18.4  19.48]


We have now successfully implemented the K Nearest Neighbors algorithm. Feel free to test it on any other dataset.
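
As one more check, here is a minimal sketch of testing the classifier against sklearn on the iris dataset (assuming the KNNClassifier defined above is in scope):

from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

ours = KNNClassifier().fit(X, y).predict(X[:10])
theirs = KNeighborsClassifier().fit(X, y).predict(X[:10])
print(ours)    # predictions from our implementation
print(theirs)  # sklearn's predictions; ties may be broken differently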

I hope you enjoyed reading this. If you have any questions, you can contact me on Twitter @phortz.

