Unsupervised Machine Learning: Non-Text Clustering with DBSCAN

wmisingo

Wilbert Misingo

Posted on January 12, 2023


Introduction

Unsupervised machine learning is a type of machine learning where the model is not provided with labeled training data. Instead, it must find patterns or relationships in the data on its own. There are different types of unsupervised learning, such as clustering and dimensionality reduction. Clustering, in particular, is the task of grouping similar examples together, without being provided with a specific target variable.

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based clustering algorithm. Unlike other clustering algorithms, such as k-means, DBSCAN does not require the number of clusters to be specified in advance. Instead, it automatically discovers the number of clusters based on the density of the data. DBSCAN works by defining a dense region as one where there are at least a specified number of examples within a certain distance (epsilon) of each other. These dense regions are then used as clusters.

The DBSCAN algorithm is implemented using the DBSCAN class from the sklearn.cluster module. It takes two main parameters: eps (epsilon) and min_samples. eps defines the radius of the neighborhood around a point, and min_samples defines the minimum number of points required to form a dense region. Calling the fit method on the data X computes the clustering, and the labels_ attribute holds the cluster label for each data point, with noise points labelled -1.
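Before the full walkthrough, here is a minimal sketch of that API in action; the tiny toy dataset below is made up purely for illustration, and the printed labels are indicative rather than guaranteed:

from sklearn.cluster import DBSCAN
import numpy as np

# A tiny, made-up 2D dataset: two tight groups plus one far-away outlier
points = np.array([
    [1.0, 1.0], [1.1, 0.9], [0.9, 1.1],
    [8.0, 8.0], [8.1, 7.9], [7.9, 8.1],
    [20.0, 20.0],  # far from everything, so likely labelled as noise (-1)
])

# eps is the neighborhood radius, min_samples the density threshold
model = DBSCAN(eps=0.5, min_samples=2).fit(points)
print(model.labels_)  # e.g. [ 0  0  0  1  1  1 -1]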

In this article, we will cover how to create clusters for non-text data using DBSCAN. Along the way, the following libraries will be used:

  • Scikit-learn

  • NumPy

  • Matplotlib

The process

  1. Importing libraries and modules
from sklearn.datasets import make_blobs
from sklearn.cluster import DBSCAN
import numpy as np
import matplotlib.pyplot as plt
  2. Defining model and data configurations
num_samples_total = 1000
cluster_centers = [(3,3), (7,7)]
num_classes = len(cluster_centers)
epsilon = 1.0
min_samples = 13

  3. Generating training data
X, y = make_blobs(n_samples=num_samples_total, centers=cluster_centers, n_features=num_classes, center_box=(0, 1), cluster_std=0.5)
  4. Saving the data for future use and loading it back
np.save('./clusters.npy', X)
X = np.load('./clusters.npy')
  5. Training the model
db = DBSCAN(eps=epsilon, min_samples=min_samples).fit(X)
labels = db.labels_
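If you want to peek inside the fitted model before counting clusters, the short optional sketch below inspects the core samples that DBSCAN identified; core_sample_indices_ and components_ are standard attributes of scikit-learn's DBSCAN estimator:

# Optional: inspect the fitted model
core_indices = db.core_sample_indices_  # indices of the core samples
core_points = db.components_            # coordinates of the core samples
print('Number of core samples:', len(core_indices))
print('Unique labels (noise is -1):', np.unique(labels))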
  6. Getting information about the clusters
no_clusters = len(np.unique(labels)) - (1 if -1 in labels else 0)  # noise (-1) is not a cluster
no_noise = np.sum(labels == -1)

print('Estimated no. of clusters: %d' % no_clusters)
print('Estimated no. of noise points: %d' % no_noise)
  7. Visualizing the clusters
colors = list(map(lambda x: '#3b4cc0' if x == 1 else '#b40426', labels))
plt.scatter(X[:,0], X[:,1], c=colors, marker="o", picker=True)
plt.title('Two clusters with data')
plt.xlabel('Axis X[0]')
plt.ylabel('Axis X[1]')
plt.show()
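Note that the color mapping above only distinguishes cluster 1 from everything else, so any noise points share a color with the other cluster. As an optional alternative (a sketch, not part of the original walkthrough), you could color points by their label and give noise its own color:

# Optional alternative: color each label explicitly, with noise (-1) in grey
palette = {0: '#3b4cc0', 1: '#b40426', -1: '#999999'}
colors = [palette.get(label, '#000000') for label in labels]
plt.scatter(X[:, 0], X[:, 1], c=colors, marker='o', picker=True)
plt.title('Two clusters with noise highlighted')
plt.xlabel('Axis X[0]')
plt.ylabel('Axis X[1]')
plt.show()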

That's all, I hope this helps!!

Do you have a project 🚀 that you want me to assist you with? Email me 🤝😊: wilbertmisingo@gmail.com
Have a question, or want to be the first to know about my posts?
Follow βœ… me on Twitter/X 𝕏
Follow βœ… me on LinkedIn πŸ’Ό

πŸ’– πŸ’ͺ πŸ™… 🚩
wmisingo
Wilbert Misingo

Posted on January 12, 2023

Join Our Newsletter. No Spam, Only the good stuff.

Sign up to receive the latest update from our blog.

Related