Unsupervised Machine Learning: Non-Text Clustering with DBSCAN
Wilbert Misingo
Posted on January 12, 2023
Introduction
Unsupervised machine learning is a type of machine learning where the model is not provided with labeled training data. Instead, it must find patterns or relationships in the data on its own. There are different types of unsupervised learning, such as clustering and dimensionality reduction. Clustering, in particular, is the task of grouping similar examples together, without being provided with a specific target variable.
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based clustering algorithm. Unlike other clustering algorithms, such as k-means, DBSCAN does not require the number of clusters to be specified in advance. Instead, it automatically discovers the number of clusters based on the density of the data. DBSCAN works by defining a dense region as one where there are at least a specified number of examples within a certain distance (epsilon) of each other. These dense regions are then used as clusters.
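The density criterion above can be illustrated with a small sketch (illustrative toy points, not from the original walkthrough): a point is a "core point" when at least `min_samples` points, including itself, lie within distance epsilon of it.

```python
import numpy as np

# Toy illustration of DBSCAN's density criterion (not the full algorithm):
# a point is a "core point" if at least min_samples points (including itself)
# lie within distance epsilon of it.
points = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0]])
epsilon = 0.5
min_samples = 3

def is_core_point(i, points, epsilon, min_samples):
    # Count how many points fall within epsilon of points[i]
    distances = np.linalg.norm(points - points[i], axis=1)
    return np.sum(distances <= epsilon) >= min_samples

print([is_core_point(i, points, epsilon, min_samples) for i in range(len(points))])
# The three nearby points are core points; the isolated point is not.
```

DBSCAN then grows clusters outward from core points, so isolated points that belong to no dense region end up labeled as noise.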
The DBSCAN algorithm is implemented using the DBSCAN class from the sklearn.cluster module. It takes two key parameters: eps (epsilon) and min_samples. The eps parameter defines the radius of the neighborhood around a point, and min_samples defines the number of points required to form a dense region. The fit method is applied on X to compute the clustering, and the labels_ attribute then contains the cluster label for each data point (with -1 denoting noise).
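As a quick illustration of that API (a minimal sketch with made-up toy values, separate from the dataset used later in this article):

```python
from sklearn.cluster import DBSCAN
import numpy as np

# Two tight groups plus one far-away outlier (illustrative values)
X = np.array([[1.0, 1.0], [1.1, 1.0], [0.9, 1.1],
              [8.0, 8.0], [8.1, 8.1], [7.9, 8.0],
              [20.0, 20.0]])

# eps is the neighborhood radius; min_samples the density threshold
db = DBSCAN(eps=0.5, min_samples=3).fit(X)
print(db.labels_)  # [ 0  0  0  1  1  1 -1]: two clusters, outlier labeled -1
```

Note that the number of clusters (two here) was discovered from the data's density, not passed in as a parameter.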
In this article, we will cover how to create clusters for non-text data using DBSCAN. Along the way, the following libraries will be used:
Scikit-learn
NumPy
Matplotlib
The process
- Importing libraries and modules
from sklearn.datasets import make_blobs
from sklearn.cluster import DBSCAN
import numpy as np
import matplotlib.pyplot as plt
- Defining model and data configurations
num_samples_total = 1000
cluster_centers = [(3,3), (7,7)]
num_classes = len(cluster_centers)
epsilon = 1.0
min_samples = 13
- Generating training data
# n_features must match the dimensionality of the centers (2-D here);
# center_box is omitted since it is ignored when explicit centers are given
X, y = make_blobs(n_samples=num_samples_total, centers=cluster_centers, n_features=2, cluster_std=0.5)
- Saving the data for future use and reloading it
np.save('./clusters.npy', X)
X = np.load('./clusters.npy')
- Training the model
db = DBSCAN(eps=epsilon, min_samples=min_samples).fit(X)
labels = db.labels_
- Getting information about the clusters
# The label -1 marks noise, so it must not be counted as a cluster
no_clusters = len(np.unique(labels)) - (1 if -1 in labels else 0)
no_noise = np.sum(labels == -1)
print('Estimated no. of clusters: %d' % no_clusters)
print('Estimated no. of noise points: %d' % no_noise)
- Visualizing the clusters
# Map cluster 0 to blue, cluster 1 to red, and noise (-1) to black,
# so noise points are not lumped in with a cluster
color_map = {0: '#3b4cc0', 1: '#b40426', -1: '#000000'}
colors = [color_map.get(label, '#000000') for label in labels]
plt.scatter(X[:,0], X[:,1], c=colors, marker="o", picker=True)
plt.title('Two clusters with data')
plt.xlabel('Axis X[0]')
plt.ylabel('Axis X[1]')
plt.show()
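The epsilon and min_samples values above were set by hand. As a side note (not part of the original walkthrough), a common heuristic for choosing eps is to sort every point's distance to its k-th nearest neighbor (with k equal to min_samples) and look for the "elbow" in the resulting curve:

```python
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors
import matplotlib.pyplot as plt
import numpy as np

# Same data configuration as in the article above
X, _ = make_blobs(n_samples=1000, centers=[(3, 3), (7, 7)], cluster_std=0.5)

# Distance from each point to its k-th nearest neighbor, k = min_samples
k = 13
nbrs = NearestNeighbors(n_neighbors=k).fit(X)
distances, _ = nbrs.kneighbors(X)
k_distances = np.sort(distances[:, -1])

# The "elbow" of this curve is a reasonable candidate for eps
plt.plot(k_distances)
plt.xlabel('Points sorted by k-distance')
plt.ylabel('Distance to %dth nearest neighbor' % k)
plt.show()
```

Points well inside a dense region have small k-distances, while noise points have large ones, so the bend in the curve roughly separates the two regimes.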
That's all. I hope this helps!
Do you have a project you'd like me to assist with? Email me: wilbertmisingo@gmail.com
Have a question, or want to be the first to know about my posts? Follow me on Twitter/X or connect with me on LinkedIn.