K-Means Clustering: A Step-by-Step Guide🤖📊
Anand
Posted on November 20, 2024
Hello, Data Enthusiasts! 👋
When diving into the world of Unsupervised Learning, we encounter tasks where we aim to find hidden patterns in data without having explicit labels. One of the most popular techniques for such tasks is Clustering. Today, let’s look at K-Means Clustering and how we can implement it with a hands-on Python example! 🚀
What is Unsupervised Learning? 🤔
Unsupervised learning is a type of machine learning where the model is provided with data that has no labels. The goal here is to uncover patterns, structures, or relationships within the data. The model tries to learn from the input data without any guidance on what the output should be.
Examples include clustering, anomaly detection, and dimensionality reduction.
What is Clustering? 👥
Clustering is an unsupervised learning technique that groups data points based on their similarities. The most common clustering algorithm is K-Means, where the "K" represents the number of clusters you want to divide your data into.
K-Means Clustering Algorithm: Steps 📝
- Initialize centroids: Choose K random points from the data as the initial centroids (cluster centers).
- Assign data points to clusters: Each data point is assigned to its nearest centroid, forming K clusters.
- Update centroids: After assignment, recompute each centroid as the mean of the data points in its cluster.
- Repeat: Steps 2 and 3 are repeated until the centroids stop moving, i.e. the algorithm converges (see the NumPy sketch right after this list).
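To make these steps concrete, here is a minimal from-scratch sketch in NumPy. It is illustrative only: it assumes no cluster ever ends up empty, and it uses plain random initialization rather than the smarter k-means++ scheme we will use later with scikit-learn.

import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: pick K distinct data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 2: assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned points
        # (assumes no cluster ends up empty)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

In practice you would reach for scikit-learn's KMeans, which handles initialization and convergence for you, and that is exactly what we do next.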
Now, Let’s Dive Into the Code 💻
Here’s an example of implementing K-Means Clustering using Python. I'll walk you through every step and explain what's happening at each stage!
Step 1: Importing Libraries 🧑🔬
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from warnings import filterwarnings
filterwarnings('ignore')
We start by importing essential libraries.
- matplotlib and seaborn are for visualizing the data.
- pandas and numpy help us handle data.
- filterwarnings is used to suppress any warnings in our code.
Step 2: Creating Synthetic Data 💡
from sklearn.datasets import make_blobs
x, y = make_blobs(n_samples=1000, centers=3, n_features=2)
plt.scatter(x[:, 0], x[:, 1], c=y)
Here, we generate a synthetic dataset with 1000 samples and 3 centers (clusters).
- make_blobs creates 2D data with separable clusters.
- plt.scatter helps us visualize the data points, with colors indicating the true cluster labels (K-Means itself never sees these labels).
Output:
A scatter plot shows 3 distinct clusters.
Step 3: Standardizing the Data 🔄
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=42)

scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(x_train)
x_test_scaled = scaler.transform(x_test)
- Standardization is crucial in K-Means because the algorithm relies on distances: features on larger scales would otherwise dominate the clustering.
- train_test_split splits the data into training and testing sets.
- StandardScaler standardizes each feature to zero mean and unit variance; note we fit it on the training set only and reuse it to transform the test set.
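As a quick sanity check (a small sketch using the variables above), the scaled training features should now have per-column mean ≈ 0 and standard deviation ≈ 1:

print(x_train_scaled.mean(axis=0))  # close to [0., 0.]
print(x_train_scaled.std(axis=0))   # close to [1., 1.]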
Step 4: Elbow Method for Optimal K 🧩
from sklearn.cluster import KMeans
wcss = []
for k in range(1, 11):
    kmean = KMeans(n_clusters=k, init='k-means++')
    kmean.fit(x_train_scaled)
    wcss.append(kmean.inertia_)
In this part, we use the Elbow Method to determine the optimal number of clusters (K). The inertia, also called the within-cluster sum of squares (WCSS), is the sum of squared distances from each point to its nearest centroid; lower inertia means tighter clusters.
- KMeans fits the data for different values of K (from 1 to 10).
- We store the inertia values in the list wcss to evaluate the "elbow."
Output (wcss):
[1499.99, 594.74, 65.69, 58.75, 51.76, 41.96, 37.2, 34.92, 29.05, 27.66]
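Incidentally, inertia_ really is just this sum of squared distances, which we can verify by hand. A quick sketch, using kmean as left by the loop above (i.e. the last model fitted, with K = 10):

from sklearn.metrics import pairwise_distances_argmin_min

# distance from each point to its nearest centroid
_, dists = pairwise_distances_argmin_min(x_train_scaled, kmean.cluster_centers_)
print((dists ** 2).sum())  # matches kmean.inertia_
print(kmean.inertia_)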
Step 5: Plotting the Elbow Curve 📉
plt.plot(range(1, 11), wcss)
plt.xticks(range(1, 11))
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
The Elbow Curve helps us determine the best K.
- As K increases, inertia decreases.
- The "elbow" is where the decrease in inertia starts to slow down, suggesting the optimal K.
Step 6: Knee Locator for K Value 📍
from kneed import KneeLocator
kl = KneeLocator(range(1, 11), wcss, curve='convex', direction='decreasing')
kl.elbow
Using KneeLocator, we can find the "elbow" point in the curve.
- The elbow attribute returns the optimal K based on the inertia curve.
Output:
3
Thus, the optimal number of clusters is 3! 🎉
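Since both the elbow curve and KneeLocator agree on K = 3, we can fit the final model with that K and plot the resulting clusters. A minimal sketch, reusing the scaled training data from above (random_state=42 is an assumption added here for reproducibility):

final_kmean = KMeans(n_clusters=3, init='k-means++', random_state=42)
labels = final_kmean.fit_predict(x_train_scaled)

# points colored by their assigned cluster, centroids marked with red crosses
plt.scatter(x_train_scaled[:, 0], x_train_scaled[:, 1], c=labels)
plt.scatter(final_kmean.cluster_centers_[:, 0],
            final_kmean.cluster_centers_[:, 1],
            c='red', marker='x')

If all went well, the assigned clusters should closely mirror the three blobs we generated in Step 2.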
Step 7: Silhouette Score for Validation 🌟
from sklearn.metrics import silhouette_score
silhouette_coefficients = []
for k in range(2, 11):
    kmean = KMeans(n_clusters=k, init='k-means++')
    kmean.fit(x_train_scaled)
    score = silhouette_score(x_train_scaled, kmean.labels_)
    silhouette_coefficients.append(score)
plt.plot(range(2, 11), silhouette_coefficients)
plt.xticks(range(2, 11))
plt.xlabel('Number of clusters')
plt.ylabel('Silhouette Coefficients')
The Silhouette Score measures how similar each point is to its own cluster compared with the nearest neighboring cluster (the loop starts at K = 2 because the score is undefined for a single cluster).
- Scores range from -1 to 1; higher scores indicate dense, well-separated clusters.
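Concretely, for each point i, let a(i) be the mean distance to the other points in its own cluster, and b(i) the mean distance to the points in the nearest other cluster. Then:

s(i) = (b(i) - a(i)) / max(a(i), b(i))

The reported silhouette score is simply the average of s(i) over all points, so a value near 1 means points sit much closer to their own cluster than to any other.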
Conclusion: K-Means in Action! 🚀
- K-Means Clustering is a powerful technique to group similar data points into K clusters.
- The Elbow Method and Silhouette Score are effective for determining the optimal K.
- By using scikit-learn, you can easily implement K-Means, visualize results, and evaluate the quality of your clustering.
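As a compact recap, the whole workflow can be condensed into a single scikit-learn pipeline. A minimal sketch (the explicit n_init=10 and random_state=42 are assumptions added to keep results stable across scikit-learn versions):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# scale, then cluster, in one call
pipe = make_pipeline(StandardScaler(), KMeans(n_clusters=3, n_init=10, random_state=42))
labels = pipe.fit_predict(x)

Bundling the scaler and the clusterer this way guarantees the exact same scaling is applied whenever you cluster new data.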
So, next time you're working with unsupervised data, try K-Means Clustering and see how well it works for your dataset! 😎
Happy clustering! 🎉