K-Means Clustering: A Step-by-Step Guide🤖📊

Anand (kammarianand) · Posted on November 20, 2024

Hello, Data Enthusiasts! 👋

When diving into the world of Unsupervised Learning, we encounter tasks where we aim to find hidden patterns in data without having explicit labels. One of the most popular techniques for such tasks is Clustering. Today, let’s look at K-Means Clustering and how we can implement it with a hands-on Python example! 🚀


What is Unsupervised Learning? 🤔

Unsupervised learning is a type of machine learning where the model is provided with data that has no labels. The goal here is to uncover patterns, structures, or relationships within the data. The model tries to learn from the input data without any guidance on what the output should be.

Examples include clustering, anomaly detection, and dimensionality reduction.


What is Clustering? 🧑‍🤝‍🧑

Clustering is an unsupervised learning technique that groups data points based on their similarities. The most common clustering algorithm is K-Means, where the "K" represents the number of clusters you want to divide your data into.


K-Means Clustering Algorithm: Steps 📝

  1. Initialize centroids: Choose K random points in the data as the initial centroids (cluster centers).
  2. Assign data points to clusters: Each data point is assigned to the nearest centroid, creating K clusters.
  3. Update centroids: After assignment, calculate the new centroids based on the mean of the data points in each cluster.
  4. Repeat: Steps 2 and 3 are repeated until the centroids stop moving (or move less than a small tolerance), i.e., the algorithm converges. A minimal from-scratch sketch of this loop follows below.
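
Before we reach for scikit-learn, here is a minimal from-scratch sketch of that loop in NumPy, just to make the four steps concrete. This is my own illustrative code (the function name kmeans_naive and the tolerance value are arbitrary choices, not from any library):

import numpy as np

def kmeans_naive(X, k, max_iters=100, tol=1e-4, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: pick K random data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 2: assign each point to its nearest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its assigned points
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop once the centroids barely move (convergence)
        if np.linalg.norm(new_centroids - centroids) < tol:
            centroids = new_centroids
            break
        centroids = new_centroids
    return labels, centroids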

Now, Let’s Dive Into the Code 💻

Here’s an example of implementing K-Means Clustering using Python. I'll walk you through every step and explain what's happening at each stage!


Step 1: Importing Libraries 🧑‍🔬

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Suppress warning messages to keep the notebook output clean
from warnings import filterwarnings
filterwarnings('ignore')

We start by importing essential libraries.

  • matplotlib and seaborn are for visualizing the data.
  • pandas and numpy help us handle data.
  • filterwarnings is used to suppress any warnings in our code.

Step 2: Creating Synthetic Data 💡

from sklearn.datasets import make_blobs

# Generate 1,000 2D points around 3 centers; fixed seed for reproducibility
x, y = make_blobs(n_samples=1000, centers=3, n_features=2, random_state=42)
plt.scatter(x[:, 0], x[:, 1], c=y)  # color each point by its true cluster label


Here, we generate a synthetic dataset with 1000 samples and 3 centers (clusters).

  • make_blobs creates 2D data with separable clusters.
  • plt.scatter helps us visualize the data points, with colors indicating the actual clusters.

Output:
A scatter plot shows 3 distinct clusters.


Step 3: Standardizing the Data 🔄

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hold out 25% of the data as a test set
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=42)

# Fit the scaler on the training data only, then apply it to both sets
scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(x_train)
x_test_scaled = scaler.transform(x_test)
  • Standardization is crucial for K-Means because the algorithm relies on Euclidean distances: features on larger scales would otherwise dominate the clustering.
  • train_test_split splits the data into training (75%) and testing (25%) sets.
  • StandardScaler transforms each feature to zero mean and unit variance. Note that we fit it on the training set only and reuse it on the test set, to avoid data leakage. A quick sanity check follows below.
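
If you want to verify what the scaler did, here is a quick check (my own addition, not part of the original walkthrough). Each feature of the scaled training data should have a mean of roughly 0 and a standard deviation of roughly 1:

# Per-feature mean and standard deviation after scaling
print(x_train_scaled.mean(axis=0).round(4))  # ~ [0. 0.]
print(x_train_scaled.std(axis=0).round(4))   # ~ [1. 1.]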

Step 4: Elbow Method for Optimal K 🧩

from sklearn.cluster import KMeans

wcss = []  # within-cluster sum of squares (inertia) for each K

for k in range(1, 11):
    kmean = KMeans(n_clusters=k, init='k-means++', n_init=10)
    kmean.fit(x_train_scaled)  # unsupervised: no labels are passed
    wcss.append(kmean.inertia_)

In this part, we use the Elbow Method to determine the optimal number of clusters (K). The inertia, also called WCSS (within-cluster sum of squares), is the sum of squared distances from each point to its assigned centroid; lower inertia means tighter clusters.

  • KMeans fits the data for each value of K from 1 to 10.
  • We store the inertia values in the list wcss so we can look for the "elbow."

Output (wcss):

[1499.99, 594.74, 65.69, 58.75, 51.76, 41.96, 37.2, 34.92, 29.05, 27.66]

Step 5: Plotting the Elbow Curve 📉

# Plot inertia (WCSS) against K and look for the "elbow"
plt.plot(range(1, 11), wcss)
plt.xticks(range(1, 11))
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')

(Figure: the elbow curve. WCSS falls steeply up to K = 3, then flattens.)

The Elbow Curve helps us determine the best K.

  • As K increases, inertia decreases.
  • The "elbow" is where the decrease in inertia starts to slow down, suggesting the optimal K.

Step 6: Knee Locator for K Value

from kneed import KneeLocator

# Locate the point of maximum curvature on the decreasing, convex curve
kl = KneeLocator(range(1, 11), wcss, curve='convex', direction='decreasing')
kl.elbow

Using KneeLocator (from the kneed package), we can find the "elbow" point of the curve programmatically instead of eyeballing the plot.

  • kl.elbow returns the K at which the inertia curve bends most sharply, i.e., the optimal K.

Output:

3

Thus, the optimal number of clusters is 3! 🎉


Step 7: Silhouette Score for Validation 🌟

from sklearn.metrics import silhouette_score

silhouette_coefficients = []

# The silhouette score needs at least 2 clusters, so K starts at 2
for k in range(2, 11):
    kmean = KMeans(n_clusters=k, init='k-means++', n_init=10)
    kmean.fit(x_train_scaled)
    score = silhouette_score(x_train_scaled, kmean.labels_)
    silhouette_coefficients.append(score)

plt.plot(range(2, 11), silhouette_coefficients)
plt.xticks(range(2, 11))
plt.xlabel('Number of clusters')
plt.ylabel('Silhouette Coefficient')

(Figure: silhouette score plotted against the number of clusters.)

The Silhouette Score measures how close each point is to the other points in its own cluster compared to points in the nearest neighboring cluster; it ranges from -1 to 1.

  • Higher scores indicate dense, well-separated clusters, so the K with the highest score is a strong candidate. With K confirmed, we can fit the final model, as sketched below.
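
To close the loop, here is one possible final step (my own addition, not part of the original walkthrough): fit K-Means with the chosen K = 3 and visualize the predicted clusters along with their centroids. The variable names final_kmeans and final_labels are arbitrary:

# Fit the final model with the K chosen above
final_kmeans = KMeans(n_clusters=3, init='k-means++', n_init=10, random_state=42)
final_labels = final_kmeans.fit_predict(x_train_scaled)

# Points colored by predicted cluster; centroids marked with red X's
plt.scatter(x_train_scaled[:, 0], x_train_scaled[:, 1], c=final_labels)
plt.scatter(final_kmeans.cluster_centers_[:, 0],
            final_kmeans.cluster_centers_[:, 1],
            marker='X', s=200, c='red')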

Conclusion: K-Means in Action! 🚀

  • K-Means Clustering is a powerful technique to group similar data points into K clusters.
  • The Elbow Method and Silhouette Score are effective for determining the optimal K.
  • By using scikit-learn, you can easily implement K-Means, visualize results, and evaluate the quality of your clustering.

So, next time you're working with unsupervised data, try K-Means Clustering and see how well it works for your dataset! 😎


Happy clustering! 🎉


About Me:
🖇️LinkedIn
🧑‍💻GitHub
