Cluster Analysis using K-means
Avinash Gupta
Posted on October 30, 2022
Introduction:
The k-means algorithm searches for a predetermined number of clusters in an unlabeled multidimensional dataset. It accomplishes this using a simple notion of what an optimal cluster looks like.
The concept rests on two ideas:
- First, the cluster center is the arithmetic mean (AM) of all the data points belonging to that cluster.
- Second, each point is closer to its own cluster center than to any other cluster center. These two assumptions are the foundation of the k-means clustering model.
The center can be thought of as a data point representing the mean of its cluster, although it may not itself be a member of the dataset.
In simple terms, k-means clustering lets us group the data by discovering the distinct categories in an unlabeled dataset on its own, without any training labels.
It is a centroid-based algorithm: each cluster is associated with a centroid, and the objective is to minimize the sum of squared distances between the data points and their cluster centroids.
Specifically, the k-means algorithm performs two tasks:
- Determines the positions of the K center points (centroids) by an iterative method.
- Assigns every data point to its nearest center, and the data points closest to a particular center form a cluster. As a result, data points within each cluster are similar to one another and dissimilar from those in other clusters. A minimal sketch of these two tasks is shown below.
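As a quick illustration of these two tasks, here is a minimal sketch using scikit-learn's KMeans on synthetic data (generated with make_blobs, not the Add Health data analyzed later in this post):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# synthetic example data: 300 points scattered around 3 hypothetical centers
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# fit k-means with a preplanned number of clusters (K = 3)
km = KMeans(n_clusters=3, random_state=42)
labels = km.fit_predict(X)

print(km.cluster_centers_)   # one centroid (arithmetic mean) per cluster
print(labels[:10])           # each point is assigned to its nearest centroid
print(km.inertia_)           # sum of squared distances to the assigned centroids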
Explanation:
K-means is essentially an application of the Expectation-Maximization (E-M) algorithm, a powerful approach that appears in a variety of contexts in data science. The E-M procedure here consists of the following steps:
- Guess some initial cluster centers.
- Repeat until converged:
- E-step: assign each data point to the nearest cluster center.
- M-step: set each cluster center to the mean of the points assigned to it.
Here the E-step is the Expectation step: it updates our expectation of which cluster each data point belongs to.
The M-step is the Maximization step: it maximizes a fitness function that defines the locations of the cluster centers; for k-means, this maximization is carried out by taking the mean of the data points in each cluster.
Under typical circumstances, each repetition of the E-step and M-step will result in a better estimate of the clusters' characteristics.
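To make the E-step and M-step concrete, here is a minimal, from-scratch sketch of the loop described above (illustrative only; the analysis below uses scikit-learn's KMeans):

import numpy as np

def simple_kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # guess some initial cluster centers by picking k random points
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # E-step: assign each point to its nearest cluster center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # M-step: move each center to the mean of its assigned points
        # (this sketch assumes no cluster ever becomes empty)
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):  # repeat until converged
            break
        centers = new_centers
    return centers, labels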
K-means uses an iterative procedure to produce its final clustering based on a predefined number of clusters, chosen according to the dataset and represented by the variable K.
For instance, if K is set to 3, the dataset is partitioned into 3 clusters; if K is 4, the number of clusters will be 4, and so on.
The fundamental aim is to define K centers, one for each cluster. These centers must be placed carefully, because different initial placements can lead to different outcomes, so it is best to put them as far away from each other as possible.
Also, the maximum number of plausible clusters equals the total number of observations in the dataset.
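Because the final clustering depends on where the initial centers are placed, a common mitigation (a sketch of standard scikit-learn options, not something specific to this dataset) is to use the k-means++ seeding scheme together with several random restarts, keeping the best run:

from sklearn.cluster import KMeans

# k-means++ spreads the initial centers far apart; n_init repeats the whole
# procedure from several random starts and keeps the run with the lowest
# within-cluster sum of squares (inertia)
km = KMeans(n_clusters=3, init='k-means++', n_init=10, random_state=0)
# km.fit(X)  # X: any numeric feature matrix, e.g. the standardized variables below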
Code:
from pandas import Series, DataFrame
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.cluster import KMeans
from scipy.spatial.distance import cdist
from sklearn.decomposition import PCA
import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi
%matplotlib inline

# fixed random state for reproducible train/test splits
RND_STATE = 55121
Loading Data:
# load the Add Health data and standardize the column names to upper case
data = pd.read_csv("data/tree_addhealth.csv")
data.columns = map(str.upper, data.columns)

# drop observations with missing values and keep the clustering variables
data_clean = data.dropna()
cluster = data_clean[['ALCEVR1', 'MAREVER1', 'ALCPROBS1', 'DEVIANT1', 'VIOL1',
                      'DEP1', 'ESTEEM1', 'SCHCONN1', 'PARACTV', 'PARPRES', 'FAMCONCT']]
cluster.describe()
Preprocessing Data:
clustervar = cluster.copy()

# standardize each clustering variable to have mean 0 and standard deviation 1
for col in clustervar.columns:
    clustervar[col] = preprocessing.scale(clustervar[col].astype('float64'))
# split the observations into training (70%) and test (30%) sets
clus_train, clus_test = train_test_split(clustervar, test_size=0.3, random_state=RND_STATE)
K-means Analysis for 1 to 9 Clusters:
clusters = range(1, 10)
meandist = []
for k in clusters:
    model = KMeans(n_clusters=k)
    model.fit(clus_train)
    clusassign = model.predict(clus_train)
    # average distance of each observation to its nearest cluster center
    meandist.append(sum(np.min(cdist(clus_train, model.cluster_centers_, 'euclidean'), axis=1))
                    / clus_train.shape[0])
Relation between Number of Clusters and Average Distance:
plt.plot(clusters, meandist)
plt.xlabel('Number of clusters')
plt.ylabel('Average distance')
plt.title('Selecting k with the Elbow Method')
plt.show()
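The elbow plot is one heuristic for choosing k. As a complementary check (not part of the original analysis above), the average silhouette score can be computed for each candidate k; higher values indicate better-separated clusters:

from sklearn.metrics import silhouette_score

# silhouette scores are only defined for 2 or more clusters
for k in range(2, 10):
    model = KMeans(n_clusters=k)
    labels = model.fit_predict(clus_train)
    print(k, silhouette_score(clus_train, labels))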
Elbow Plot Output:
Solution for the 3-Cluster Model:
model3 = KMeans(n_clusters=3)
model3.fit(clus_train)
clusassign = model3.predict(clus_train)

# reduce the standardized variables to 2 components for plotting
pca_2 = PCA(2)
plot_columns = pca_2.fit_transform(clus_train)
plt.scatter(x=plot_columns[:, 0], y=plot_columns[:, 1], c=model3.labels_)
plt.xlabel('Canonical variable 1')
plt.ylabel('Canonical variable 2')
plt.title('Scatterplot of Canonical Variables for 3 Clusters')
plt.show()
# pair each training observation's original index with its cluster label
clus_train.reset_index(level=0, inplace=True)
cluslist = list(clus_train['index'])
labels = list(model3.labels_)
newlist = dict(zip(cluslist, labels))
newclus = DataFrame.from_dict(newlist, orient='index')
newclus.columns = ['cluster']
newclus.describe()
Merging Cluster Assignments and Computing Cluster Means:
# merge the cluster assignments back onto the training data
newclus.reset_index(level=0, inplace=True)
merged_train = pd.merge(clus_train, newclus, on='index')
merged_train.head(n=100)
merged_train.cluster.value_counts()

# clustering variable means by cluster
clustergrp = merged_train.groupby('cluster').mean()
print("Clustering variable means by cluster")
print(clustergrp)
Output of Cluster Variable means:
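To check whether the clusters differ on a variable that was not used to create them, GPA (GPA1) is merged with the cluster assignments below and compared across clusters using an ANOVA (OLS with cluster as a categorical predictor) followed by a Tukey HSD post-hoc test.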
# GPA (not used in the clustering) serves as an external validation variable
gpa_data = data_clean['GPA1']

# use the same random state so the GPA rows line up with the clustering split
gpa_train, gpa_test = train_test_split(gpa_data, test_size=0.3, random_state=RND_STATE)
gpa_train1 = pd.DataFrame(gpa_train)
gpa_train1.reset_index(level=0, inplace=True)
merged_train_all = pd.merge(gpa_train1, merged_train, on='index')
sub1 = merged_train_all[['GPA1', 'cluster']].dropna()

# ANOVA testing whether mean GPA differs across clusters
gpamod = smf.ols(formula='GPA1 ~ C(cluster)', data=sub1).fit()
print(gpamod.summary())

print('means for GPA by cluster')
m1 = sub1.groupby('cluster').mean()
print(m1)

print('standard deviations for GPA by cluster')
m2 = sub1.groupby('cluster').std()
print(m2)

# Tukey HSD post-hoc comparisons between clusters
mc1 = multi.MultiComparison(sub1['GPA1'], sub1['cluster'])
res1 = mc1.tukeyhsd()
print(res1.summary())
Output for Comparison of means: