K-Means Clustering: Part 1 of 3
Having looked at clustering in general, and having heard that K-Means is one of the simplest and most popular clustering algorithms, let us take an example and walk through how it works. We also need to understand its limitations, how to overcome them, and its advantages and disadvantages.
I plan to explain the basics of K-Means clustering in a 3-part series. The first part, that is, this post, will take an example with very few data points and show how the clusters are formed.
The 2nd part will discuss the cost function that is minimised in the K-Means algorithm, with simple mathematical representations of the various steps.
The final, 3rd part will cover practical considerations for the K-Means algorithm.
A Brief Recap
In K-Means, similarity between data points is decided by their distances to a centroid; hence it is called a centroid-based clustering algorithm. If we want K clusters, we start with K randomly chosen centroids. Then, the distance of each point to these centroids is calculated, and each point is associated with the centroid it is nearest to.
To recollect, the centroid of a set of points is another point, with its own x and y coordinates, that is the geometric centre of all the given points. It is calculated by taking the mean of all the x values to give the x of the centroid, and similarly the mean of all the y values to give the y of the centroid. This assumes the given points have only two dimensions, x and y; the same idea extends to n dimensions.
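As a quick illustration, the centroid of a small set of 2-D points can be computed like this (a minimal sketch; the points here are made up purely for illustration):

```python
# Centroid of a set of 2-D points: the mean of the x's and the mean of the y's.
points = [(2, 4), (4, 6), (6, 8)]  # hypothetical points, for illustration only

cx = sum(x for x, _ in points) / len(points)
cy = sum(y for _, y in points) / len(points)

print((cx, cy))  # -> (4.0, 6.0)
```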
Getting into the Details
K-Means is an iterative algorithm that keeps adjusting the clusters it creates until it finds a stable set of clusters, where points no longer move from one cluster to another.
The steps of this algorithm can be detailed as follows:
1. We start with 'K' random points as the initial cluster centres.
2. Each point in the data set is assigned to the centre it is closest to, most often by Euclidean distance.
3. Once a cluster is formed, a new centroid is calculated for it, i.e. the mean of all the cluster members.
4. Distances are then recalculated from every data point to the new centroids, and points are reassigned based on minimum distance.
5. Steps 3 and 4 are repeated till points no longer move from one cluster to another and the centroids no longer shift appreciably.
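The steps above can be sketched in plain Python. This is a minimal, illustrative implementation, not production code; the data points and starting centroids below are made up for demonstration:

```python
import math

def kmeans(points, centroids, max_iters=100):
    """Plain-Python K-Means sketch: repeat the assignment and
    centroid-recalculation steps until no point changes cluster."""
    assignments = None
    for _ in range(max_iters):
        # Assignment step: attach each point to its nearest centroid.
        new_assignments = [
            min(range(len(centroids)),
                key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        if new_assignments == assignments:  # stable: no point moved
            break
        assignments = new_assignments
        # Optimisation step: recompute each centroid as the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == k]
            if members:  # keep the old centroid if a cluster went empty
                centroids[k] = tuple(sum(c) / len(members)
                                     for c in zip(*members))
    return centroids, assignments

# Hypothetical data and starting centroids, purely for illustration.
data = [(1, 2), (2, 1), (2, 3), (8, 8), (9, 9), (8, 10)]
final_centroids, labels = kmeans(data, [(0, 0), (10, 10)])
print(final_centroids, labels)
```

Here the loop stops as soon as an assignment pass leaves every point in the same cluster as before, which mirrors the stopping condition in step 5.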
Steps through an Example
Let us take an example set of data points and see how this is done.
For the above data set, we pick two centroids at (10, 8) and (18, 6) at random to begin with, keeping in mind the range of the data points. These are represented by the yellow points in Figure 1.
Figure 1: Two clusters are formed around random centroids
Based on the Euclidean distance from every point to these 2 centroids, two clusters have been formed, shown in red and green: 7 points in the red cluster and 4 points in the green cluster. This is called the assignment step.
The Euclidean distance between two points (x1, y1) and (x2, y2) is given by d = √((x2 − x1)² + (y2 − y1)²).
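In code, the distance from a point to a centroid follows directly from this formula. The centroid (10, 8) is the one from the example; the data point is made up for illustration:

```python
import math

def euclidean(p, q):
    """Straight-line distance between two 2-D points."""
    return math.sqrt((q[0] - p[0]) ** 2 + (q[1] - p[1]) ** 2)

centroid = (10, 8)  # centroid 1 from the example
point = (6, 5)      # hypothetical data point, for illustration only
print(euclidean(point, centroid))  # -> 5.0
```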
Once these clusters are formed, we realise that the current centroids are not truly the geometric centres of their respective clusters. So, we find the new centroid of cluster 1 by taking the mean of the x and y values of its 7 points. That value turns out to be (6.7, 6.1), as shown in Figure 2.
Similarly, the new centroid for cluster 2, the green points, shifts from (18, 6) to (17.5, 4.5). This is called the optimisation step.
Figure 2: Centroids recalculated and clusters reassigned in Step 3
Since the centroids have shifted, it is worth recalculating the distances of all the points to the new centroids. On doing so, we see that one of the earlier red points is now closer to centroid 2 (the green cluster's centroid) and is therefore reassigned to cluster 2 at the end of this cycle.
Figure 3: Newly calculated centroids and clusters - a second time
We repeat steps 3 and 4 (the assignment and optimisation steps) and get the plot shown in Figure 3.
Here, centroid 1 has slightly shifted from (6.7, 6.1) to (5.6, 5.4) and centroid 2 has shifted from (17.5, 4.5) to (16.6, 5.6).
However, on recalculating the distances of all points to the new centroids, no points move from one cluster to another.
Figure 4: Newly calculated centroids and clusters a third time
We repeat the process once more to see if the centroids move or the clusters change.
And voila, neither changes, as shown in Figure 4. This means we have reached an equilibrium stage and the clusters are stable.
This is how K-Means clustering works!
We simply cluster around centroid points by calculating the Euclidean distances of all data points to the centroids, and keep readjusting the centroids till they move no further. Looks very simple, doesn't it? It is easily automatable and can be represented mathematically too.
K-Means clustering is an unsupervised machine learning algorithm that essentially works through the 5 steps above, of which the two steps of "assignment" of points to clusters and "optimisation" of the clusters by calculating new centroids are iteratively repeated till we reach stable clusters.
The stability of clusters is defined by the points not jumping across clusters and the centroids not changing significantly in subsequent iterations. The core idea in clustering is to ensure intra-cluster homogeneity and inter-cluster heterogeneity, to arrive at meaningful clusters. This has already been explained in my blog post "Introduction to Clustering".
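The stability check described here can be sketched as a simple tolerance test on the centroid positions between iterations (the tolerance value is an assumption; the centroid coordinates are the ones from the example above):

```python
import math

def centroids_stable(old, new, tol=1e-4):
    """True when every centroid has moved less than `tol` between iterations."""
    return all(math.dist(o, n) < tol for o, n in zip(old, new))

# Centroids unchanged between the last two iterations of the example: stable.
print(centroids_stable([(5.6, 5.4), (16.6, 5.6)],
                       [(5.6, 5.4), (16.6, 5.6)]))  # -> True

# Centroids still shifting between earlier iterations: not yet stable.
print(centroids_stable([(6.7, 6.1), (17.5, 4.5)],
                       [(5.6, 5.4), (16.6, 5.6)]))  # -> False
```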
In the next post, I will take you through the cost function for K-Means and a few mathematical formulae explaining the two important iterative steps. Till then, see you :)