K-Means Clustering: Part 2 of 3
Last week, we looked at the basic understanding of how K-Means Clustering works through the 5-step process where the two steps of assignment and optimisation are done iteratively to get stable clusters. For those who have not read it, I would suggest that you read K-Means Clustering: Part 1 of 3 before you proceed further.
We know in Machine Learning, we need to understand what is the cost function that any algorithm is trying to work with, so that we can either minimise or maximise it. And further, automate it.
In this part 2, we would look at the mathematical representations of the cost function and the two steps of assignment and optimisation.
Straight away, I would start with the cost function represented by J and then explain it. The cost function is:
where K represents the number of clusters and “k” would be the cluster number to which the data point “i” belongs. k would range from 1 to K. Therefore for every “i” in cluster “k”, the Euclidean distance (square of the difference) between the ith data point and the centroid (represented by the mean mu-k) is taken and summed over all the points in that cluster. This is repeated for every cluster in the K clusters and the summation is done over all the K clusters.
Therefore, this is represented as two summations. One summation is the sum of all distances of points within one cluster from the centre. The next level summation is the sum of all the distances across all clusters.
Hence, J stands for the total squared error across all the clusters w.r.t their own centroids and that is what we want to minimize. If the clusters are very tight and good, then the overall cost function value will be very low. And that is what we are aiming for. In layman terms, that would mean that all the data points of a cluster are very close to the centroid.
Let us look at the similar mathematical representations of the 2 important steps: the assignment step and the optimization step.
The Assignment Step:
This is the step where the distance between the centroid and every data point is calculated. This can be represented as (a function of the points in a cluster and the cluster mean):
This distance is calculated K times for each point if there are K clusters and hence K cluster centroids. So, if we have 10 data points and 3 cluster centroids, we start with data point 1 and calculate 3 distances of point 1 from the 3 cluster centres. And the minimum distance among them is taken and the point 'i' (which is 1 in this case) is assigned to that cluster, say cluster 2. Likewise, we take data point 2 and then calculate the 3 distances from the 3 centroid points and assign data point 2 to the cluster which has a minimum distance from this data point.
So, what we are trying to do here is calculate the minimum (argmin) of the distance of every point with every cluster centre, shown as
This is repeated for every data point and the data point is assigned to the cluster whose centre it has a minimum distance with. In the above example, we would have calculated 10 points x 3 centroids = 30 distances before we have assigned all the 10 points to 3 clusters.
The Optimization Step:
Now, having created the clusters based on the distance calculations, you recompute the cluster centre mu-k for K clusters. You calculate the mean or the cluster centre as shown here:
Where n-k is the number of data points that belong to the kth cluster and x-i are all the points that belong to the kth cluster. The mean is taken for only those data points that belong to that cluster. So, if cluster 1 had 5 points, the mean of all these 5 points is taken to come up with the new centroids. Note that the mean is taken for each dimension of the data point. If the data point is presented by x,y,z, then the mean is individually taken for x, y and z respectively to calculate the new centroid represented probably by its own x,y,z values.
This formula must look very familiar. it is nothing but the formula for mean.
This is what is repeatedly done for all data points of every cluster to get the new cluster centres of each cluster. In the example taken above, the means are found for all the 3 clusters and 3 news cluster centres are calculated.
The above two steps are repeated till we reach a point where the centroids do not move any further. That determines our stopping point of the algorithm and the clusters so formed are the most stable clusters.
These formulae can be used in an excel sheet, to begin with, along with the data points shared in Part 1 of this article and you can try it out on your own.
In Part 2, we have understood the mathematics that is used in K-Means. Nothing too difficult. However, there are quite a few practical considerations that impact the usefulness of the K-Means clustering. Each of those will be talked about in Part 3 of this series.