K-Means Clustering: Part 3 of 3
Through Part 1 and Part 2 of this series, we have understood a great deal about K-Means clustering, both theoretically and mathematically. If you have not read them, I recommend reading them before going through this blog post.
So far, we have seen just the basics of the K-Means algorithm. It certainly helps in the unsupervised clustering of data. However, we must realise that none of this is completely foolproof.
There are a lot of decisions and conscious considerations involved in coming up with useful clusters. Let us look at some of the important ones.
Consideration 1: K value
We have been saying that we will start with "K" clusters. But how do we decide the right value for K? Should we create 2, 3, 5, or 10 clusters? What is the right number?
The first thing that comes to mind is to look at a scatter plot like the one here, where it will be obvious how many clusters exist. Isn't it?
Yes, when we have a plot like the one here and can see the intuitive number of clusters, it is very easy to say that the data has 3 clusters and so K should be 3. In reality, however, data is rarely just 2-dimensional or even 3-dimensional. The moment we move to data with many dimensions, visual representation is not always possible, and we have to have other means to decide the number of clusters we want to create.
So, what we are saying here is that we want to find the natural number of clusters that already exist in the data, without being able to see it visually. That is certainly one criterion for deciding on K.
For this, there are many methods that can aid you, but I would like to mention two of them here: silhouette analysis and the elbow method. They help in coming up with the right number of clusters using a quantitative method, indicating the natural or intuitive number of clusters that exist in the data.
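As a quick illustration, here is how both methods might look in code, assuming scikit-learn is available; the synthetic data set and all parameters here are purely illustrative choices of mine, not prescriptions:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Illustrative synthetic data with 3 well-separated clusters
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=42)

for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    # Elbow method: inertia (within-cluster sum of squares) always drops as
    # K grows; you look for the "elbow" where the drop flattens out.
    # Silhouette: closer to 1 means tighter, better-separated clusters.
    print(k, round(km.inertia_, 1), round(silhouette_score(X, km.labels_), 3))
```

You would pick the K where the inertia curve bends or where the silhouette score peaks; on well-separated data like this, both typically point to the same K.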
However, apart from the above-mentioned quantitative methods, business or domain knowledge would also have to be used to decide K, the number of clusters. Even if silhouette analysis says 3 clusters exist, you might believe it makes sense to have 4 clusters based on your business experience and knowledge. So, you could go with 4 and see whether you get the benefits of clustering from that.
Consideration 2: The initial centroids
Again, we have so far randomly chosen the initial cluster centres. If we choose the initial centroids randomly, we should be aware that we may not always end up with the same set of clusters!! Don't believe me?
Let us take an example data set and a random set of starting centroids. I chose completely different centroids to start with for the below data and the clusters I obtained were very different each time. The pink and blue are the clusters obtained each time I started with different random centroids.
This was obtained using https://www.naftaliharris.com/blog/visualizing-k-means-clustering/, which provides a very good simulation for the k-means algorithm with various sample data.
So you see, in certain types of data, the initial centres can have an impact on the clusters formed later. The clusters can keep varying. Hence, the initial centroids have to be chosen wisely.
So, what is the criterion or intelligence to be used to decide the right set of centroids to begin with? One of the standard ways of choosing the initial centroids is an algorithm called K-Means++ (which we can perhaps address in a later blog).
At a very high level, what this algorithm does is help you pick points that are as far apart as possible as the initial centroids, again through distance calculations. This can be quite an intensive process if the data set is large.
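To make the idea concrete, here is a rough NumPy sketch of the K-Means++ seeding step; the function name and data are my own illustrative choices, not from any particular library:

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """A rough sketch of K-Means++ seeding: each new centroid is sampled
    with probability proportional to its squared distance from the nearest
    centroid chosen so far, so far-apart points are strongly favoured."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]          # first centroid: uniform pick
    for _ in range(k - 1):
        # squared distance of every point to its nearest existing centroid
        d2 = np.min(((X[:, None, :] - np.array(centers)) ** 2).sum(-1), axis=1)
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers)
```

For what it is worth, scikit-learn's KMeans uses this seeding by default (init="k-means++"), so in practice you rarely have to write it yourself.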
Consideration 3: Clusterability of the data
From the above example scatter plot, you would have wondered whether there are any natural clusters at all in the given data. In fact, that is uniformly spread, random data and hence is not suitable for forming clusters. So, before we jump into clustering any data, we have to check what is called the "cluster tendency" of the data. If the data shows no cluster tendency, there is no point in trying to cluster it; it would be a futile effort. So, how do you check the cluster tendency?
The cluster tendency is given by a statistic known as the "Hopkins statistic". If the value of the Hopkins statistic is close to 1, the data is said to be clusterable; if it is around 0.5, the data is uniformly distributed, as in the above example, with no cluster tendency. If the statistic is close to 0, the data is in a grid-like, regularly spaced format, again with no possibility of meaningful clusters. Hence, you would look for a Hopkins statistic close to 1 before you embark on clustering.
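Since the Hopkins statistic is not part of the common Python libraries, here is a minimal NumPy sketch of one common formulation; the function name, sample size, and data are illustrative assumptions:

```python
import numpy as np

def hopkins(X, m=50, seed=0):
    """A sketch of the Hopkins statistic. It compares nearest-neighbour
    distances of m uniformly generated points (u) against those of m
    sampled real data points (w): H = sum(u) / (sum(u) + sum(w)).
    H near 1 suggests clusterable data; near 0.5, roughly uniform data."""
    rng = np.random.default_rng(seed)
    # m uniform random points inside the data's bounding box
    U = rng.uniform(X.min(0), X.max(0), (m, X.shape[1]))
    u = np.array([np.min(np.linalg.norm(X - p, axis=1)) for p in U])
    # m real data points, each compared to its nearest *other* data point
    idx = rng.choice(len(X), m, replace=False)
    w = np.array([np.sort(np.linalg.norm(X - X[i], axis=1))[1] for i in idx])
    return u.sum() / (u.sum() + w.sum())
```

On tightly clustered data this comes out close to 1, while on uniform random data it hovers around 0.5, matching the interpretation above.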
Consideration 4: Impact of Outliers
For most algorithms, we know outliers play havoc unless treated. So it is with K-Means clustering too. We must recollect that K-Means is heavily dependent on the calculation of means, and an average or mean is always impacted by outliers.
If there are outliers, they tend to pull the cluster away from its natural centre, and sometimes bring along points that naturally belong to another cluster. This cannot be corrected until we treat or remove the outliers. The farther the outlier, the more impact it has on the homogeneity of the cluster and hence on the formation of the right clusters.
Take a look at Figures 1 and 2. Figure 1 has no outliers. Figure 2 has one outlier at (60, 2). You can see that the natural clusters are well-formed in the first case, while the clusters are totally skewed in the second. Almost all of the data looks like one cluster, with the outlier forming another on its own, though it drags two more data points along with it into the second cluster. Clearly, the intra-cluster homogeneity is very low here, and the inter-cluster heterogeneity is also low, defeating the aim of clustering itself.
Hence, it is imperative to treat outliers before clustering is undertaken. The various methods that can be employed to treat outliers are discussed in an earlier blog of mine on the Treatment of Outliers.
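Here is a tiny numeric illustration of the pull an outlier exerts on a mean, using an outlier at (60, 2) as in Figure 2; the other points, and the simple IQR-based treatment shown, are my own illustrative choices:

```python
import numpy as np

# A small, tight cluster of points
cluster = np.array([[1.0, 2.0], [2.0, 1.0], [1.5, 1.5], [2.0, 2.0]])
centroid = cluster.mean(axis=0)                     # natural centre: (1.625, 1.625)

with_outlier = np.vstack([cluster, [[60.0, 2.0]]])  # add one outlier at (60, 2)
shifted = with_outlier.mean(axis=0)                 # centre dragged far toward x=60
print(centroid, shifted)

# One common treatment (see the earlier blog on outliers): drop points
# beyond 1.5 * IQR of each dimension before clustering.
q1, q3 = np.percentile(with_outlier, [25, 75], axis=0)
mask = np.all((with_outlier >= q1 - 1.5 * (q3 - q1)) &
              (with_outlier <= q3 + 1.5 * (q3 - q1)), axis=1)
print(with_outlier[mask].mean(axis=0))              # back to the natural centre
```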
Consideration 5: Data points on largely different scales
Data on different scales is another aspect that impacts clustering considerably. Recollect that this algorithm works on distances. If some of the data is on a very large scale and some on a much smaller scale, the distance calculation would be dominated by the dimension on the larger scale, which again impacts cluster formation.
Though I am not getting into examples here, it is best to standardise all the data to a common scale, that is, to transform each variable to have a mean of 0 and a standard deviation of 1, ensuring that equal weightage is given to all variables in the cluster formation. Their spread and other characteristics are preserved, just brought to a comparable scale. You can look up the various methods of feature scaling, as it is called, in my earlier blog on the importance of and methods for feature scaling.
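As a small sketch, z-score standardisation takes one line in NumPy; the salary/age columns below are an illustrative assumption:

```python
import numpy as np

# Two columns on wildly different scales: e.g. salary vs age.
# Without scaling, distances would be dominated almost entirely by salary.
X = np.array([[70000.0, 25.0],
              [48000.0, 42.0],
              [90000.0, 31.0],
              [55000.0, 58.0]])

# z-score standardisation: each column ends up with mean 0 and std 1
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_scaled.mean(axis=0))  # ~ [0, 0]
print(X_scaled.std(axis=0))   # ~ [1, 1]
```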
Consideration 6: Categorical Variables
The K-Means algorithm itself works on the concept of distance calculations and hence would not work with categorical variables. You might have to look at algorithms like K-Modes for such data.
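For a flavour of what K-Modes changes, here is a sketch of its two building blocks: the simple matching (Hamming-style) dissimilarity it uses in place of Euclidean distance, and the per-attribute mode it uses in place of the mean. Function names and the sample records are illustrative:

```python
def matching_dissimilarity(a, b):
    """Simple matching dissimilarity: the number of attributes on which
    two categorical records disagree (0 = identical records)."""
    return sum(x != y for x, y in zip(a, b))

def mode_of(records):
    """Cluster 'centre' for categorical data: the per-attribute mode."""
    cols = list(zip(*records))
    return tuple(max(set(c), key=c.count) for c in cols)

customers = [("red", "S", "online"),
             ("red", "M", "online"),
             ("blue", "M", "store")]

print(matching_dissimilarity(customers[0], customers[2]))  # disagree on all 3
print(mode_of(customers))  # per-column majority value
```

K-Modes then iterates assignment and centre updates exactly as K-Means does, just with these two replacements.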
A few drawbacks of K-Means

K-Means works well only with spherical clusters and not with other, non-spherical types of clusters. For example, if the data forms two elongated, cylinder-like groups lying parallel to each other, the clusters would be ill-formed, because the algorithm is based purely on distance minimisation.
It does not have the ability to learn by itself what the right value of K needs to be; this has to be given to it as input.
It will go ahead and cluster any type of data even when there are no natural clusters available. For example, when we gave it the uniformly distributed data in the example under Consideration 2, it still went ahead and created 2 clusters.
K-Means also gives more weightage to bigger clusters compared to smaller ones.
K-Means is an easy-to-understand algorithm and even easy to implement, as long as all the practical considerations discussed here are kept in mind. It has two main steps, assignment and optimisation, that are iteratively applied to the given data points until the centroids no longer change. It is a very simple and yet very powerful and useful algorithm in the unsupervised category.
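The two iterative steps can be sketched in a few lines of NumPy. This is a bare-bones illustration that assumes no cluster ever becomes empty during the iterations; real implementations handle that case:

```python
import numpy as np

def kmeans(X, init_centers, max_iter=100):
    """A bare-bones sketch of the two-step K-Means loop."""
    centers = init_centers.astype(float).copy()
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Optimisation step: each centroid moves to the mean of its points
        # (assumes every cluster keeps at least one point)
        new_centers = np.array([X[labels == j].mean(axis=0)
                                for j in range(len(centers))])
        if np.allclose(new_centers, centers):  # centroids stopped moving
            break
        centers = new_centers
    return centers, labels
```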
We also looked at the mathematical representation of each of these steps and the cost function that we are minimising. This should give a good theoretical understanding of the K-Means algorithm.
Also, it has a few drawbacks that need to be kept in mind before using it for clustering. I hope to take you through an example with code in some later posts.