Introduction to Clustering Algorithms
Updated: Sep 21, 2021
As mentioned in one of my introductory articles on "Machine Learning Algorithms Categories", Clustering Algorithms are a set of algorithms that help in discovering interesting patterns in data, such as customers with similar tastes, similar behaviours. It looks for natural groups based on the given input data, thus the name clustering.
This is an unsupervised learning technique where there is no notion of labelled output data or target data. An unsupervised method is a machine learning technique in which the models are not trained on a labelled dataset but discover hidden patterns without the need for human intervention.
A few unsupervised learning techniques apart from Clustering are Association and Dimensionality reduction.
Examples of Clustering
Let us look at a few examples in order to understand clustering better.
If you are given a news feed from various news channels or portals and if you had to categorise them as politics, sports, financial markets etc. without knowing upfront, what categories exist, then, this is a typical clustering application. It may turn out that there are very standard categories that appear over and over again. Though the algorithm cannot name them automatically as sports or politics, it can cluster all the sports articles into one cluster and the political articles as another.
However, in certain periods, totally new categories may turn up. Like during the Olympics, a totally new category may be discovered as '"Olympics news" or the "Paralympics news". Being able to discover and identity newer clusters as they form and be able to categorise them as such is also a part of Clustering.
Another very common example is that of customer segmentation. If a large retailer wants to create promotions or marketing strategies based on customer's behaviour, it would need to be able to categorise or segment its customers based on their behaviours or demographics and so on. One way of segmenting could be based on spends. Another could be based on the age group related shipping habits. Yet another could be based o location and its influence like beaches versus high altitudes. It could be based on loyalists versus coupon lovers. These should be gleaned from the customer data the retailer has. Then, their promotions can be very targeted and the conversion rate could be improved immensely.
Clustering is also heavily used in the medical field like human gene clustering and clustering of organisms of different species or within a species too.
Note that in each of the cases there are no labels attached to the data. It is after the clusters are formed that you can get actionable insights from the clusters that are created.
Types of Clustering
There are many types of clustering algorithms of which here are the top 4 well-known ones:
Each of them has its own characteristics with its own advantages and disadvantages. Today, I will provide a brief introduction to a couple of clustering algorithms
K-Means algorithms which is one of the well-known clustering algorithms is a centroid-based algorithm.
Hierarchical clustering is a connectivity-based clustering algorithm.
All clustering algorithms try to group data points based on similarities between the data. What does this actually mean?
It is often spoken of, in terms of inter-cluster heterogeneity and intra-cluster homogeneity.
Inter-cluster heterogeneity: This means that the clusters are as different from one another as possible. The characteristics of one cluster are very different from another cluster. This makes the clusters very stable and reliable. For example, if a cluster of customers is created based on highly populated areas versus thinly populated areas. If the difference in the population is distinct like in cities and villages, they turn out to be very stable and distinct clusters.
Intra-cluster homogeneity: This talks about how similar are the characteristics of all the data within the cluster. The more similar, the more cohesive is the cluster and hence more stable.
Hence the objective of clustering is to maximise the inter-cluster distance (Inter-cluster heterogeneity) and minimise the intra-cluster distance (intra-cluster homogeneity )
This is one of the most popular clustering algorithms and one of the easiest as well. Here, we are looking to create a pre-determined "K" number of clusters.
Here the similarities or lack of similarities between data points, is decided based on distances between points. The distance is measured from a centroid and hence this is called a centroid-based clustering algorithm. If we are starting with K clusters, we start with K centroids randomly chosen (or there is more science to it). Then, the distance of each point with these centroids is calculated and the points are associated with the centroid they are nearest to. Note that it would be ideal to have the centroids placed as far away as possible from each other as possible.
Thus clusters of data points are formed. The centres/centroids are recalculated and again the steps are repeated till the points don't seem to be moving from one cluster to another.
The distance formula used here is the Euclidean distance between every point and the centroids. The closer the points are to each other, the greater the chance of belonging to the same cluster.
Euclidean distance is very simple high school geometry. A very simple explanation of the same can be found here.
We will look at a practical example with data in a subsequent article but for now, we can summarise our understanding as that the clusters are formed based on the distances between points in K-means where K stands for the number of clusters we have decided to create or glean from the data.
Here is a graph showing how shoppers have been clustered based on the amount they spend at the shop and the frequency of orders they place.
Notice, 3 clusters have been formed with the red showing customers who spend small amounts and shop very frequently. The black cluster shows the group of customers who shop less frequently but spend large amounts. The blue cluster shows the customers who spend small amounts and are not so frequent shoppers either, the least profitable customers.
One of the biggest disadvantages of K-Means clustering is that you have to decide or choose upfront, the K value. i.e. the number of clusters. This is overcome in the hierarchical clustering.
In Hierarchical clustering, you either start with all data as one cluster and iteratively break down to many clusters depending on similarity criteria or you start with as many clusters as your data points and keep merging them till you get one large cluster of all data points. This leads to hierarchical clusters of 2 types:
The positive point of hierarchical clustering is that you do not have to specify upfront how many clusters you want.
It creates an inverted tree-shaped structure called the dendrogram. A sample dendrogram is shown here:
Because of this tree structure and the way the clusters are formed, it is called hierarchical clustering. This figure shows the clusters created for the same customer data that was used to derive the 3 clusters of customers in the above graph under K-Means. Here too you can see it suggests 3 clusters through green, red and turquoise clusters.
Interpreting the dendrogram is the most important part of the hierarchical clustering. Typically one looks for natural grouping defined by long stems.
The height of the dendrogram at which the different clusters are fused together represents the dissimilarity measure. Often it is the Euclidean distance again, that is calculated to understand how similar or dissimilar two points are to each other and that is represented by the height in the dendrogram. Here the distance is calculated between points themselves and not any centroid. Hence it is called connectivity-based clustering algorithm.
The clusters that have merged at the bottom are very similar to each other and those that merge later towards the top are the most dissimilar clusters. We will talk about linkages and its types in a later article that goes into more details about hierarchical clustering.
Here is a brief description of the two types of hierarchical clustering and how they differ from each other though both end up creating dendrograms.
This starts with all the data in one large cluster, from where it is divided into two clusters based on the least similarity between them and further divided into smaller clusters until a termination criterion is met. As explained earlier, this is based on connectivity which is essentially saying that all the points close to each other will belong to one cluster.
In this, the clusters are formed the other way round. Every data point is taken as its own cluster, to begin with, and then it starts aggregating them into most similar clusters till it goes up to making one single cluster of all the data available. Hence it is also known as the bottom-up method.
The difference between these two methods is pictorially represented here.
Clustering algorithms are unsupervised algorithms. There are many types of clustering algorithms each with its own advantages and disadvantages. The inter-cluster heterogeneity and the intra cluster homogeneity play a huge role in the creation of clusters.
Out of the four categories of clustering algorithms, we have looked at examples of two common types of algorithms - the K-Means and the Hierarchical clustering, at a very high conceptual level. The mathematics and the intricacies will be looked at, in subsequent articles.