Hierarchical Clustering: A Deep Dive
In the last five blog posts, I have discussed the basics of Clustering and then, K-Means clustering in detail. In my "Introduction to Clustering Algorithms" I have introduced you to many categories of clustering algorithms. We have also seen that K-Means is a centroid based algorithm and hierarchical clustering is a connectivity-based algorithm.
Today we shall delve deeper into Hierarchical clustering.
Why Hierarchical Clustering?
In K-Means, when we saw some of the practical considerations in Part 3 we saw that we have to start with a value for K i.e. the number of clusters we want to create out of the available data. This is an explicit decision the modeller has to make. We also saw some methods to arrive at this K like the silhouette analysis and the elbow curve. However, this is in a way, forcing the data into a pre-determined set of clusters. This limitation is overcome in hierarchical clustering.
This is one distinct feature that makes it more advantageous than K-Means clustering in certain cases, though this becomes a very expensive proposition if the data is very large.
Similarity or Dissimilarity Measure
Now, if we do not have an upfront K value, we have to have some measure to help us decide whether a point creates a cluster on its own or it is very similar to other points in an existing cluster and hence belongs there. This is what is called the similarity or dissimilarity measure. Euclidean distance is the most common measure used as a dissimilarity measure. If the points are very similar, their euclidean distance would be very small and that implies they are very close to each other. In other words, the points with a very small dissimilarity measure are closer to each other.
The Iterative Process
Let us start with a sample data of 14 points, given here and see how the clustering is done.
Let us plot a scatter plot of this data and see if there are any visible clusters. Since we have only two dimensions, we have the luxury of visual representation.
When we move to actual industry problems, we would be dealing with much higher dimension data and hence visually, we can only do exploratory analysis with pair-wise data to get a feel for the clusters, at the best.
Looking at the plot above, we do see two-three clusters at least. Let us find out through hierarchical clustering, how many clusters are suggested and how close they are to each other.
The first step would be to assume that every point is its own cluster as shown in the diagram on "Startin Step". Since we have 13 data points, we are starting with 13 clusters.
Then, we calculate the euclidean distance between every point and every other point.
This will help us create the first cluster between the two nearest points.
In this case, points 10 and 11 are the nearest and they form the first cluster as shown in iteration 1.
Now you treat these two points as one cluster and so you have 12 clusters at the end of the first iteration or at the beginning of the 2nd iteration. During the 2nd iteration, point 9 joins the first cluster containing points 10 and 11, as that is the next nearest, as shown in Iteration 2.
This process goes on iteratively and the clusters that are closest to each other keep getting fused till we get one large cluster containing all the points. This is called Agglomerative Clustering or AGNES (Agglomerative Nesting).
This iterative process leads to the formation of a tree-like structure called the dendrogram. You can see the dendrogram for the above data in the figure below:
The first step is indicated by the smallest orange line joining points 10 and 11. The second iteration is represented by the orange line that joins 9 into the same cluster. In the third iteration, points 1 and 2 form their own cluster. You go by the height to know which is the next data point that formed a cluster.
The height of the dendrogram is a measure of the dissimilarity between the clusters and in this case that is the euclidean distance between the points. Based on this you can see that the dissimilarity is very low between points 5 and 6, 0 and 8, 1 and 2 respectively and they quickly form their own clusters on each subsequent iteration. The height at which they join also is very small indicating that the euclidean distance is very small between them.
When you see long lines that create branches way below, you know that the dissimilarity measure or the euclidean distance is very high between them and hence they form very distinct clusters. In the above diagram, the blue line is clearly indicating the dissimilarity is very high between the orange and the green clusters.
But how do we calculate the distance between a cluster containing many points (more than one) and another data point outside the cluster? This leads us to the concept of linkage.
Linkage is a measure of dissimilarity between clusters having multiple observations.
In a multi-point cluster, distance is calculated from every point in the multi-point cluster with an external point and the minimum distance is taken, often. This way of calculating the distance is one type of linkage called single linkage. This is not the best though in terms of good quality of clustering but is chosen because of the compute efficiencies involved.
There are other types of linkage that we can see in a subsequent post.
Deciding the Clusters
Having now gone through the AGNES process, we finally have to decide how many clusters to create and what are they? How do we identify the clusters using the dendrogram?
You do a horizontal cut on the dendrogram and every group below becomes a distinct cluster. If you cut at a height of 7 along the Y-axis, you create two clusters as shown in orange and green.
If you cut at a height of 5, you get three clusters consisting of data points12-9-10-11 in one cluster, 4-5-6-7 and 0-8-3-1-2 in the other two clusters
Libraries that exist for hierarchical clustering allow to "cut-tree" like the above and come up with varying number of clusters.
Hierarchical clustering is used when we do not want to decide the number of clusters upfront. It is computationally intensive due to the way linkages are established between the points to create the clusters.
There are two ways of hierarchical clustering, AGNES or Agglomerative Nesting and DIANA or Divisive Analysis, both of which have been discussed in the Introduction to Clustering Algorithms. While the former starts with each point as an independent cluster and goes iteratively to have all the points in one large cluster, the latter does the opposite.
In other words, bottom-up clustering is AGNES and top-down clustering is DIANA.
Most often the dissimilarity measure used is Euclidean distance. Linkages are the way this is calculated for multi-point clusters. This in summary is all about Hierarchical Clustering.