K-Means Clustering through An Example

Now that we have understood the basics of K-Means Clustering, let us dive a little deeper today.

Let us look at one practical problem and the solution

Problem Statement

An NGO that is committed to fighting poverty in backward countries by providing basic amenities and relief during disasters and natural calamities has got a round of funding. It needs to utilise this money strategically to have the maximum impact. So, we need to be able to choose the countries that are in direct need of aid based on socio-economic factors and health factors.

Let us use K-Means Clustering to create clusters and figure out the countries that are in greatest need as per the data provided.

You may find the data and the entire code in this git repo:

https://github.com/saigeethamn/DataScience-Clustering

Solution Thinking

If we want only the top 5 or top 10 countries that deserve aid, then we could think of a regression model. But we could also use clustering as a way to find out the cluster of most needy countries. Once we get the clusters, within that we could further analyse and decide where does the aid go.

I could have done K-Means Clustering or Hierarchical Clustering. I will go with K-Means for now, as we have understood that, in theory so far.

So, how and where do I start? I will be following the preliminary steps outlined in my previous post on "Steps towards Data Science or Machine Learning Models"

In this post, I will not explain the code for data analysis or preparation. I will just explain the bare minimum through plots and insights, as this code is pretty repetitive for all analyses. However, the K-Means part alone, I will walk through the code too.

Step 1: Reading and Understanding the Data

I have to load the data and understand it first. From the shape, I know that I have data of 167 countries and I have 10 columns of data including the country name. A brief description of the data is here:

Note that exports, health and imports columns are given as % of GDPP and hence they need to be converted back to absolute values for further analysis. You can refer to the notebook in the git repo to understand how that is done.

NOTE: The code for this article is fully available in this Github repo as a jupyter notebook. You can look at it in detail and execute it to understand the complete flow.

Step 2: Data Cleansing and Transformation

When I do a null value check, I do not find any missing data. Hence there is no null value treatment required and no columns or rows to be dropped either. All the data is numerical and so no categorical data encoding is also required.

Here, I should also do outlier analysis and treatment. However, I am interested in exploring the original data before I treat the outliers if any. Hence I move on to Step 3 consisting of EDA.

Step 3: Exploratory Data Analysis

The main steps here are univariate and bivariate analysis. I plot the distplot for all the data as shown here:

Univariate Analysis: Continuous Variables - Distplot

Most are left-skewed implying a large number of countries probably are in that cluster and there is a small number in the far-right cluster - behaviours of these 6 features:

Child Mortality
Exports
Health
Imports
Income
Inflation

Life expectancy, total fertility, income and gdpp show there are visible clusters. For Bivariate analysis, a heat map and pair plot are sufficient as all the data in continuous data

From this, I see that

There is a high positive correlation between GDPP and income, life expectancy, exports, health and imports.
There is a negative correlation between GDPP with Total fertility, Child Mortality and inflation
Exports, imports and health are highly correlated
Health is negatively correlated with Total Fertility, Inflation and Child Mortality
There is a strong correlation positive between Total fertility and Child mortality
Also a positive correlation between income and life expectancy

Hence, we have a good chunk of correlated data that should help in creating clusters. a scatter plot also helps is see if there are any visible clusters and hence you do a pair plot like this one:

Having understood the basic data, we move to the next step of data preparation.

Step 4: Data Preparation

Outlier Checks and Treatment

Since all the data is continuous data, we can look at box plots and see if there are outliers.

From this, we see that child mortality, exports, imports, income and inflation have outliers on the higher end and life expectancy at the lower end. Need to be watchful about capping the high-end values of data like inflation, child mortality and lower-end values of life expectancy as the needy countries should not lose out on aid due to this.

However, it is safe to cap the higher end values of income, exports, imports, gdpp. Hence I have chosen to cap the higher end at 0.99 quantile and the lower end to 0.1 quantile.

Not capping 'health' as it has almost continuous values outside the 100th percentile and that itself could contribute to a cluster. Again refer to the notebook in git to view the code for this.

Scale the Variables

Next, we scale the variables using a StandardScaler from sklearn.preprocessing library. Here, we do not split data into train and test as we are finding clusters across all of the data. It is not supervised learning and we do not test predictions against any target variable.

Additional Step - Check for Cluster Tendency

In part 3 of the theory on K-Means, I have spoken about having to check for the cluster-tendency of the given data. So, now we run the Hopkins test to check if this data shows a cluster-tendency.

A basic explanation for Hopkins statistic is available on Wikipedia and a more detailed level discussion is available here. It compares the data on hand with random, almost uniform data and comes up with whether the given data is almost as uniform or shows a clustering tendency. For this data, as seen in the code, we get a value of anywhere between 0.83 to 0.95 indicating that there is a possibility of finding clusters and hence we go ahead with K-Means Clustering

Step 5: K-Means Clustering

The first step in modelling is to figure out what is correct K for our data since we want to initialise the model with K. Again as mentioned in part 3, this is done using the elbow method or the silhouette analysis.

First, let's see the code for KMeans clustering with a random k. The code for clustering itself is literally 2 lines.

We have to import the KMeans library from sklearn.

from sklearn.cluster import KMeans

If we choose to go with any arbitrary number for K and create the cluster, here's how the code would look:

kmeans = KMeans(n_clusters=4, max_iter=50)
kmeans.fit(country_scaled)

We instantiate an object of KMeans class as kmeans. There are 2 args we pass: one is k i.e. the number of clusters we want to create. Here it has been arbitrarily chosen as 4. The second is the maximum number of iterations the algorithm has to go through. Recollect that 2 steps of calculating distance and reassigning points to a centroid happen iteratively. These two steps may not always converge. So, in such a case, stopping after 50 iterations is what the max_iter stands for and returning the clusters formed at the last iteration. There are a lot more arguments that you can look at help and understand. But this is the bare minimum for invoking the KMeans algorithm. Then you just take this kmeans instance and fit it against the scaled country data and four clusters are formed. It is as simple as that!!

However, deciding the value of K is a very important aspect and let us see how we decide the optimal number of clusters.

Optimal Number of Clusters

1. Elbow Curve

We create multiple clusters starting with k=2, and going on with 3, 4, 5 and so on. When adding more number of clusters is not beneficial, we stop at that point. We start with using K-Means clustering with K=2 to say 10. Here's how the code looks.

K-Means algorithm used is from the library sklearn.cluster

ssd = []
for k in range(2, 10):
    model= KMeans(n_clusters = k, max_iter=50, random_state=100).fit(country_scaled)
    ssd.append([k, model.inertia_])
    
plt.plot(pd.DataFrame(ssd)[0], pd.DataFrame(ssd)[1])
plt.xlabel("Number of Clusters")
plt.ylabel("Total Within-SSD")
plt.show()

And the plot we get is:

Now, let us understand what we are doing in the code.

model= KMeans(n_clusters = k, max_iter=50, random_state=100).fit(country_scaled)

Here we call the fit method on KMeans for each k value ranging from 2 to 10 and create the model. And then we use an attribute from the model to understand which value for K gives good clusters.

KMeans algo has an attribute called intertia_ which you can see in the sklearn documentation or by executing the command help(KMeans) in your jupyter notebooks. inertia_ is defined as the "sum of squared distances of sample to their closest cluster centre". So, if you have 3 clusters centres and each point is associated with one of them, then the squared distance of each of the points with their respective centres is given by inertia_. In fact, this is the cost function that we want to minimise as discussed in part 2 of my series on KMeans Theory.

So, we capture this for every k value in the range - in a list variable called ssd:

    ssd.append([k, model.inertia_])

And the next set of statements plot the value of inertia_ against the k value. So, wherever we get a significant dip in inertia_, we take that as the k value of choice. After a particular k the inertia_ does not show any significant improvement.

So, we see that there is a sharp dip in ssd from K=2 to K=3. Then the rate of fall slows down from K=4. It further slows down with higher Ks. Because of the shape of the curve at K=3, it is called an elbow curve. Given this insight, we could choose K as 3.

2. Silhouette Score

Now let us look at Silhouette Score too.

Broadly speaking, It is a measure of the goodness of the clusters created. We have understood in Part 1 of the series on KMeans that we want the maximization of the inter-cluster distance and the minimization of the intra-cluster distance. This is what is encapsulated in the silhouette score.

In other words, a metric that measures the cohesiveness of a cluster and the dissimilarity between clusters is called the silhouette score.

It is represented as follows:

where

p is the average distance to the points in the nearest cluster that the data point is not part of.

q is the average intra-cluster distance to all the points in its own cluster.

Let us understand the intuition behind this. By definition of maximization of inter-cluster distance, p should be as large as possible and by minimization of intra-cluster distance q should be as small as possible. If q is very small, the ratio is almost p/p and hence 1. If q is very large, the ratio is -q/q and hence -1

Therefore, the silhouette score combines the two (p and q) and ranges from -1 to 1. A score closer to 1 indicates that the data point is very similar to other data points in the cluster and a score closer to -1 indicates that the data points are not similar to other data points in its cluster.

This is calculated for every point and for every K. Then the same is plotted on a graph for every k.

So, whichever K has the maximum silhouette score is the one with the best inter-cluster heterogeneity and intra-cluster homogeneity. The silhouette scores seem to be very similar for k = 3, 4 or 5.

Here is the code that is written to calculate and plot the silhouette score

from sklearn.metrics import silhouette_score
ss = []
for k in range(2, 10):
    kmeans = KMeans(n_clusters = k, max_iter=50, random_state=100)
    kmeans.fit(country_scaled)
    silhouette_avg = silhouette_score(country_scaled, kmeans.labels_)
    ss.append([k, silhouette_score(country_scaled, kmeans.labels_)])
    
plt.plot(pd.DataFrame(ss)[0], pd.DataFrame(ss)[1])
plt.xlabel("Number of Clusters")
plt.ylabel("Silhouette Score")
plt.show()

Here we use the silhoutte_score from sklearn's metrics library. We create the KMeans clusters for each K in the range 2 to 10. And to the silhouette_score, we pass the scaled country data and the labels returned by KMeans to help in calculating the intra and inter-cluster distance averages for every point in this line:

silhouette_avg = silhouette_score(country_scaled, kmeans.labels_)

Finally, we gather the score against each k in the list named ss[] to help in plotting the graph.

Based on both of these tests, looks like 3 seems to be the right number of clusters. So, we will go ahead with this value of K and create 3 clusters.

kmeans3 = KMeans(n_clusters=3, max_iter=50, random_state=100)
kmeans3.fit(country_scaled)

Then we do the cluster analysis to see what direction or insight we get out of it. This cluster Profiling or analysis can help us finally say which are the countries that are in direst need of aid.

Step 6: Cluster Profiling for KMeans

Let's now start with understanding how many countries are in each cluster:

From the above code, you can see that there are 90 in cluster 2, 29 in cluster 1 and 48 in cluster 0.

Let us now plot a scatter plot with the 3 most important variables: income, GDPP and Child Mortality, for each cluster:

You can see the countries represented in red dots have low GDPP, low income and high child mortality. They would be the countries that would best benefit from aid.

We can plot a bar graph and box plots to understand whether these clusters are truly distinct in their characteristics.

The bar graph shows that the gdpp and income are quite different for the 3 clusters. The box plots show how the mean of gdpp and income is very low for cluster 0 while the child mortality is very high. This makes the profile of the countries very clear.

Since we want only the top 10 countries, we can sort by gdpp, child mortality and income in the ascending, descending and ascending orders respectively and take the top 10 countries for providing aid:

If the budget was able to support, the entire list of 48 countries in this cluster could have been considered as each is only slightly better than the other. However, with budgets constraints, the top 5 to 10 needy nations could be considered so that the impact is at least felt and a difference is made to the people receiving the aid.

In Conclusion:

Finding K for K-Means is an important pre-step for clustering. There are multiple methods to find the appropriate K. Cluster Profiling helps us derive more insights. However, data understanding, data preparation and transformation before clustering are also important steps that cannot be overlooked.

Developing a model is mostly the easiest step. However, again deriving insights from the model to get actionable results requires a deep enough understanding of the problem on hand and the implications at the ground level.

The entire code for this is available in the git repo whose link is given above. With more details in comments and explanations at each step. This is a fairly simple problem that was addressed through KMeans clustering.

Hope this was a useful code walkthrough with a significant example.

Decision Trees through an Example

Decision Trees - Feature Selection for a Split

Decision Trees - Homogeneity Measures