Sai Geetha M N

K-Means Clustering through An Example

Updated: Jan 16

Now that we have understood the basics of K-Means Clustering, let us dive a little deeper today.


Let us look at one practical problem and its solution.


Problem Statement


An NGO that is committed to fighting poverty in backward countries by providing basic amenities and relief during disasters and natural calamities has got a round of funding. It needs to utilise this money strategically to have the maximum impact. So, we need to be able to choose the countries that are in direct need of aid based on socio-economic factors and health factors.


Let us use K-Means Clustering to create clusters and figure out the countries that are in greatest need as per the data provided.


You may find the data and the entire code in this git repo:

https://github.com/saigeethamn/DataScience-Clustering


Solution Thinking

If we want only the top 5 or top 10 countries that deserve aid, then we could think of a regression model. But we could also use clustering as a way to find the cluster of most needy countries. Once we get the clusters, we could analyse further within that cluster and decide where the aid goes.


I could have done K-Means Clustering or Hierarchical Clustering. I will go with K-Means for now, as that is what we have covered in theory so far.


So, how and where do I start? I will be following the preliminary steps outlined in my previous post on "Steps towards Data Science or Machine Learning Models"


In this post, I will not explain the code for data analysis or preparation. I will just explain the bare minimum through plots and insights, as this code is pretty repetitive across analyses. For the K-Means part alone, however, I will walk through the code too.


Step 1: Reading and Understanding the Data

I have to load the data and understand it first. From the shape, I know that I have data of 167 countries and I have 10 columns of data including the country name. A brief description of the data is here:


Note that exports, health and imports columns are given as % of GDPP and hence they need to be converted back to absolute values for further analysis. You can refer to the notebook in the git repo to understand how that is done.
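As a rough illustration of that conversion, here is a minimal sketch. It assumes the DataFrame is called country_df and that the columns are named exports, health, imports and gdpp; the two rows are made-up values for illustration, not rows from the dataset, and the notebook may do this differently.

```python
import pandas as pd

# Illustrative rows only: these values are made up, not taken from the dataset.
country_df = pd.DataFrame({
    "country": ["A", "B"],
    "exports": [10.0, 50.0],   # given as % of gdpp
    "health": [5.0, 8.0],      # given as % of gdpp
    "imports": [40.0, 30.0],   # given as % of gdpp
    "gdpp": [1000.0, 40000.0],
})

# Convert each percentage column into an absolute per-capita value.
pct_cols = ["exports", "health", "imports"]
for col in pct_cols:
    country_df[col] = country_df[col] * country_df["gdpp"] / 100

print(country_df)
```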


NOTE: The code for this article is fully available in this Github repo as a jupyter notebook. You can look at it in detail and execute it to understand the complete flow.

Step 2: Data Cleansing and Transformation

When I do a null value check, I do not find any missing data. Hence no null value treatment is required, and no columns or rows need to be dropped either. All the data is numerical, so no categorical data encoding is required either.
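A null value check of this kind can be sketched as below. The small DataFrame is hypothetical and deliberately contains one missing value so that the output shows what a hit looks like; on the actual country data the counts would all be zero.

```python
import numpy as np
import pandas as pd

# Hypothetical mini-frame with one missing value, for illustration only.
df = pd.DataFrame({
    "country": ["A", "B", "C"],
    "income": [1200.0, np.nan, 45000.0],
    "gdpp": [500.0, 3000.0, 41000.0],
})

# Count missing values per column.
missing_per_column = df.isnull().sum()
print(missing_per_column)

# List any feature columns (other than the country name) that are not numeric
# and would therefore need encoding.
non_numeric = (
    df.drop(columns="country").select_dtypes(exclude="number").columns.tolist()
)
print(non_numeric)
```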


Here, I should also do outlier analysis and treatment. However, I am interested in exploring the original data before I treat the outliers if any. Hence I move on to Step 3 consisting of EDA.


Step 3: Exploratory Data Analysis

The main steps here are univariate and bivariate analysis. I plot the distplot for all the data as shown here:


Univariate Analysis: Continuous Variables - Distplot

Most are right-skewed (the bulk of the values sit at the low end, with a long tail to the right), implying that a large number of countries probably fall in that low-end cluster and a small number in the far-right cluster. This is the behaviour of these 6 features:

  1. Child Mortality

  2. Exports

  3. Health

  4. Imports

  5. Income

  6. Inflation

Life expectancy, total fertility, income and gdpp show visible clusters. For bivariate analysis, a heat map and a pair plot are sufficient, as all the data is continuous.


From this, I see that

  1. There is a high positive correlation between GDPP and income, life expectancy, exports, health and imports.

  2. There is a negative correlation between GDPP and Total Fertility, Child Mortality and Inflation

  3. Exports, imports and health are highly correlated

  4. Health is negatively correlated with Total Fertility, Inflation and Child Mortality

  5. There is a strong positive correlation between Total Fertility and Child Mortality

  6. Also a positive correlation between income and life expectancy
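Observations like the ones listed above are read off a correlation matrix. Here is a minimal sketch of computing one with pandas; the column names and the five rows are assumptions made up for illustration, chosen only so that the signs match the pattern described in the list.

```python
import pandas as pd

# Made-up values for illustration; not rows from the actual dataset.
sample = pd.DataFrame({
    "gdpp": [500.0, 2000.0, 8000.0, 20000.0, 45000.0],
    "income": [1200.0, 4500.0, 15000.0, 33000.0, 60000.0],
    "child_mort": [95.0, 60.0, 25.0, 10.0, 4.0],
    "total_fer": [6.1, 4.2, 2.8, 2.0, 1.7],
})

# Pairwise Pearson correlation between all numeric columns.
corr = sample.corr()
print(corr.round(2))
```

The heat map is simply this matrix drawn with a colour scale, so any plotting library that can render a matrix will do.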

Hence, we have a good chunk of correlated data that should help in creating clusters. A scatter plot also helps us see if there are any visible clusters, and hence you do a pair plot like this one:


Having understood the basic data, we move to the next step of data preparation.


Step 4: Data Preparation

Outlier Checks and Treatment

Since all the data is continuous data, we can look at box plots and see if there are outliers.



From this, we see that child mortality, exports, imports, income and inflation have outliers on the higher end and life expectancy has outliers at the lower end. We need to be careful about capping the high-end values of features like inflation and child mortality, and the low-end values of life expectancy, as the needy countries should not lose out on aid because of this.


However, it is safe to cap the higher-end values of income, exports, imports and gdpp. Hence I have chosen to cap the higher end at the 0.99 quantile and the lower end at the 0.1 quantile.

I am not capping 'health', as it has an almost continuous run of values beyond the upper whisker of the box plot, and that itself could contribute to a cluster. Again, refer to the notebook in git to view the code for this.
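Quantile capping of this kind can be sketched with pandas as follows. The column name income, the 0.99 and 0.1 quantiles from the text, and the synthetic values 1 to 101 are used purely for illustration; which columns get which cap in the notebook is not shown here, so treat the choice below as an assumption.

```python
import numpy as np
import pandas as pd

# Synthetic values 1..101 so the quantiles are easy to reason about.
df = pd.DataFrame({"income": np.arange(1, 102, dtype=float)})

# Compute the cap values from the data itself.
q_low = df["income"].quantile(0.1)
q_high = df["income"].quantile(0.99)

# Clip everything below q_low up to q_low and everything above q_high down to q_high.
capped = df["income"].clip(lower=q_low, upper=q_high)

print(q_low, q_high)
print(capped.min(), capped.max())
```

Capping keeps every country in the data set, which matters here because dropping rows would remove countries from consideration altogether.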


Scale the Variables

Next, we scale the variables using StandardScaler from the sklearn.preprocessing module. Here, we do not split the data into train and test sets, as we are finding clusters across all of the data. This is not supervised learning, and we do not test predictions against any target variable.
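A minimal sketch of the scaling step is shown below. The two feature columns and their values are made up for illustration; country_scaled is the name used later in this post for the scaled data.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up feature values on very different scales, for illustration only.
features = pd.DataFrame({
    "income": [1000.0, 5000.0, 20000.0, 60000.0],
    "child_mort": [90.0, 40.0, 15.0, 4.0],
})

# Fit on all rows (no train/test split) and transform in one step.
scaler = StandardScaler()
country_scaled = pd.DataFrame(
    scaler.fit_transform(features), columns=features.columns
)

# Each column now has mean 0 and (population) standard deviation 1.
print(country_scaled.mean().round(6))
print(country_scaled.std(ddof=0).round(6))
```

Scaling matters for K-Means because it is distance-based: without it, a feature such as income would dominate the distance simply because its numbers are larger.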


Additional Step - Check for Cluster Tendency

In part 3 of the theory on K-Means, I have spoken about having to check for the cluster-tendency of the given data. So, now we run the Hopkins test to check if this data shows a cluster-tendency.


A basic explanation of the Hopkins statistic is available on Wikipedia and a more detailed discussion is available here. It compares the data on hand with random, almost uniform data and tells us whether the given data is almost as uniform or shows a clustering tendency. For this data, as seen in the code, we get a value anywhere between 0.83 and 0.95, indicating that there is a possibility of finding clusters, and hence we go ahead with K-Means Clustering.
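To make the idea concrete, here is a rough sketch of one common simplified form of the Hopkins statistic. The function name hopkins, its parameters and the synthetic two-blob data are all assumptions for illustration; this variant uses plain nearest-neighbour distances (some definitions raise them to the power of the number of dimensions), and the notebook's own implementation may differ.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def hopkins(X, sample_size=None, seed=0):
    """Simplified Hopkins statistic: near 0.5 for uniform data, near 1 for clustered data."""
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    m = sample_size or max(1, int(0.1 * n))
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=2).fit(X)

    # u: distance from uniform random points (inside the data's bounding box)
    # to their nearest real data point.
    uniform_points = rng.uniform(X.min(axis=0), X.max(axis=0), size=(m, d))
    u = nn.kneighbors(uniform_points, n_neighbors=1)[0][:, 0]

    # w: distance from sampled real points to their nearest *other* real point
    # (the first neighbour of a fitted point is the point itself).
    sample_idx = rng.choice(n, size=m, replace=False)
    w = nn.kneighbors(X[sample_idx], n_neighbors=2)[0][:, 1]

    return u.sum() / (u.sum() + w.sum())


# Two tight, well-separated synthetic blobs: strongly clustered data.
rng = np.random.default_rng(42)
blobs = np.vstack([
    rng.normal(0, 0.1, size=(100, 2)),
    rng.normal(5, 0.1, size=(100, 2)),
])
h = hopkins(blobs, sample_size=20)
print(h)
```

Because the statistic depends on random sampling, it varies from run to run, which is why a range of values is reported rather than a single number.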


Step 5: K-Means Clustering


The first step in modelling is to figure out what the correct K is for our data, since we want to initialise the model with K. Again, as mentioned in part 3, this is done using the elbow method or silhouette analysis.


First, let's see the code for KMeans clustering with an arbitrary k. The code for clustering itself is literally 2 lines.

We have to import the KMeans class from sklearn.cluster.


from sklearn.cluster import KMeans

If we choose to go with any arbitrary number for K and create the cluster, here's how the code would look:


kmeans = KMeans(n_clusters=4, max_iter=50)
kmeans.fit(country_scaled)

We instantiate an object of the KMeans class as kmeans and pass it two arguments. The first is k, i.e. the number of clusters we want to create; here it has been arbitrarily chosen as 4. The second is max_iter, the maximum number of iterations the algorithm is allowed to run. Recollect that the two steps of calculating distances and reassigning points to a centroid happen iteratively, and they may take a long time to converge. In such a case, max_iter=50 makes the algorithm stop after 50 iterations and return the clusters formed at the last iteration.


There are a lot more arguments, which you can explore through help(KMeans), but this is the bare minimum for invoking the KMeans algorithm. Then you just take this kmeans instance and fit it against the scaled country data, and four clusters are formed. It is as simple as that!
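Once the model is fitted, the result can be inspected through its attributes. The sketch below uses synthetic two-dimensional data as a stand-in for country_scaled, and adds n_init and random_state (which are not in the call above) only to make the run repeatable.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the scaled country data: four well-separated blobs.
rng = np.random.default_rng(0)
centres = np.array([[0, 0], [6, 0], [0, 6], [6, 6]])
country_scaled = np.vstack([rng.normal(c, 0.3, size=(25, 2)) for c in centres])

kmeans = KMeans(n_clusters=4, max_iter=50, n_init=10, random_state=100)
kmeans.fit(country_scaled)

# labels_ holds the cluster id assigned to each row, in row order.
print(kmeans.labels_[:10])
# cluster_centers_ holds one centroid per cluster, in the scaled feature space.
print(kmeans.cluster_centers_)
# n_iter_ shows how many iterations were actually run (at most max_iter).
print(kmeans.n_iter_)
```

The labels_ array is what gets attached back to the country names later, so that each country can be looked up by cluster.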


However, deciding the value of K is a very important aspect and let us see how we decide the optimal number of clusters.


Optimal Number of Clusters

1. Elbow Curve

We build cluster models starting with k=2, then 3, 4, 5 and so on, and stop at the point where adding more clusters is no longer beneficial. Here we run K-Means for K from 2 to 9 (range(2, 10) excludes the upper bound). Here's how the code looks.


K-Means algorithm used is from the library sklearn.cluster


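import pandas as pd
import matplotlib.pyplot as plt

# Fit K-Means for each k and record the within-cluster sum of squared distances
ssd = []
for k in range(2, 10):
    model= KMeans(n_clusters = k, max_iter=50, random_state=100).fit(country_scaled)
    ssd.append([k, model.inertia_])

# Plot inertia against k to look for the elbow
ssd_df = pd.DataFrame(ssd, columns=["k", "inertia"])
plt.plot(ssd_df["k"], ssd_df["inertia"])
plt.xlabel("Number of Clusters")
plt.ylabel("Total Within-SSD")
plt.show()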

And the plot we get is:


Now, let us understand what we are doing in the code.

model= KMeans(n_clusters = k, max_iter=50, random_state=100).fit(country_scaled)

Here we call the fit method on KMeans for each k value from 2 to 9 and create a model for each. We then use an attribute of the model to understand which value of K gives good clusters.


The KMeans class has an attribute called inertia_, which you can see in the sklearn documentation or by executing help(KMeans) in your Jupyter notebook. inertia_ is defined as the "sum of squared distances of samples to their closest cluster center". So, if you have 3 cluster centres and each point is associated with one of them, then inertia_ is the sum, over all points, of the squared distance between each point and its own centre. In fact, this is the cost function that we want to minimise, as discussed in part 2 of my series on KMeans Theory.
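To see that inertia_ really is this sum, it can be recomputed by hand from the fitted model. The sketch below uses synthetic data and illustrative variable names (X, manual_ssd); n_init is added only to keep the run repeatable.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data: three blobs, for illustration only.
rng = np.random.default_rng(1)
X = np.vstack([
    rng.normal(0, 0.5, size=(30, 2)),
    rng.normal(4, 0.5, size=(30, 2)),
    rng.normal(8, 0.5, size=(30, 2)),
])

model = KMeans(n_clusters=3, max_iter=50, n_init=10, random_state=100).fit(X)

# For every point, pick the centre of the cluster it was assigned to,
# then sum the squared distances between points and their own centres.
assigned_centres = model.cluster_centers_[model.labels_]
manual_ssd = ((X - assigned_centres) ** 2).sum()

print(manual_ssd, model.inertia_)
```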


So, we capture this for every k value in the range - in a list variable called ssd:

    ssd.append([k, model.inertia_])

The next set of statements plots the value of inertia_ against the k value. Wherever we see a significant dip in inertia_, we take that as the k value of choice. After a particular k the inertia_ doe