There are a few more concepts that would help in using decision trees practically. However, I felt that sharing a good piece of code showing how they are built would be a welcome break from too much theory.

To appreciate or understand this, please go through these posts before you start on this example.

__Decision Trees - An Introduction__

__Decision Trees - How to decide the split?__

__Decision Trees - Homogeneity Measures__

__Decision Trees - Feature Selection for a Split__

I have downloaded the "Car Evaluation Data Set" from Kaggle for this example. It uses the overall price of the car, the maintenance cost, the number of doors, the number of people it can accommodate, the boot capacity and the safety rating to decide whether a car is acceptable, good, very good or totally unacceptable.

In the notebook, the first step is to read and understand the data. The code there is simple and straightforward, so I will not walk through it here.

The next step is to analyse and prepare data.

The data has no missing values. Also, it is all categorical data, so there is no question of outliers. Therefore, we move straight to splitting the data into train and test sets, after which the categorical encoding of the data is done.

Create the feature set X and the target variable y

Split the data set into train and test data sets:
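The notebook's exact code is not reproduced here, but a minimal sketch of these two steps, with a tiny stand-in dataframe (the column names, split ratio and seed are assumptions), might look like:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# A tiny stand-in for the Car Evaluation data; column names are illustrative
df = pd.DataFrame({
    'buying': ['low', 'high', 'med', 'low', 'vhigh', 'med'],
    'safety': ['high', 'low', 'med', 'high', 'low', 'med'],
    'class':  ['acc', 'unacc', 'good', 'vgood', 'unacc', 'acc'],
})

# Feature set X and target variable y
X = df.drop('class', axis=1)
y = df['class']

# Hold out part of the data for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
print(X_train.shape, X_test.shape)
```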

Now, the categorical encoder is used to convert all the features into ordinal variables

After encoding, the features' data has changed like this:

compared to the original data which was like this:

Please read this post on __Types of Variables__, if you want to understand more about types of variables like categorical and numerical etc.

If the variables are ordinal in nature, i.e. they have an inherent order in them (like cost being low, medium, high or very high), then we use the "OrdinalEncoder" for encoding them.
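As a sketch of how that encoding works (the column name and category levels here are assumptions, not the notebook's exact code), passing the categories explicitly preserves the inherent order:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Explicit category order: low < med < high < vhigh
enc = OrdinalEncoder(categories=[['low', 'med', 'high', 'vhigh']])
train = pd.DataFrame({'buying': ['low', 'vhigh', 'med']})

# fit_transform replaces each level with its position in the ordered list
train_encoded = enc.fit_transform(train)
print(train_encoded.ravel())  # low→0, vhigh→3, med→1
```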

Now that the data is ready, we can start modelling the data.

With these three lines of code, the model is ready!!
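The three lines in question would look roughly like this (a sketch with stand-in encoded data, not the notebook verbatim):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Tiny stand-in for the encoded training data
X_train = np.array([[0, 2], [1, 0], [2, 1], [3, 2], [0, 0], [2, 2]])
y_train = np.array(['unacc', 'acc', 'acc', 'good', 'unacc', 'good'])

# Instantiate with the depth capped at 3, fit, and the model is ready
dt = DecisionTreeClassifier(max_depth=3, random_state=42)
dt = dt.fit(X_train, y_train)
print(dt.get_depth())
```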

However, there are a lot of parameters that need to be tuned or controlled to get a useful decision tree model. Here, I have limited only the depth of the tree to 3. Some of the other parameters that can be tuned are max_features, max_leaf_nodes, min_impurity_decrease and so on. The default impurity criterion is 'gini', and hence the Gini index as described in __Decision Trees - Homogeneity Measures__ is used for calculating the impurity of a node.

Following the creation of the model, you would want to visualize the model.

This piece of code provides a very basic visualisation.
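That basic visualisation is typically done with sklearn's own plot_tree; a self-contained sketch (the data and feature names are stand-ins):

```python
import matplotlib
matplotlib.use('Agg')  # render without a display
import matplotlib.pyplot as plt
import numpy as np
from sklearn.tree import DecisionTreeClassifier, plot_tree

X = np.array([[0, 2], [1, 0], [2, 1], [3, 2]])
y = np.array(['unacc', 'acc', 'acc', 'good'])
dt = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X, y)

# plot_tree gives a quick, no-frills rendering of the fitted tree
plt.figure(figsize=(8, 5))
plot_tree(dt, feature_names=['buying', 'safety'], class_names=sorted(set(y)), filled=True)
plt.savefig('tree.png')
```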

However, I have also tried to visualise the model using the graphviz and pydotplus libraries. These can be installed in your Python environment using 'pip install' or in your conda environment using 'conda install'.

The graph that is obtained gives a lot of visual information about the measure that was used to check the homogeneity of the nodes and the features based on which a split has been done.

I am not getting into the code that draws the graph, as it is standard boilerplate that will work for any graphviz object produced by sklearn modules.
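For reference, the usual pattern (this is a sketch of that boilerplate, with stand-in data, and assumes the pydotplus rendering step is run separately since it needs the graphviz binary):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_graphviz

X = np.array([[0, 2], [1, 0], [2, 1], [3, 2]])
y = np.array(['unacc', 'acc', 'acc', 'good'])
dt = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X, y)

# export_graphviz emits DOT text describing the tree
dot_data = export_graphviz(
    dt, out_file=None,
    feature_names=['buying', 'safety'],
    class_names=sorted(set(y)),
    filled=True, rounded=True,
)
print(dot_data[:13])

# Rendering step (requires graphviz + pydotplus to be installed):
# import pydotplus
# pydotplus.graph_from_dot_data(dot_data).write_png('decision_tree.png')
```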

The tree shows that the first check is based on safety. Then, it checks the number of persons, followed by the maintenance cost. Since max_depth is 3, it stops there. Along the way, it keeps separating out groups with a Gini index of 0.0, meaning they are completely pure groups.

The root node's Gini index is 0.452, meaning it is not a completely pure node. The first check done is whether safety <= 2.5. This node started with a sample size of 1382.

value = [301, 58, 975, 48] tells us how many of each category of cars are there in this node, the categories being acceptable, good, unacceptable and very good.

And 'class' tells us which category this node is classified into. The class that occurs most often in the node dictates the class of the node.

At every level, you can see that, based on some criterion, a pure node consists of only unacceptable cars and hence has a Gini index of 0.0.

Having visualized the model, now we want to validate if this model works well on unseen data (here it is the test data).

So, predict the class of the test data using the fitted model

Then you check the accuracy and the confusion matrix

You do the same for train data also, to compare the two scores and see if the model has overfitted.
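Those validation steps could be sketched like this (with toy stand-in data; the notebook's real arrays are not reproduced):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

X_train = np.array([[0, 2], [1, 0], [2, 1], [3, 2], [0, 0], [2, 2]])
y_train = np.array(['unacc', 'acc', 'acc', 'good', 'unacc', 'good'])
X_test = np.array([[0, 1], [2, 0]])
y_test = np.array(['unacc', 'acc'])

dt = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_train, y_train)

# Predict on unseen (test) data, then score it
y_pred = dt.predict(X_test)
print(accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

# Repeat on the train data to compare the two scores and check for overfitting
print(accuracy_score(y_train, dt.predict(X_train)))
```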

The output obtained for test data is:

This says the accuracy of prediction on the test data is 81.79%, which is a decent accuracy score. The confusion matrix is a 4 x 4 matrix here because this is a multi-class classification problem with 4 classes.

How to interpret the confusion matrix is something I will reserve for another post. In short, the confusion matrix gives a count of true positives, true negatives, false positives (also known as type 1 errors) and false negatives (also known as type 2 errors).

The train data set accuracy score is 80.24% and that of the test data set is 81.79%. They are very close to each other and hence we can conclude that a good model has been created by the sklearn's DecisionTreeClassifier with a max_depth parameter of 3.

The model creation is a set of simple steps. However, hyperparameter tuning to get a good model is an art that is learnt with experimentation and experience over time. Also, data in real life is never so well prepared upfront; it has to be cleaned and prepared before being fed into a library that creates the model.

In the previous two articles "__Decision Trees- How to decide the split?__" and "__Decision Trees - Homogeneity Measures__", I have laid the foundations for what we will look at in this post.

There are many algorithms that are used to decide the best feature for splitting at a particular point in the decision tree build-up. To name a few:

CART (Classification and Regression Trees)

C4.5

C5.0

CHAID

ID3

Each of them has its own nuances and is suitable in different scenarios. However, CART is one of the most basic, and that is what we will look at today. CART is mainly used for binary splits, whereas CHAID produces multi-way splits.

CART is also the default implementation provided in many libraries, like Python's scikit-learn.

What it does is calculate the impurity of a node before a split, and then after a split based on a feature. It then looks at the delta, the decrease in impurity. The feature that reduces the impurity the most is selected as the "split feature" at that step.

The change in impurity is also sometimes known as **purity gain** or **information gain**.

To succinctly put it, the algorithm iteratively runs through these three steps:

Use the Gini index to calculate the pre-split and post-split impurity measures

Calculate the delta or the purity gain/information gain

Do a split based on the feature with maximum information gain.

This is repeated till we meet a stopping criterion for the decision tree creation. The stopping criterion could be that the purity of all leaf nodes is above an expected threshold. For example, if the purity is greater than 70%, you do not want to split further.

There are also other criteria that can be used to stop the creation of Decision trees.
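The three-step loop above can be sketched for a single split decision. This is an illustrative pure-Python implementation (binary features, two classes), not the CART code from any library:

```python
def gini(labels):
    """Gini impurity: 1 - sum(p_i^2) over the classes present in the node."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(rows, labels):
    """Pick the feature whose split yields the largest impurity decrease."""
    parent = gini(labels)
    best_feat, best_gain = None, 0.0
    n = len(labels)
    for feat in range(len(rows[0])):
        left = [lab for row, lab in zip(rows, labels) if row[feat] == 0]
        right = [lab for row, lab in zip(rows, labels) if row[feat] == 1]
        if not left or not right:
            continue
        # Weighted post-split impurity across the two child nodes
        child = len(left) / n * gini(left) + len(right) / n * gini(right)
        gain = parent - child  # purity gain / information gain
        if gain > best_gain:
            best_feat, best_gain = feat, gain
    return best_feat, best_gain

# Feature 1 separates the classes perfectly, so it is chosen
rows = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = ['A', 'B', 'A', 'B']
print(best_split(rows, labels))  # → (1, 0.5)
```

CART repeats this greedy choice on each child node until a stopping criterion is met.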

Suppose we have a dataset of a population of 100 people, some with diabetes and some without. We also know their gender and fasting sugar levels. Now we want to decide whether we should first split based on gender or based on fasting sugar levels.

Here is what the original data looks like based on gender and fasting sugar levels:

Just reading through the data: there are 45 male and 20 female non-diabetics, and 20 male and 15 female diabetics. In total, a 65:35 male-to-female ratio and a similar non-diabetic-to-diabetic ratio.

Along similar lines, among non-diabetics, 60 have fasting sugars < 120 mg/dl and 5 have > 120 mg/dl; among diabetics, 5 have < 120 mg/dl and 30 have > 120 mg/dl. While the choice of the feature to split by may seem obvious with these two features, it may not be when we have many features and large datasets.

Let us now try to split based on the feature "Gender". How would it look? The node before the split and after can be represented this way:

Let us now calculate the Gini index before and after the split, to see the purity gain or information gain.

We first calculate the probability of a data point being that of a non-diabetic, P(non-diabetic) = 65/100 = 0.65, and then the probability of a data point being that of a diabetic, P(diabetic) = 35/100 = 0.35.

Using these probability values, the Gini impurity of the root node is calculated as 1 − (0.65² + 0.35²) = 0.455.

The probability of the diabetic and the non-diabetic classes in the **male cluster** turns out to be

Therefore the Gini Impurity of the **male node** becomes:

Similarly, let us calculate the probability of the diabetic and the non-diabetic patients in the **female node**, and using these, the Gini impurity itself.

To get the overall impurity after the split, we need to take the weighted sum of the impurities measured on both the split nodes. How do we take the weighted sums?

We take the probability of being male and multiply it by the Gini impurity of the male node, and similarly multiply the probability of being female by the Gini impurity of the female node.

The probabilities of the genders are:

Therefore the Gini Impurity of the split nodes, based on gender, taken as a weighted sum is

The final reduction in impurity, i.e. the information gain, is given by:

If we were to split the root node based on the fasting sugar values, here is what we would get:

Based on these values, we want to calculate the Gini index of the root node (already done above) and the Gini index of the child nodes, to come up with the change in impurity levels.

The Gini impurity of the node with all people with < 120 mg/dl would turn out to be:

And the Gini Impurity for the node with sugars > 120 is

We find the weights of the two classes and then find the weighted sum of impurities of the split nodes.

Final Reduction in Impurity by splitting based on fasting sugars:

From the above set of Gini delta calculations, you see that the change in impurity for the split based on **Gender is only 0.0263** while that of the split based on **fasting sugars is 0.273**, almost 10 times more. Hence the CART algorithm would choose to first split based on fasting sugars rather than based on Gender.
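The whole comparison can be reproduced in a few lines. This is a sketch using the counts as read from the table above; the exact deltas depend on how the counts and intermediate rounding are taken, but the qualitative conclusion, that fasting sugar gives a far larger impurity reduction than gender, holds either way:

```python
def gini(counts):
    """Gini impurity of a node given class counts [non-diabetic, diabetic]."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def weighted_gini(groups):
    """Weighted impurity after a split, given per-child class counts."""
    total = sum(sum(g) for g in groups)
    return sum(sum(g) / total * gini(g) for g in groups)

root = gini([65, 35])  # 65 non-diabetic, 35 diabetic → 0.455

# Split by gender: male (45 ND, 20 D) vs female (20 ND, 15 D)
gender_gain = root - weighted_gini([[45, 20], [20, 15]])

# Split by fasting sugar: <120 (60 ND, 5 D) vs >120 (5 ND, 30 D)
sugar_gain = root - weighted_gini([[60, 5], [5, 30]])

print(round(root, 3), round(gender_gain, 4), round(sugar_gain, 4))
```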

There is another conclusion we can derive from these calculations: **fasting sugar is a more important feature for predicting diabetes than gender**. This is common knowledge in this example, but the technique is extremely helpful in determining feature importance even in less obvious feature sets.

With this, we have looked at all the basic aspects of decision trees and we are ready for a code-based deep dive into implementing a decision tree.

Hope you gained some knowledge here.

Just to recap, I said that the most often used measures of impurity are

Gini Index

Entropy

The other measure, called the classification error, is hardly ever used, but it helps us understand how impurity is measured, and hence I will explain it too.

This is the error we see when we go with a simplistic method of assigning the majority class label to a node. It means all of the minority data points are misclassified. The probability of misclassification of the minority points is taken as the classification error. Very Simple!

Let us take an example:

Assume we have a dataset with 1000 points. 200 of them belong to Class A and 800 of them belong to Class B, as shown here:

This dataset is neither clearly impure nor clearly pure. It seems to have a majority of Class B data. So, to check the purity of this node, we take the probability of a point belonging to either class A or class B

The probability of a data point belonging to class A in this dataset can be written as P(A) = 200/1000 = 0.2.

And the probability of a data point belonging to class B can be written as P(B) = 800/1000 = 0.8.

So, now if we assign the majority class label to this node, it would get the label of Class B. This would imply that 20% of the points (belonging to Class A) are misclassified. They actually belong to class A but are classified as class B. Hence, **the probability of the minority class becomes the error rate**. Therefore, the classification error E is defined as:

E = 1 − p_max

where p_max is the probability of the majority class.

This should intuitively explain what we are trying to call a **classification error**. The probability of the minority class itself is the error rate. It is sometimes also called the **misclassification measure**.

This should lay the intuitive basis for understanding the measure we use to decide the impurity or purity of a node

Let's move on to Gini Index

The Gini index is defined as the sum of p(1 − p) over all classes, where p is the probability of each class:

G = Σ p_i (1 − p_i), for i = 1 to K

where i runs from 1 to K, the number of classes in the data.

So, if we take the same example for which we calculated the classification error, the Gini index would be:

G = 0.2 × (1 − 0.2) + 0.8 × (1 − 0.8) = 0.16 + 0.16 = 0.32

For each class, we calculate p(1 − p) and add them up to get the Gini index.

We see that while the classification error value was 0.2, the Gini index is 0.32. It is a bit harsher on the misclassification, and rightfully so: it penalizes impure nodes a little more strongly.

Now let us look at Entropy

This concept has come from Physics - in particular thermodynamics, where entropy is defined as *a measure of randomness or disorder of a system*. And in some sense, that is what we are also trying to measure when we look for impurity.

Here, it is defined as the negative of the sum of p·log₂(p), where p is the probability of each class in that dataset. It is represented as:

H = −Σ p_i log₂(p_i), for i = 1 to K

If we apply this formula to the same example above, here's what we get:

H = −(0.2 × log₂ 0.2 + 0.8 × log₂ 0.8) ≈ 0.722

Here, the score is even harsher than the Gini index. But when scaled equally, they penalize almost identically, which we will understand later.
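All three measures can be checked in a few lines (this is an illustrative sketch, not code from the post):

```python
import math

def classification_error(probs):
    """1 minus the majority-class probability."""
    return 1.0 - max(probs)

def gini(probs):
    """Sum of p(1 - p) over the classes."""
    return sum(p * (1 - p) for p in probs)

def entropy(probs):
    """Negative sum of p * log2(p) over the classes."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# The 200 / 800 example: p(A) = 0.2, p(B) = 0.8
print(round(classification_error([0.2, 0.8]), 2))  # 0.2
print(round(gini([0.2, 0.8]), 2))                  # 0.32
print(round(entropy([0.2, 0.8]), 3))               # 0.722

# A perfectly mixed 50:50 node hits each measure's maximum
print(classification_error([0.5, 0.5]), gini([0.5, 0.5]), entropy([0.5, 0.5]))  # 0.5 0.5 1.0
```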

To understand how these three measures compare to each other and what they convey as well, we will have to go with a few more cases of impurity combinations in order to plot a graph for each of them and understand them.

Let us take the case when there is an equal number of data points from 2 different classes in a data node. i.e. 50% each.

If we take the probability of both the classes as 0.5 and apply the three formulae, we get the following values:

Classification error = 0.5

Gini Impurity = 0.5

Entropy = 1

These are the maximum impurity values that each can take on. The minimum is 0 in all the cases.

To take this further, I have taken the following dataset of class distribution variation and plotted a graph.

Starting from a pure dataset with only Class B data points and then going on increasing the impurity till both the classes are 50 each and finally reversing the impurity of the classes, the data you get for all three measures is shown here.

If we were to plot this data onto a graph, what you find is this:

We see that the graphs are symmetric in nature, as expected. It does not matter which class is the majority; the impurity is the same even if the numbers are swapped between the classes. Here, 0·log(0) is taken as 0 rather than undefined for the entropy function.

Notice the obvious symmetry, and the fact that the classification error is linear: it rises much more slowly than the other two, and falls similarly. Entropy and Gini rise and fall more rapidly and are actually very similar to each other if entropy were scaled down to Gini levels.

In practice, either of them can be used as the measure of impurity. However, computing power requirements (entropy needs logarithms) may tilt many towards Gini.

Hope you enjoyed this. I would look forward to hearing any feedback or comments.

In the __introduction to Decision trees__, we have seen that the whole process is to keep splitting one node into two based on certain features and feature values.

The idea of the split is to ensure that the subset is more pure or more homogeneous after the split.

There are two aspects we need to understand here:

The concept of homogeneity or purity - what does it mean?

How do we measure purity or impurity?

Only then, we can use this for splitting the nodes correctly.

Let us take an example to understand this concept.

Take a look at these two sets of data:

In dataset A, we have 4 boys and 4 girls, an equal number of each gender. This means that this dataset is a complete mix, or completely non-homogeneous, as there is a big ambiguity about which gender this dataset represents.

However, if you look at dataset B, you will see that all are girls. There is no ambiguity at all. This is, therefore, said to be a completely homogeneous dataset with 100% purity, or no impurity at all.

And all data could lie somewhere in between these two levels of impurity: Totally impure to totally pure.

This concept, though it seems trivial or pretty obvious, becomes the foundation stone on which decision trees are built and hence I thought of calling this out explicitly.

Since decision trees can be used for both classification and regression problems, the homogeneity has to be understood for both. *For classification purposes, a data set is completely homogeneous if it contains only a single class label. For regression purposes, a data set is completely homogeneous if its variance is as small as possible.*

We want to be able to split in such a way that the new groups formed are as homogenous as possible. We are not mandating a 100% purity but intend to move towards more purity than the parent node.

And we will see later (in another post) that the variable or feature we use for splitting also uses this concept of increasing homogeneity.

Going further, this concept also helps in determining the importance of a feature to an outcome, a very useful aspect when we want to take actions at a feature level.

There are various methodologies that are used for measuring the homogeneity of a node.

The two commonly used ones are:

Gini Index &

Entropy

But to better understand this concept, I will also look at another method, called the "Classification Error" or misclassification measure.

This is almost never used in real-life problems but helps clarify the concept of an impurity measure with ease.

This is the error we see when we assign the majority class label to a node. The probability of the minority data points, all of which get misclassified, is called the classification error.

Therefore, the formula for the classification error is:

E = 1 − p_max, where p_max is the probability of the majority class.

We will understand this with an example in the next post.

Similarly, I will share the formula for Gini Index and Entropy as well here but get into details with examples in the next post

This is defined as the sum of p(1 − p), where p is the probability of each class:

G = Σ p_i (1 − p_i), for i = 1 to K

where K is the number of classes.

This concept has come from Physics - in particular thermodynamics, where entropy is defined as *a measure of randomness or disorder of a system*. And in some sense, that is what we are also trying to measure when we look for impurity.

Here, it is defined as the negative of the sum of p·log₂(p), where p is the probability of each class in that node. It is represented as:

H = −Σ p_i log₂(p_i), for i = 1 to K

What these formulae convey is something we will get into with examples next week, to understand how each of these measures helps in assessing the purity or impurity of a node.

There are a few characteristics of decision trees that make them stand out as useful algorithms in specific situations.

1. **They are highly interpretable.**

If a patient falls in the last red box on the right and is diagnosed as diabetic, you know why. You can explain that this is a male and that his fasting sugars are around 180 and hence a diabetic, with a high probability.

Interpretability is an important requirement for many organizations when adopting an algorithm. If something goes wrong with a business decision, you should at least be able to explain what went wrong and correct it. Without interpretability, many top stakeholders may be uneasy about accepting your algorithms.

**2. It is a versatile algorithm.** It can be used for classification or regression problems. We know that classification is used when the target variable is discrete and regression when it is continuous. So, in decision trees, we can check the purity based on the homogeneity of the classes in a node for classification, and use something like the sum of squared errors (SSE) to find the lowest-SSE point for a split in regression. That is, by just changing the measure of purity, both classes of problems can be handled.

3. **Decision trees handle __multicollinearity__ better than linear regression does.** In fact, multicollinearity does not matter here. We cannot interpret a linear regression well if multicollinearity is not handled, but that is not a problem for decision trees.

4. Building the tree with splits is **pretty fast **and works well on large datasets too.

5. **It is also scale-invariant.** Unlike in linear regression, importance is not skewed by varying scales. Values are compared only within an attribute, and hence you can use data for decision trees without scaling. You can refresh your memory on __Feature Scaling in my earlier article__.

6. Another important advantage of decision trees is that they can **work with data that has a non-linear relationship between the predictors and the target variable**. The tree partitions the data into subsets that are approximately linear, so creating a sufficient number of splits helps in dealing with non-linear relationships.

So it has carved its own niche in solving problems because of these advantages.

However, it has certain **disadvantages** too, and those need to be kept in mind when finally deciding whether to go with it or not.

Decision Trees can create overly complex trees that lead to overfitting

They are also said to be unstable models, as they can vary largely with even small variations in the training data

Also, they are not good for extrapolation; they work well only within the range of the data used in training

Decision trees can also create biased trees if one class dominates the data. Hence, the dataset has to be balanced before fitting a decision tree

I plan to get into more details on building Decision Trees in the upcoming articles.

A decision tree algorithm is one that mimics the human decision-making process. It asks a question that can have more than one answer, branches off based on the answer, and then asks the next question. This continues till all questions are answered.

For example, take a very simplistic situation that we mentally sort out easily on a daily basis: deciding, based on the weather and the time available, what transport to choose to reach a venue. It looks like the decision tree below:

Is the Weather cloudy, sunny or rainy? And then the next question answered is the amount of time available to reach the destination. Based on these two questions, a decision is made on the transportation to take.

This is clearly an oversimplification of how the decision tree works. But this is just to explain what decision trees look like.

They have root nodes – the weather node here. Then, they have internal nodes like the cloudy and sunny nodes. And then you have the leaf nodes that give the final decision. There is a decision made at every node that decides the direction of the final decision.

Decision trees are clearly supervised models built based on already existing data that help create the internal and final leaf nodes based on various criteria.

As you can see it is a highly interpretable model. If you reach the decision to walk, you know it is because it is cloudy and you have more than 30 minutes to reach the venue.

Let us see a practical example, of a model that helps in predicting whether one is diabetic or not:

Here you can see that people aged less than 58 who are female and have fasting sugar less than 120 mostly do not have diabetes, while people above 58 who are male with fasting sugar <= 180 mostly have diabetes. There are many intermediate stages where the decision might change either way. Also, the majority class at a node decides the class of that node, bringing in only a particular level of accuracy in the prediction.

To get more and more accurate, you could go on till every node has only one data point that is accurately predicted. This would be a complete case of overfitting, with 100% accuracy on the train data but possibly very poor performance on any test data. This has to be avoided by fine-tuning what are called hyperparameters, which are discussed later.

1. It is a supervised algorithm – meaning that it learns from already classified data or data which already has the variable that needs to be predicted

2. It is also called a greedy algorithm: it maximises the immediate benefit rather than taking a holistic approach. The greedy approach makes it vary drastically with even small variations in the dataset. Hence, it is called a high-variance model.

3. It works in a top-down manner, splitting recursively from the root.

As you can see, we recursively split the data into smaller data sets. Based on what? Based on some features and the values of those features.

The data has to be split in such a way that the homogeneity or purity of the subsets created is maximised. But how do we decide which feature to split by first, and which feature goes next? How do we decide the value at which the split threshold is set? How long do we go on splitting, i.e. what is the stopping criterion?

There are what are called Hyperparameters that help in making most of the decisions that need to be made.

Hyperparameters are simply parameters that are passed to a learning algorithm to control the training of the model. This is what helps in tuning the behaviour of the algorithm.

For example, in the Decision tree model, when the model is instantiated, a hyperparameter that can be passed is the max_depth – the number of levels that you want to split and train up to.

So, if you give max_depth as 5, the splits will happen only up to 5 levels, even though the accuracy may be questionable. Hence, there is a lot of power in the hands of the model designer to tune the learning and get a better result.

Similarly, there are many more hyperparameters that we will see with more examples in later posts which will make this concept clearer.

Just to get a peek into the possible hyperparameters, look at this piece of code:
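The snippet referenced is not reproduced here, but based on the description it would look roughly like this (a sketch; only max_depth is set, matching the example above):

```python
from sklearn.tree import DecisionTreeClassifier

# Only max_depth is given; everything else stays at its default
dt = DecisionTreeClassifier(max_depth=5)

# get_params shows the full set of tunable hyperparameters
print(dt.get_params()['max_depth'])       # 5
print('max_leaf_nodes' in dt.get_params())  # True
```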

Here, the decision tree classifier has been given one hyperparameter, max_depth. The others are left at their defaults. But others that can be given are max_features, max_leaf_nodes, min_impurity_decrease, min_impurity_split etc.

Tuning the values of each of these can control overfitting and yet improve the accuracy of predictions on new data.

Hope this gave you a high-level introduction to decision trees.

Just to reiterate the problem statement, an NGO that is committed to fighting poverty in backward countries by providing basic amenities and relief during disasters and natural calamities has got a round of funding. It needs to utilise this money strategically to have the maximum impact. So, we need to be able to choose the countries that are in dire need of aid based on socio-economic factors and health factors.

I would highly recommend that you go through my __article on K-Means__ to understand the solution thinking, data cleansing, exploratory data analysis and data preparation steps.

Here I would like to just touch upon the Hierarchical modelling aspects instead of the K-Means algorithm used in the __previous article on K-Means__.

In the notebook, steps 1 to 4 are all around data understanding, cleaning and preparation, which remain the same irrespective of the type of clustering we are aiming for. These steps have all been detailed in the __K-Means Clustering article__ already mentioned.

Here I go directly to Step 5.

I use the scipy library here, instead of scikit-learn used in the earlier examples.

So, the three imports I have done are:

**linkage** is a routine that allows you to choose the type of linkage, as discussed recently in my article on __types of linkages for Hierarchical clustering__. One has to keep in mind the size of the data on hand and the computational complexity required to arrive at the clusters while deciding the linkage type. Of course, you also want as distinct a set of clusters as possible. This is the balance that has to be struck at this step.

The **dendrogram** routine in the scipy package helps you visualise the dendrogram created by the hierarchical model. The **cut_tree** routine helps in creating the clusters by cutting the dendrogram into the number of clusters you want to get.

As seen in the __article on linkages__, the single linkage model is created by just one line of code:
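That one line, plus the dendrogram call, could be sketched as follows (the data here is a random stand-in for the scaled country data):

```python
import matplotlib
matplotlib.use('Agg')  # render without a display
import matplotlib.pyplot as plt
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

# Stand-in for the scaled country data (rows = countries, cols = features)
rng = np.random.default_rng(0)
data = rng.normal(size=(10, 3))

# Single linkage: cluster distance = smallest pairwise point distance
mergings = linkage(data, method='single', metric='euclidean')

# The linkage matrix has n-1 rows: two merged ids, their distance, new size
dendrogram(mergings)
plt.savefig('single_linkage.png')
print(mergings.shape)
```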

Then the dendrogram obtained is:

Clearly, this is hardly interpretable or clean. Single linkage relies on taking the smallest distance between clusters as the measure of dissimilarity.

We then try complete linkage to see if we get a better dendrogram:

This creates a much cleaner dendrogram, and you can see at what level you have clearly distinct clusters formed. You could choose to have 2, 3, 5 or even 6 clusters depending on your business case.

In the Jupyter notebook, I have first decided to go with 3 clusters and used the cut_tree routine to achieve this:
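A sketch of that cut, together with the assignment step that follows (stand-in data; the real notebook works on the full country dataframe):

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import linkage, cut_tree

# Stand-in for the scaled country data
rng = np.random.default_rng(0)
data = rng.normal(size=(10, 3))
mergings = linkage(data, method='complete', metric='euclidean')

# Cut the dendrogram to get exactly 3 clusters
cluster_ids = cut_tree(mergings, n_clusters=3).reshape(-1)

# Assign the cluster id back to the (stand-in) country dataframe
country_df = pd.DataFrame(data, columns=['gdpp', 'income', 'child_mort'])
country_df['cluster_id'] = cluster_ids
print(country_df['cluster_id'].value_counts().sort_index())
```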

Then, I assign the cluster id so obtained, to the country dataframe as seen here:

And if I were to count how many countries are in each of the clusters, I see this:

When I profile these clusters, I realise that cluster 0 is the one containing the poor nations that need aid, and there are 50 countries in it. That is not helpful: I cannot get back to the CEO saying the money on hand is needed across 50 countries. No one would benefit from that.

Hence, I now cut the tree into 5 clusters. This improves the numbers for me. How do I know?

Let's have a look at some of the profiling steps.

I plot a scatter plot of the 5 clusters as shown here:

There is a whole host of countries, represented by the red dots, that have a very low GDPP and high child mortality, similarly low income and high child mortality, and finally low income and low GDPP.

We can get another view by plotting a bar graph:

The scale of GDPP and income for the better-off countries is so large that the child mortality numbers are hardly visible. Despite that, child mortality is clearly visible in cluster 0. Hence, that seems to be the cluster with the poorest nations.

How many countries are part of cluster 0? Let us check.

There are 38 of them. The 50 from earlier have been split into cluster 0 with 38 countries and cluster 3 with 12.

Let us also get an idea of the spread and the median of the 5 clusters around GDPP, income and child mortality by plotting box plots:

It is absolutely clear that the spread and median of child mortality are high for the countries with the lowest income and GDPP.

Then, we can prioritise amongst these countries by sorting on child mortality, GDPP and income, as they seem to be the indicators to choose for prioritisation:
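That sort could look like this (a sketch with made-up values; the column names follow the Kaggle dataset, and the sort direction, highest child mortality first and then lowest GDPP and income, is an assumption about the prioritisation):

```python
import pandas as pd

# Stand-in for the cluster-0 countries
poor = pd.DataFrame({
    'country': ['A', 'B', 'C', 'D'],
    'child_mort': [130, 90, 116, 62],
    'gdpp': [330, 550, 410, 900],
    'income': [610, 1400, 900, 2000],
})

# Highest child mortality first, then lowest GDPP and income
top = poor.sort_values(by=['child_mort', 'gdpp', 'income'],
                       ascending=[False, True, True]).head(10)
print(top['country'].tolist())  # ['A', 'C', 'B', 'D']
```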

The top 10 list looks like this:

With this, we have some conclusions to represent the data and the suggestions to the CEO on the utilisation of funds.

Each of these pieces of code is simple and very easy to understand and execute in a Jupyter notebook. Do try it out yourself.

Wishing you a great time exploring the code and adding your own nuances to it.

Once again the data and the code is available in my git repo at

__https://github.com/saigeethamn/DataScience-HierarchicalClustering__

There was a mention of "Single Linkage" too. The concept of linkage arises when you have more than one point in a cluster and the distance between this cluster and the remaining points/clusters has to be figured out to see where they belong. **Linkage is a measure of the dissimilarity between clusters having multiple observations.**

The types of linkages that are typically used are:

Single Linkage

Complete Linkage

Average Linkage

Centroid Linkage

The type of linkage used determines the type of clusters formed and also the shape of the dendrogram.

Today we will look at what these linkages are and how they impact the clusters formed.

Single Linkage defines the distance between two clusters as the minimum distance between the members of the two clusters. If you calculate the pair-wise distance between every point in cluster 1 and every point in cluster 2, the smallest distance is taken as the distance between the clusters, or the dissimilarity measure.

This leads to the generation of very loose clusters which also means that the intra-cluster variance is very high. This does not give closely-knit clusters though this is used quite often.

If you take an example data set and plot the single linkage, most times you do not get a clear picture of clusters from the dendrogram.

From the plot here, you can see that the clusters don't seem so convincing though you can manage to create some clusters out of this. Only the orange cluster is quite far from the green cluster (as defined by the length of the blue line between them). Within the green cluster itself, you cannot find more clusters with a good distance between them. We will see if that is the same case using other linkages too.

As you can recollect, the greater the height (on the y-axis), the greater the distance between clusters. These heights are very high or very low between the points in the green cluster meaning that they are loosely together. Probably, they do not belong together at all.

*NOTE: A part of code used to get the above dendrogram:*
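Since the snippet image is not reproduced in this extract, here is a minimal sketch of how such a dendrogram is typically produced with scipy (the sample data is hypothetical):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt

# Hypothetical 2-D sample data
rng = np.random.default_rng(5)
X = rng.uniform(0, 10, (13, 2))

# Single linkage: cluster distance = minimum pairwise distance
Z = linkage(X, method="single", metric="euclidean")

plt.figure(figsize=(8, 4))
dendrogram(Z)
plt.ylabel("Dissimilarity (Euclidean distance)")
plt.savefig("single_linkage_dendrogram.png")
```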

Can we do better than this with other linkages? Let us see.

In Complete Linkage, the distance between two clusters is defined by the maximum distance between the members of the two clusters. This leads to the generation of stable and close-knit clusters.

With the same data set as above, the dendrogram obtained would be like this:

Here you can see that the clusters seem more coherent and clear. The orange and green clusters are well separated. Even within the green cluster, you can create further clusters in case you want to. For example, you can cut the dendrogram at 5 to create 2 clusters within the green.

Here you can also note that the height between points in a cluster is low and between two clusters is high implying that the intra-cluster variance is low and inter-cluster variance is high, which is what we ideally want.

*Note: Code used for the above dendrogram:*
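Again a sketch, since the snippet image is missing; only the `method` argument changes from the single-linkage version:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

rng = np.random.default_rng(5)
X = rng.uniform(0, 10, (13, 2))  # hypothetical sample data

# Complete linkage: cluster distance = maximum pairwise distance
Z = linkage(X, method="complete", metric="euclidean")
dendrogram(Z)
plt.savefig("complete_linkage_dendrogram.png")
```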

In Average Linkage, the distance between two clusters is the average of all distances between members of the two clusters, i.e. the distance of each point from every point in the other cluster is calculated and the average of all these distances is taken.

Using the same data set, an average linkage creates the clusters as per the dendrogram here.

Here again, you can note that the points within a suggested cluster have a very small height between them implying that they are closely knit and hence form a coherent cluster.

*Note: Code used for this dendrogram:*
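And a corresponding sketch for average linkage (sample data hypothetical, as before):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

rng = np.random.default_rng(5)
X = rng.uniform(0, 10, (13, 2))  # hypothetical sample data

# Average linkage: cluster distance = average of all pairwise distances
Z = linkage(X, method="average", metric="euclidean")
dendrogram(Z)
plt.savefig("average_linkage_dendrogram.png")
```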

One thing for sure is that the K value need not be pre-defined in hierarchical clustering. However, when you start building this tree (dendrogram), a point cannot move out of its cluster. It always stays in the cluster it already belongs to; more points can join it, but it cannot shift clusters. Hence this is a one-way, greedy method, which is sometimes a disadvantage.

Also, note that each linkage calculation is pair-wise between clusters and hence requires a huge number of calculations. The larger the data, the more RAM and compute power required.

Among the three types of linkages discussed above, the computational complexity is lowest for single linkage and very similar for average and complete linkage.

References:

Order of complexity for all linkages:

__https://nlp.stanford.edu/IR-book/completelink.html__

An online tool for trying different linkages with a small sample data set:

__https://people.revoledu.com/kardi/tutorial/Clustering/Online-Hierarchical-Clustering.html__

Single-link hierarchical clustering clearly explained:

__https://www.analyticsvidhya.com/blog/2021/06/single-link-hierarchical-clustering-clearly-explained/__

Today we shall delve deeper into **Hierarchical clustering**.

In K-Means, when we looked at some of the practical considerations in __Part 3__, we saw that we have to start with a value for K, i.e. the number of clusters we want to create out of the available data. This is an explicit decision the modeller has to make. We also saw some methods to arrive at this K, like silhouette analysis and the elbow curve. However, this is, in a way, forcing the data into a pre-determined set of clusters. This limitation is overcome in hierarchical clustering.

This is one distinct feature that makes it more advantageous than K-Means clustering in certain cases, though this becomes a very expensive proposition if the data is very large.

Now, if we do not have an upfront K value, we need some measure to help us decide whether a point creates a cluster on its own or is very similar to other points in an existing cluster and hence belongs there. This is what is called the **similarity or dissimilarity measure**. **Euclidean distance** is the most commonly used dissimilarity measure. If two points are very similar, their Euclidean distance is very small, which implies they are very close to each other. In other words, **points with a very small dissimilarity measure are close to each other**.

Let us start with a sample data set of 13 points, given here, and see how the clustering is done.

Let us plot a scatter plot of this data and see if there are any visible clusters. Since we have only two dimensions, we have the luxury of visual representation.

When we move to actual industry problems, we would be dealing with much higher dimension data and hence visually, we can only do exploratory analysis with pair-wise data to get a feel for the clusters, at the best.

Looking at the plot above, we do see two-three clusters at least. Let us find out through hierarchical clustering, how many clusters are suggested and how close they are to each other.

The first step is to assume that every point is its own cluster, as shown in the diagram on the "Starting Step". Since we have 13 data points, we start with 13 clusters.

Then, we calculate the euclidean distance between every point and every other point.

This will help us create the first cluster between the two nearest points.
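The distance computation can be sketched with scipy; here, 13 random points stand in for the sample data:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# 13 random 2-D points stand in for the sample data
rng = np.random.default_rng(6)
X = rng.uniform(0, 10, (13, 2))

D = squareform(pdist(X, metric="euclidean"))  # 13x13 pairwise distance matrix
np.fill_diagonal(D, np.inf)                   # ignore zero self-distances

# The nearest pair forms the first cluster
i, j = np.unravel_index(np.argmin(D), D.shape)
print(f"nearest pair: points {i} and {j}, distance {D[i, j]:.3f}")
```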

In this case, points 10 and 11 are the nearest and they form the first cluster as shown in iteration 1.

Now you treat these two points as one cluster and so you have 12 clusters at the end of the first iteration or at the beginning of the 2nd iteration. During the 2nd iteration, point 9 joins the first cluster containing points 10 and 11, as that is the next nearest, as shown in Iteration 2.

This process goes on iteratively and the clusters that are closest to each other keep getting fused till we get one large cluster containing all the points. This is called **Agglomerative Clustering or AGNES (Agglomerative Nesting).**

This iterative process leads to the formation of a tree-like structure called the **dendrogram**. You can see the dendrogram for the above data in the figure below:

The first step is indicated by the smallest orange line joining points 10 and 11. The second iteration is represented by the orange line that joins 9 into the same cluster. In the third iteration, points 1 and 2 form their own cluster. You go by the height to know which is the next data point that formed a cluster.

The **height of the dendrogram** is a measure of the dissimilarity between the clusters and in this case that is the euclidean distance between the points. Based on this you can see that the dissimilarity is very low between points 5 and 6, 0 and 8, 1 and 2 respectively and they quickly form their own clusters on each subsequent iteration. The height at which they join also is very small indicating that the euclidean distance is very small between them.

When you see long lines that create branches way below, you know that the dissimilarity measure or the euclidean distance is very high between them and hence they form very distinct clusters. In the above diagram, the blue line is clearly indicating the dissimilarity is very high between the orange and the green clusters.

But how do we calculate the distance between a cluster containing many points (more than one) and another data point outside the cluster? This leads us to the concept of linkage.

**Linkage is a measure of dissimilarity between clusters having multiple observations.**

In a multi-point cluster, the distance from every point in the cluster to the external point is calculated and, often, the minimum distance is taken. This way of calculating the distance is one type of linkage called **single linkage**. It is not the best in terms of cluster quality, but it is often chosen for its compute efficiency.

There are other types of linkage that we can see in a subsequent post.

Having now gone through the AGNES process, we finally have to decide how many clusters to create and what they are. How do we identify the clusters using the dendrogram?

You do a horizontal cut on the dendrogram and every group below becomes a distinct cluster. If you cut at a height of 7 along the Y-axis, you create two clusters as shown in orange and green.

If you cut at a height of 5, you get three clusters: data points 12-9-10-11 in one cluster, and 4-5-6-7 and 0-8-3-1-2 in the other two.

Libraries for hierarchical clustering allow you to cut the tree like this and come up with varying numbers of clusters.
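With scipy, the cut can be specified either by the number of clusters or by a height threshold; a sketch on hypothetical data:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cut_tree, fcluster

rng = np.random.default_rng(7)
X = rng.uniform(0, 10, (13, 2))  # hypothetical sample data
Z = linkage(X, method="single")

# Cut by a desired number of clusters ...
labels3 = cut_tree(Z, n_clusters=3).ravel()

# ... or by a height (dissimilarity) threshold on the dendrogram
labels_h = fcluster(Z, t=2.5, criterion="distance")

print(labels3)
print(labels_h)
```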

Hierarchical clustering is used when we do not want to decide the number of clusters upfront. It is computationally intensive due to the way linkages are established between the points to create the clusters.

There are two ways of hierarchical clustering, **AGNES or Agglomerative Nesting** and **DIANA or Divisive Analysis**, both of which have been discussed in the __Introduction to Clustering Algorithms__. While the former starts with each point as an independent cluster and iteratively merges until all the points are in one large cluster, the latter does the opposite.

In other words, bottom-up clustering is AGNES and top-down clustering is DIANA.

Most often, the dissimilarity measure used is Euclidean distance. Linkages are the way this is calculated for multi-point clusters. This, in summary, is all about Hierarchical Clustering.

Let us look at one practical problem and its solution.

An NGO that is committed to fighting poverty in backward countries by providing basic amenities and relief during disasters and natural calamities has got a round of funding. It needs to utilise this money strategically to have the maximum impact. So, we need to be able to choose the countries that are in direct need of aid based on socio-economic factors and health factors.

Let us use K-Means Clustering to create clusters and figure out the countries that are in greatest need as per the data provided.

You may find the data and the entire code in this git repo:

__https://github.com/saigeethamn/DataScience-Clustering__

If we want only the top 5 or top 10 countries that deserve aid, then we could think of a regression model. But we could also use clustering to find the cluster of the most needy countries. Once we get the clusters, we could further analyse within them and decide where the aid goes.

I could have done K-Means Clustering or Hierarchical Clustering. I will go with K-Means for now, as that is what we have covered in theory so far.

So, how and where do I start? I will be following the preliminary steps outlined in my previous post on "__Steps towards Data Science or Machine Learning Models__"

In this post, I will not explain the code for data analysis or preparation. I will just explain the bare minimum through plots and insights, as this code is pretty repetitive across analyses. However, for the K-Means part, I will walk through the code too.

I have to load the data and understand it first. From the shape, I know that I have data of 167 countries and I have 10 columns of data including the country name. A brief description of the data is here:

Note that exports, health and imports columns are given as % of GDPP and hence they need to be converted back to absolute values for further analysis. You can refer to the notebook in the git repo to understand how that is done.

When I do a null value check, I do not find any missing data. Hence there is no null value treatment required and no columns or rows to be dropped either. All the data is numerical, so no categorical encoding is required either.

Here, I should also do outlier analysis and treatment. However, I am interested in exploring the original data before I treat the outliers if any. Hence I move on to Step 3 consisting of EDA.

The main steps here are univariate and bivariate analysis. I plot the distplot for all the data as shown here:

Most are right-skewed, implying that a large number of countries are bunched at the lower end with a small number in the far-right cluster - this is the behaviour of these 6 features:

Child Mortality

Exports

Health

Imports

Income

Inflation

Life expectancy, total fertility, income and gdpp show there are visible clusters. For bivariate analysis, a heat map and pair plot are sufficient as all the data is continuous.

From this, I see that

There is a high positive correlation between GDPP and income, life expectancy, exports, health and imports.

There is a negative correlation of GDPP with Total Fertility, Child Mortality and inflation

Exports, imports and health are highly correlated

Health is negatively correlated with Total Fertility, Inflation and Child Mortality

There is a strong positive correlation between Total Fertility and Child Mortality

Also a positive correlation between income and life expectancy

Hence, we have a good chunk of correlated data that should help in creating clusters. A scatter plot also helps us see whether there are any visible clusters, and hence we do a pair plot like this one:

Having understood the basic data, we move to the next step of data preparation.

Since all the data is continuous data, we can look at box plots and see if there are outliers.

From this, we see that child mortality, exports, imports, income and inflation have outliers at the higher end and life expectancy at the lower end. We need to be watchful about capping the high-end values of data like inflation and child mortality and the low-end values of life expectancy, as the needy countries should not lose out on aid because of this.

However, it is safe to cap the higher-end values of income, exports, imports and gdpp. Hence I have chosen to cap the higher end at the 0.99 quantile and the lower end at the 0.1 quantile.

I am not capping 'health' as it has almost continuous values right up to the 100th percentile, and that itself could contribute to a cluster. Again, refer to the notebook in git to view the code for this.

Next, we scale the variables using a StandardScaler from sklearn.preprocessing library. Here, we do not split data into train and test as we are finding clusters across all of the data. It is not supervised learning and we do not test predictions against any target variable.
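A sketch of the scaling step; the frame here is a hypothetical three-country slice, and note that there is no train/test split:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical three-country slice of the numeric data
df = pd.DataFrame({"gdpp": [500, 45000, 1200],
                   "income": [1200, 52000, 3000],
                   "child_mort": [90.0, 4.0, 60.0]})

scaler = StandardScaler()
scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)

# Each column now has mean 0 and unit variance
print(scaled.round(3))
```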

In __part 3__ of the theory on K-Means, I have spoken about having to check for the cluster-tendency of the given data. So, now we run the Hopkins test to check if this data shows a cluster-tendency.

A basic explanation of the Hopkins statistic is available on __Wikipedia__ and a more detailed discussion is available __here__. It compares the data on hand with random, almost uniform data and tells you whether the given data is almost as uniform or shows a clustering tendency. For this data, as seen in the code, we get a value anywhere between 0.83 and 0.95, indicating that there is a possibility of finding clusters, and hence we go ahead with K-Means clustering.
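scikit-learn does not ship a Hopkins implementation, so the notebook would use a hand-rolled one; a minimal sketch (the sampling fraction and seed are my assumptions):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def hopkins(X, sample_frac=0.1, seed=42):
    """Hopkins statistic: ~0.5 for uniform data, close to 1 for clusterable data."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    n, d = X.shape
    m = max(1, int(sample_frac * n))

    nn = NearestNeighbors(n_neighbors=2).fit(X)

    # u: distance from m uniform random points to their nearest real data point
    uniform = rng.uniform(X.min(axis=0), X.max(axis=0), size=(m, d))
    u = nn.kneighbors(uniform, n_neighbors=1)[0].ravel()

    # w: distance from m sampled real points to their nearest *other* real point
    idx = rng.choice(n, size=m, replace=False)
    w = nn.kneighbors(X[idx], n_neighbors=2)[0][:, 1]

    return u.sum() / (u.sum() + w.sum())

# Two tight, well-separated blobs should score close to 1
demo = np.vstack([np.random.default_rng(0).normal(c, 0.1, (100, 2)) for c in (0, 5)])
print(round(hopkins(demo), 2))
```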

The first step in modelling is to figure out the correct K for our data, since we initialise the model with K. Again, as mentioned in __part 3__, this is done using the elbow method or silhouette analysis.

First, let's see the code for KMeans clustering with a random k. The code for clustering itself is literally 2 lines.

We have to import the KMeans library from sklearn.

If we choose to go with any arbitrary number for K and create the cluster, here's how the code would look:
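The code image is not reproduced in this extract, so here is a minimal sketch of what those lines typically look like. The array X is a random stand-in for the scaled country data, and n_init and random_state are extras I have added for reproducibility:

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the scaled country data (167 countries, 9 features)
rng = np.random.default_rng(42)
X = rng.normal(size=(167, 9))

# k chosen arbitrarily as 4; stop after at most 50 iterations
kmeans = KMeans(n_clusters=4, max_iter=50, n_init=10, random_state=42)
kmeans.fit(X)

print(kmeans.labels_[:10])  # cluster assignments of the first 10 rows
```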

We instantiate an object of the KMeans class as kmeans. There are 2 args we pass: one is k, i.e. the number of clusters we want to create; here it has been arbitrarily chosen as 4. The second is the maximum number of iterations the algorithm goes through. Recollect that the two steps of calculating distances and reassigning points to centroids happen iteratively, and they may not always converge. In such a case, max_iter makes the algorithm stop after 50 iterations and return the clusters formed at the last iteration. There are a lot more arguments that you can look up in the help and understand, but this is the bare minimum for invoking the KMeans algorithm. Then you just take this kmeans instance and fit it against the scaled country data, and four clusters are formed. It is as simple as that!!

However, deciding the value of K is a very important aspect and let us see how we decide the optimal number of clusters.

We create multiple sets of clusters, starting with k=2 and going on with 3, 4, 5 and so on. When adding more clusters is no longer beneficial, we stop at that point. So we run K-Means clustering with K = 2 to 10. Here's how the code looks.
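A sketch of that loop, run here on synthetic data with three obvious blobs (the real notebook runs it on the scaled country data; the list name ssd follows the post):

```python
import numpy as np
from sklearn.cluster import KMeans
import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt

# Three well-separated synthetic blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.5, (50, 2)) for c in (0, 5, 10)])

ssd = []  # sum of squared distances (inertia_) for each k
ks = range(2, 11)
for k in ks:
    km = KMeans(n_clusters=k, max_iter=50, n_init=10, random_state=42).fit(X)
    ssd.append(km.inertia_)

plt.plot(list(ks), ssd, marker="o")
plt.xlabel("K")
plt.ylabel("inertia_ (sum of squared distances)")
plt.savefig("elbow_curve.png")
```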

K-Means algorithm used is from the library sklearn.cluster

And the plot we get is:

Now, let us understand what we are doing in the code.

Here we call the fit method on KMeans for each k value ranging from 2 to 10 and create the model. And then we use an attribute from the model to understand which value for K gives good clusters.

The KMeans algo has an attribute called inertia_, which you can see in the __sklearn documentation__ or by executing the command help(KMeans) in your Jupyter notebook. inertia_ is defined as the "sum of squared distances of samples to their closest cluster centre". So, if you have 3 cluster centres and each point is associated with one of them, then the sum of the squared distances of all the points to their respective centres is given by inertia_. In fact, this is the cost function that we want to minimise, as discussed in __part 2__ of my series on KMeans theory.

So, we capture this for every k value in the range - in a list variable called ssd:

And the next set of statements plots the value of inertia_ against the k value. Wherever we get a significant dip in inertia_, we take that as the k value of choice; beyond a particular k, inertia_ does not show any significant improvement.

So, we see that there is a sharp dip in ssd from K=2 to K=3. Then the rate of fall slows down from K=4. It further slows down with higher Ks. Because of the shape of the curve at K=3, it is called an elbow curve. Given this insight, we could choose K as 3.

Now let us look at Silhouette Score too.

Broadly speaking, it is a measure of the goodness of the clusters created. We understood in __Part 1__ of the series on KMeans that we want to maximise the inter-cluster distance and minimise the intra-cluster distance. This is what is encapsulated in the silhouette score.

In other words, a metric that measures the cohesiveness of a cluster and the dissimilarity between clusters is called the silhouette score.

It is represented as follows:

where

*p is the average distance to the points in the nearest cluster that the data point is not part of.*

*q is the average intra-cluster distance to all the points in its own cluster.*
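The formula image is not reproduced here; combining p and q as defined above, the standard silhouette score of a single data point is:

```latex
s = \frac{p - q}{\max(p,\, q)}
```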

Let us understand the intuition behind this. To maximise the inter-cluster distance, p should be as large as possible, and to minimise the intra-cluster distance, q should be as small as possible. If q is very small, the ratio is almost p/p and hence close to 1. If q is very large, the ratio is almost -q/q and hence close to -1.

Therefore, the silhouette score combines the two (p and q) and ranges from -1 to 1. A score closer to 1 indicates that the data point is very similar to other data points in the cluster and a score closer to -1 indicates that the data points are not similar to other data points in its cluster.

This is calculated for every point and for every K. Then the same is plotted on a graph for every k.

So, whichever K has the maximum silhouette score is the one with the best inter-cluster heterogeneity and intra-cluster homogeneity. The silhouette scores seem to be very similar for k = 3, 4 or 5.

Here is the code that is written to calculate and plot the silhouette score
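A sketch of that calculation on synthetic blob data (the post runs it on the scaled country data; the list name ss follows the text):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

# Three well-separated synthetic blobs
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(c, 0.5, (50, 2)) for c in (0, 5, 10)])

ss = []  # silhouette score for each k
ks = range(2, 11)
for k in ks:
    km = KMeans(n_clusters=k, max_iter=50, n_init=10, random_state=42).fit(X)
    ss.append(silhouette_score(X, km.labels_))

plt.plot(list(ks), ss, marker="o")
plt.xlabel("K")
plt.ylabel("silhouette score")
plt.savefig("silhouette_scores.png")
```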

Here we use silhouette_score from sklearn's metrics library. We create the KMeans clusters for each K in the range 2 to 10. To silhouette_score, we pass the scaled country data and the labels returned by KMeans, which are used to calculate the intra- and inter-cluster distance averages for every point, in this line:

Finally, we gather the score against each k in the list named ss[] to help in plotting the graph.

Based on both of these tests, it looks like 3 is the right number of clusters. So, we will go ahead with this value of K and create 3 clusters.

Then we do the cluster analysis to see what direction or insight we get out of it. This cluster profiling or analysis can help us finally say which countries are in the direst need of aid.

Let's now start with understanding how many countries are in each cluster:
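That step is typically a one-liner on the model's labels; a sketch with synthetic data (the counts quoted next come from the actual country data):

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = rng.normal(size=(167, 3))  # stand-in for the scaled country data

km = KMeans(n_clusters=3, max_iter=50, n_init=10, random_state=42).fit(X)

df = pd.DataFrame(X, columns=["gdpp", "income", "child_mort"])
df["cluster_id"] = km.labels_           # attach the cluster label to each country
print(df["cluster_id"].value_counts())  # countries per cluster
```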

From the above code, you can see that there are 90 in cluster 2, 29 in cluster 1 and 48 in cluster 0.

Let us now plot a scatter plot with the 3 most important variables: income, GDPP and Child Mortality, for each cluster:

You can see the countries represented in red dots have low GDPP, low income and high child mortality. They would be the countries that would best benefit from aid.

We can plot a bar graph and box plots to understand whether these clusters are truly distinct in their characteristics.

The bar graph shows that the gdpp and income are quite different for the 3 clusters. The box plots show how the median of gdpp and income is very low for cluster 0 while the child mortality is very high. This makes the profile of the countries very clear.

Since we want only the top 10 countries, we can sort by gdpp, child mortality and income in ascending, descending and ascending order respectively, and take the top 10 countries for providing aid:
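A sketch of that sort with a tiny hypothetical frame (the column names are assumptions; the real data has one row per country):

```python
import pandas as pd

# Hypothetical slice of cluster-0 countries
df = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "gdpp": [300, 250, 400, 350],
    "child_mort": [120, 130, 90, 100],
    "income": [700, 650, 900, 800],
})

# gdpp ascending, child mortality descending, income ascending
top = df.sort_values(by=["gdpp", "child_mort", "income"],
                     ascending=[True, False, True]).head(10)
print(top["country"].tolist())
```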

If the budget allowed, the entire list of 48 countries in this cluster could be considered, as each is only slightly better off than the next. However, with budget constraints, the top 5 to 10 needy nations could be considered so that the impact is at least felt and a difference is made to the people receiving the aid.

Finding K for K-Means is an important pre-step for clustering. There are multiple methods to find the appropriate K. Cluster Profiling helps us derive more insights. However, data understanding, data preparation and transformation before clustering are also important steps that cannot be overlooked.

Developing a model is mostly the easiest step. However, again deriving insights from the model to get actionable results requires a deep enough understanding of the problem on hand and the implications at the ground level.

The entire code for this is available in the git repo whose link is given above, with more details in comments and explanations at each step. This is a fairly simple problem that was addressed through KMeans clustering.

Hope this was a useful code walkthrough with a significant example.

So, today, instead of going into K-Means modelling, I thought, why not look at the steps that are necessary (though perhaps not sufficient) before modelling of any sort.

Many want to learn Data Science and Machine Learning. And there is enough and more material available on the internet to learn. And sometimes that becomes the problem.

Libraries out there like __scikit-learn__ make machine learning look like child's play until you start solving real-world problems. It typically consists of 2 steps - fit and predict. The fit() method fits against the available data creating a model. Then you use that model to predict against new or unseen data. Doesn't that look so simple?

In fact, here is a snippet of the code from the scikit-learn library which shows the simplicity of the exercise and the coding involved:

Four lines of code, of which one line creates the model and one line uses the model to predict. So, is data science really that simple and easy?
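The snippet itself is not reproduced in this extract, but a typical four-line example of the fit/predict pattern would be the following (LinearRegression is chosen just for illustration; any estimator works the same way):

```python
from sklearn.linear_model import LinearRegression

X, y = [[0], [1], [2]], [0, 1, 2]     # toy training data
model = LinearRegression().fit(X, y)  # one line fits (creates) the model
print(model.predict([[4]]))           # one line predicts on unseen data
```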

Much of the work is all before the model creation itself.

So, today I would like to list out a set of basic steps that have to be done before you get into modelling. This is just an indicative set of steps, nowhere exhaustive, but can serve as a starting point for modelling many algorithms that are fundamental to data science.

A bird's-eye view of the steps involved is provided in this mindmap:

The first step is to explore your data and understand it. Then you clean it and do some basic transformations. After this, you are in a position to do a detailed Exploratory Data Analysis that gives you deep insights into your data. And finally, you prepare the data as expected by the algorithm.

This involves understanding the size, shape, data type, column names, the multiple sources of data. Here we also take a look at which data is categorical in nature and which is continuous.

You could even get some basic statistics like the minimum, the maximum, the average, the 75th percentile data, to get a feel for the spread in the numerical data.

First, you check for null values and see if you can treat them meaningfully. Else you drop that data as it could create problems later on.

This means you drop columns that have a high percentage of null values. For other columns with nulls, based on the data and the meaning of the column, you can impute using various mechanisms, the simplest of them being to impute with 0, the mean or the median. There are advanced imputation techniques too. Sometimes it may be good to leave values unimputed, as you do not intend to skew the data in favour of or against a value. You may choose to drop the specific rows instead.
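A minimal sketch of those simple imputations with pandas (the frame is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with missing values
df = pd.DataFrame({"age": [25, np.nan, 40, 35],
                   "city": ["X", "Y", None, "X"]})

# Numeric column: impute with the median; categorical: with the mode
df["age"] = df["age"].fillna(df["age"].median())
df["city"] = df["city"].fillna(df["city"].mode()[0])

print(df)
```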

Coming to the transformations at this stage, it is to ensure your data is transformed to allow for a meaningful exploratory data analysis.

Firstly, you can plot graphs and check for outliers. If there are any, treat them as detailed in my article on __treating outliers__.

You may choose to create new variables through binning or through derivations from existing variables.

You can also transform categorical variables into numerical ones through techniques like one-hot encoding or label encoding.
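A minimal sketch of both techniques with pandas (the safety column is a made-up example in the spirit of the car data mentioned earlier in this series):

```python
import pandas as pd

df = pd.DataFrame({"safety": ["low", "high", "med", "low"]})

# One-hot encoding: one 0/1 column per category
onehot = pd.get_dummies(df["safety"], prefix="safety")

# Label encoding: one integer code per category (codes follow sorted order)
df["safety_code"] = df["safety"].astype("category").cat.codes

print(onehot.columns.tolist())
print(df["safety_code"].tolist())
```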

Now you are ready to start Exploratory Data Analysis

This is a very important step. Here is when you really get more familiar with your data through data visualization and analysis. You can see patterns and correlations between the predictors and the target variable or between the predictors themselves.

Broadly these are the steps involved in EDA.

Univariate Analysis

Bivariate Analysis

Understanding correlation between data

Plotting and visualising the data for any of the above steps

Checking for imbalance in data

You do univariate analysis for categorical variables using bar charts and continuous variables using histograms as shown here:

You could even draw box plots to understand the spread of the data from a different view.

Then you do a Bivariate analysis that analyses the relationship between two variables. It can include heatmaps for correlation analysis, pair plots for continuous-continuous data relationships, box plots for categorical-continuous relationships and bar plots for categorical-categorical relationships as shown here.

You also check for imbalance in the target data so that you can model it through the correct means. At every stage, you draw some inferences about the data on hand.

You can see all of the above steps in detail, through data and code, in this GitHub repo on __Exploratory Data Analysis__. Sometimes the above steps may be slightly iterative.

Next, you start preparing the data for the model

If you have multi-level categorical variables, you create dummy variables to make them numerical columns.

You also need to scale numerical features. Typically, you scale features after splitting your data into train and test data so that there is no leakage of information from the test data into the scaling process. The various __scaling techniques__ have already been discussed.
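The leakage-free ordering looks like this in sklearn: fit the scaler on the training split only, then reuse its statistics for the test split (the data here is synthetic):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic feature matrix and binary target
rng = np.random.default_rng(3)
X = rng.normal(10, 3, size=(100, 4))
y = rng.integers(0, 2, size=100)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)  # statistics learned from train only
X_test_s = scaler.transform(X_test)        # same statistics reused: no leakage

print(X_train_s.shape, X_test_s.shape)
```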

If you have a huge number of features, you could also go through feature selection through various techniques discussed earlier in __Feature Selection Techniques __including Recursive Feature Elimination, Manual elimination, Balanced approach and even Principal Component Analysis that I have not discussed so far.

Now that you have selected the features of importance, split the data into train and test sets and further scaled the data, you are finally ready to get into modelling.

Without all of these steps, and maybe more, modelling would be a case of garbage in, garbage out. You may get very unstable models or actually run into errors thrown up by libraries as certain assumptions are violated.

The above process is often iterative in nature, improving the data and your knowledge and insights from it as you go. It is only after this that you can start using the modelling techniques provided by various libraries and derive their benefits.

It is often said that a data scientist spends a majority of his/her time on these steps rather than on modelling itself. Without knowing the above, it would be futile to just learn modelling using libraries.

So far, we have just seen the basics of the K-Means algorithm. It certainly helps in the unsupervised clustering of data. However, we must realise that not all of this is completely foolproof.

There are a lot of decisions and conscious considerations to come up with useful clusters. Let us look at some of the important ones.

We have been saying that we will start with "K" clusters. But how do we decide what is the right value for K? Should we create 2, 3, 5, 10 clusters? What is the right number for K?

The first thing that comes to mind is to look at scatter plots like the one here - it will be obvious how many clusters exist! Won't it?

Yes, when we have a plot like the one here and we see the intuitive number of clusters, it is very easy to say that the data has 3 clusters and so K should be 3. In reality, however, data is rarely just 2-dimensional or even 3-dimensional. The moment we move to data with many dimensions, visual representation is not always possible and we have to have other means to decide the number of clusters we want to create.

So, what we are saying here is that **we want to find out the natural number of clusters** that already exist in my data without visually being able to see it. That is one criterion for sure, to decide on K.

There are many methods that can aid you here, but two I would like to mention: **silhouette analysis** and the **elbow method**. They help in coming up with the right number of clusters using a quantitative method, indicating the natural or intuitive number of clusters that exist in the data.

However, apart from the above-mentioned quantitative methods, business or domain knowledge also has to be used to decide K, the number of clusters. Even if silhouette analysis says 3 clusters exist, you might believe that it makes sense to have 4 clusters based on your business experience and knowledge. So, you could go with 4 and see if you get the benefits of clustering from that.

Again, we have so far randomly chosen the initial cluster centres. If we choose the initial centroids randomly, we should be aware that we may not always end up with the same set of clusters!! Don't believe me?

Let us take an example data set and a random set of starting centroids. I chose completely different centroids to start with for the below data and the clusters I obtained were very different each time. The pink and blue are the clusters obtained each time I started with different random centroids.

*This was obtained using **https://www.naftaliharris.com/blog/visualizing-k-means-clustering/,** which provides a very good simulation for the k-means algorithm with various sample data.*

So you see, in certain types of data, the initial centres can have an impact on the clusters formed later. The clusters can keep varying. Hence, the initial centroids have to be chosen wisely.

So, what criterion or intelligence should be used to decide the right set of centroids to begin with? One of the standard ways of choosing the initial centroids is an algorithm called the **K-Means++ algorithm** (which we can perhaps address in a later blog).

At a very high level, what this algorithm does is help you pick the farthest points possible as the initial centroids, again through distance calculations. This can be quite an intensive process if the data set is large.
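A minimal sketch of the K-Means++ seeding idea in pure NumPy (this conveys the flavour of the algorithm, not scikit-learn's exact implementation; the data and names are illustrative):

```python
import numpy as np

def kmeans_pp_init(X, k, rng):
    """Sketch of K-Means++ seeding: the first centre is chosen uniformly at
    random; each subsequent centre is sampled with probability proportional
    to its squared distance from the nearest centre already chosen, which
    tends to pick centres that are far apart from one another."""
    centres = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # squared distance of every point to its nearest chosen centre
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centres], axis=0)
        centres.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centres)

rng = np.random.default_rng(0)
# two illustrative blobs, far apart
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(10, 1, (50, 2))])
centres = kmeans_pp_init(X, 2, rng)
print(centres.shape)  # (2, 2): two seed points in 2-D
```

In scikit-learn, `KMeans(init="k-means++")` (the default) applies this style of seeding for you.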

From the above example scatter plot, you would have wondered if there are any natural clusters at all in the given data. In fact, that is uniformly spread, random data and hence is not suitable for forming clusters. So, before we jump into clustering any data, we have to check for what is called the "**cluster tendency**" of the data. If the data shows no cluster tendency, there is no point in trying to cluster it; it would be a futile effort. So, how do you check the **cluster tendency**?

The cluster tendency is given by a statistic known as the "**Hopkins statistic**". If the value of the Hopkins statistic is close to 1, the data is said to be clusterable; if it is around 0.5, the data is uniformly distributed, as in the above example, with no cluster tendency. If the statistic is close to 0, the data is in a grid format, again with no possibility of meaningful clusters. Hence you would look for a Hopkins statistic close to 1 before you embark on clustering.
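One way to sketch this computation (a simplified, illustrative version of the statistic; the function and the sample data are mine):

```python
import numpy as np

def hopkins(X, m, rng):
    """Rough sketch of the Hopkins statistic: compare nearest-neighbour
    distances of m points sampled uniformly over the data's bounding box (u)
    with those of m real data points (w). H = u / (u + w) is close to 1 for
    clusterable data and around 0.5 for uniformly spread data."""
    n, d = X.shape
    uniform = rng.uniform(X.min(axis=0), X.max(axis=0), (m, d))
    sample = X[rng.choice(n, m, replace=False)]
    u = sum(np.sqrt(((X - p) ** 2).sum(axis=1)).min() for p in uniform)
    # for real points, skip the zero distance to the point itself
    w = sum(np.sort(np.sqrt(((X - p) ** 2).sum(axis=1)))[1] for p in sample)
    return u / (u + w)

rng = np.random.default_rng(1)
# two tight, well-separated blobs: clearly clusterable data
clustered = np.vstack([rng.normal(0, 0.3, (100, 2)),
                       rng.normal(5, 0.3, (100, 2))])
print(hopkins(clustered, m=20, rng=rng))  # well above 0.5
```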

For most algorithms, we know outliers play havoc unless treated. So it is in the case of K-Means clustering too. We must recollect that K-Means is heavily dependent on the calculation of means. And an average or mean is always impacted by outliers.

If there are outliers, they tend to pull the cluster away from its natural centre and sometimes bring along points that naturally belong to another cluster. This cannot be understood till we treat or remove the outliers. The farther the outlier, the more impact it has on the homogeneity of the cluster and hence on the formation of the right clusters.

Take a look at Figures 1 and 2. Figure 1 has no outliers. Figure 2 has one outlier at (60,2). You can notice that the natural clusters are so well-formed in the first case while the clusters are totally skewed in the second case. All of the data looks like 1 cluster and the outlier on its own as another, though it takes 2 more data points with it to form the 2nd cluster. Clearly, the intra-cluster homogeneity is very low here and the inter-cluster heterogeneity is also low, defeating the aim of clustering itself.

Hence it is quite imperative to treat outliers before clustering of data is undertaken. The various methods that can be employed to treat outliers are already discussed in an earlier blog of mine on the __Treatment of Outliers__.

Data on different scales is another aspect that impacts clustering considerably. Recollect that this algorithm works on distances. If some data is on a very large scale and some of the other data is on a much smaller scale, the distance calculation would be dominated by the dimension that is on a large scale and again impacts the cluster formation.

Though I am not getting into examples here, it is best to standardise all the data to a common scale. It means to have a 0 mean and 1 standard deviation to ensure that equal weightage is given to all variables in the cluster formation. Their spread and other characteristics remain the same but are scaled to a comparable scale. You can look up the various methods of feature scaling, as it is called in my earlier blog on the __importance and methods employable for feature scaling__.
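A minimal sketch of z-score standardisation with invented numbers (scikit-learn's `StandardScaler` does the same thing):

```python
import numpy as np

# Illustrative data: two features on wildly different scales, e.g. annual
# spend vs number of orders. Distances would be dominated by the first
# column until we standardise (z-score) both.
X = np.array([[250000.0, 3.0],
              [400000.0, 12.0],
              [310000.0, 7.0]])

X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
print(X_scaled.mean(axis=0).round(10))  # each column now has mean 0...
print(X_scaled.std(axis=0))             # ...and standard deviation 1
```

After scaling, both features contribute comparably to every distance calculation, so neither dominates the cluster formation.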

The K-Means algorithm itself works on the concept of distance calculations and hence does not work with categorical variables. You might have to look at algorithms like the **K-Modes** algorithm for such data.

K-Means works well only with spherical clusters and not with non-spherical ones. For example, if there are two cylindrical groups of data points parallel to each other, the clusters would be ill-formed, because the algorithm is based on distance minimisation.

It does not have the ability to learn by itself what the right K value needs to be. This has to be given as input to it.

It will go ahead and cluster any type of data even when there are no natural clusters present. For example, even when we gave it uniformly distributed data, it went ahead and created 2 clusters in the example under Consideration 2 above.

K-Means also gives more weightage to bigger clusters compared to smaller ones.

K-Means is an easy-to-understand algorithm and even easy to implement, as long as all the practical considerations discussed are kept in mind. It has two main steps of assignment and optimisation that are iteratively applied to the given data points till the centroids do not change anymore. A very simple and yet very powerful and useful algorithm in the unsupervised category.

We also looked at the mathematical representation of each of these steps and the cost function that we are minimising here. This should give a good theoretical understanding of the K-Means algorithm.

Also, it has a few drawbacks that need to be kept in mind before using it for clustering. I hope to take you through an example with code in some later posts.


We know that in Machine Learning we need to understand the cost function that an algorithm is trying to work with, so that we can either minimise or maximise it and, further, automate that process.

In this part 2, we would look at the mathematical representations of the cost function and the two steps of assignment and optimisation.

Straight away, I would start with the cost function, represented by **J**, and then explain it. The cost function is:

J = sum over k = 1..K of sum over the points i in cluster k of ‖x-i − mu-k‖²

where K represents the number of clusters and "k" is the cluster number to which the data point "i" belongs; k ranges from 1 to K. Therefore, for every "i" in cluster "k", the squared Euclidean distance between the ith data point and the centroid (represented by the mean mu-k) is taken and summed over all the points in that cluster. This is repeated for every one of the K clusters and the summation is done over all of them.

Therefore, this is represented as two summations. One summation is the sum of all distances of points within one cluster from the centre. The next level summation is the sum of all the distances across all clusters.

Hence, **J** stands for the total squared error across all the clusters w.r.t. their own centroids, and that is what we want to minimize. If the clusters are very tight and good, the overall cost function value will be very low, and that is what we are aiming for. In layman's terms, it means that all the data points of a cluster are very close to the centroid.
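The cost J can be computed directly. A small NumPy sketch with toy data of my own (the points, labels and centroids are invented for illustration):

```python
import numpy as np

def kmeans_cost(X, labels, centroids):
    # J = sum over clusters k of the squared Euclidean distances
    # ||x_i - mu_k||^2 of every point i in cluster k from its centroid mu_k
    return sum(((X[labels == k] - mu) ** 2).sum()
               for k, mu in enumerate(centroids))

X = np.array([[1.0, 1.0], [2.0, 1.0], [9.0, 9.0], [10.0, 9.0]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([[1.5, 1.0], [9.5, 9.0]])
print(kmeans_cost(X, labels, centroids))  # 0.25 + 0.25 + 0.25 + 0.25 = 1.0
```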

Let us look at the similar mathematical representations of the 2 important steps: the **assignment** step and the **optimization** step.

This is the step where the distance between the centroid and every data point is calculated. As a function of a data point x-i and a cluster mean mu-k, this distance is given by the squared Euclidean distance ‖x-i − mu-k‖².

This distance is calculated K times for each point if there are K clusters and hence K cluster centroids. So, if we have 10 data points and 3 cluster centroids, we start with data point 1 and calculate 3 distances of point 1 from the 3 cluster centres. And the minimum distance among them is taken and the point 'i' (which is 1 in this case) is assigned to that cluster, say cluster 2. Likewise, we take data point 2 and then calculate the 3 distances from the 3 centroid points and assign data point 2 to the cluster which has a minimum distance from this data point.

So, what we are trying to do here is calculate the minimum (argmin) of the distance of every point from every cluster centre, shown as c-i = argmin over k of ‖x-i − mu-k‖².

This is repeated for every data point and the data point is assigned to the cluster whose centre it has a minimum distance with. In the above example, we would have calculated 10 points x 3 centroids = 30 distances before we have assigned all the 10 points to 3 clusters.
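The assignment step can be sketched in a few lines of NumPy (toy data of my own, with 4 points and 2 centroids rather than the 10 and 3 of the example):

```python
import numpy as np

# Assignment step sketch: for each point, compute its squared distance to
# every centroid and assign it to the nearest one (argmin over k).
X = np.array([[1.0, 2.0], [2.0, 1.0], [8.0, 9.0], [9.0, 8.0]])
centroids = np.array([[1.0, 1.0], [9.0, 9.0]])

# shape (n_points, n_centroids): one distance per point-centroid pair
d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
labels = d2.argmin(axis=1)
print(labels)  # [0 0 1 1]
```

With 4 points and 2 centroids, 4 × 2 = 8 distances are computed, just as the example above computes 10 × 3 = 30.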

Now, having created the clusters based on the distance calculations, you recompute the cluster centre mu-k for each of the K clusters. You calculate the mean, i.e. the cluster centre, as shown here: mu-k = (1/n-k) × (sum of x-i over all points i in cluster k)

where n-k is the number of data points that belong to the kth cluster and x-i are the points that belong to it. The mean is taken only over those data points that belong to that cluster. So, if cluster 1 had 5 points, the mean of all these 5 points is taken to come up with the new centroid. Note that the mean is taken for each dimension of the data point. If the data point is represented by x, y, z, then the mean is individually taken for x, y and z respectively to calculate the new centroid, represented by its own x, y, z values.

This formula must look very familiar. It is nothing but the formula for the mean.

This is repeatedly done for all data points of every cluster to get the new cluster centre of each cluster. In the example taken above, the means are found for all the 3 clusters and 3 new cluster centres are calculated.
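The optimisation step is just a per-dimension mean over each cluster's members. A NumPy sketch with toy data of my own:

```python
import numpy as np

# Optimisation step sketch: the new centroid of each cluster is the
# per-dimension mean of its member points.
X = np.array([[1.0, 2.0], [2.0, 1.0], [8.0, 9.0], [9.0, 8.0]])
labels = np.array([0, 0, 1, 1])

new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(2)])
print(new_centroids)  # [[1.5 1.5] [8.5 8.5]]
```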

**Stopping Point:**

The above two steps are repeated till we reach a point where the centroids do not move any further. That determines our stopping point of the algorithm and the clusters so formed are the most stable clusters.
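The assignment step, the optimisation step and the stopping point can be put together in a compact NumPy sketch (the data and initial centroids here are toy values of my own, not the Part 1 data points):

```python
import numpy as np

def kmeans(X, init_centroids, max_iter=100, tol=1e-6):
    """Repeat assignment + optimisation until the centroids stop moving."""
    centroids = np.asarray(init_centroids, dtype=float)
    for _ in range(max_iter):
        # assignment: nearest centroid for each point
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # optimisation: new centroid = mean of each cluster's members
        new = np.array([X[labels == k].mean(axis=0) if (labels == k).any()
                        else centroids[k] for k in range(len(centroids))])
        if np.linalg.norm(new - centroids) < tol:  # stopping point
            break
        centroids = new
    return centroids, labels

X = np.array([[1.0, 1], [2, 1], [1, 2], [9, 9], [10, 9], [9, 10]])
centroids, labels = kmeans(X, [[0.0, 0], [5, 5]])
print(centroids)  # the two cluster means of the two obvious groups
```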

These formulae can be used in an Excel sheet, to begin with, along with the data points shared in Part 1 of this article, and you can try it out on your own.

In Part 2, we have understood the mathematics that is used in K-Means. Nothing too difficult. However, there are quite a few practical considerations that impact the usefulness of the K-Means clustering. Each of those will be talked about in Part 3 of this series.

We also need to understand its limitations, how to overcome them, and its advantages and disadvantages.

I plan to explain the basics of K-Means clustering in a 3 part series. The first part, that is, this post, will take an example of very few data points and show how the clusters are formed.

The 2nd part to follow will talk about the cost function that is minimised in the K-Means algorithm with simple mathematical representations for the various steps.

The final 3rd part will talk about the practical considerations for the K-Means algorithm.

In K-Means, the similarities between data points are decided based on distances between the points and a centroid, hence it is called a **centroid-based clustering** algorithm. If we want K clusters, we start with K randomly chosen centroids. Then, the distance of each point from these centroids is calculated and the points are associated with the centroid they are nearest to.

To recollect, the **centroid** of a set of points is another point having its own x and y coordinates that is the geometric centre of all the given points. This is calculated by taking the mean of all 'x' points to give the 'x' of the centroid and similarly average of all the 'y' points to get the 'y' point for the centroid. This is true, assuming that the set of given points have only two dimensions x,y. The same can be extended to n dimensions.

K-Means is an iterative algorithm that keeps creating new clusters with some adjustments till it finds a stable set of clusters and the points do not move from one to another.

The steps of this algorithm can be detailed as follows:

1. We start with 'K' random points as initial cluster centres.

2. Then, each point in the data set is assigned to one of these centres based on the minimum distance to the centre (most often the __Euclidean distance__).

3. Once a cluster is formed, a new centroid is calculated for the cluster, i.e. the mean of all the cluster members.

4. Then, distances are again calculated for all the data points to the new centroids, and re-assignment happens based on minimum distance.

5. Steps 3 and 4 are repeated till the points do not move any further from one cluster to another, nor do the centroids move much.

Let us take an example set of data points and see how this is done.

__Step 1__:

For the above data set, we take 2 centroid points at (10,8) and (18,6) randomly, to begin with, keeping in mind the range of the data points. These are represented by the yellow points in Figure 1.

**Step 2:**

Based on the Euclidean distance from all the points to these 2 centroids, two clusters have been formed in red and green. 7 points in the red cluster and 4 points in the green cluster. This is called the **assignment step**.

The formula for the Euclidean distance between two points (x1, y1) and (x2, y2) is given by d = √((x2 − x1)² + (y2 − y1)²).
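For instance, the distance between the two starting centroids (10, 8) and (18, 6) themselves can be computed as:

```python
import math

# Euclidean distance as used in the assignment step.
def euclidean(p, q):
    return math.sqrt((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)

print(euclidean((10, 8), (18, 6)))  # sqrt(8^2 + 2^2) = sqrt(68) ≈ 8.246
```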

**Step 3:**

Once these clusters were formed, we realise that the current centroids are not truly the geometric centres of their respective clusters. So, we find the new centroid by taking the mean of all the x and y values of the 7 points in cluster 1. That value turns out to be (6.7, 6.1), as shown in Figure 2.

Similarly, the new centroid is calculated for cluster 2 which shifts from (18,6) to (17.5,4.5), for the green points. This is called the **optimization step**.

**Step 4:**

Since the centroids have shifted, it is now worth recalculating the distances of all the points with respect to the new centroids. So, we calculate the Euclidean distance of all the points from the new centroids again. We see that one of the earlier red points is now closer to centroid 2, the green cluster's centroid, and hence has been reassigned at the end of this cycle to cluster 2.

We repeat steps 3 and 4 (the assignment and the optimisation steps) and we get the plot as shown in Figure 3.

Here, centroid 1 has slightly shifted from (6.7, 6.1) to (5.6, 5.4) and centroid 2 has shifted from (17.5, 4.5) to (16.6, 5.6).

However, on recalculating the distances of all points with the new centroids, no points have moved from one cluster to another.

We repeat the process one more time to see if the centroids move or the clusters change.

And voila, neither of them change as shown in Figure 4. It means that we have reached an equilibrium stage and that the clusters are stable.

This is how K-Means clustering works!

Just clustering around a centroid point through calculation of Euclidean distances of all data points to the centroid, and readjusting the centroid till it moves no further. Looks very simple, doesn't it? Very easily automatable, and it can be represented mathematically too.

K-means clustering is an unsupervised machine learning model that essentially works through 5 steps, of which the two steps of "**assignment**" of points to clusters and "**optimization**" of the cluster through calculating a new centroid are iteratively repeated till we reach stable clusters.

The stability of clusters is defined by the points not jumping across clusters and the centroids not changing significantly in subsequent iterations. The core idea in clustering is to ensure **intra-cluster homogeneity** and **inter-cluster heterogeneity** to arrive at meaningful clusters. This has already been explained in my blog on "__Introduction to Clustering__".

In the next post, I will take you through the cost function for K-Means and a few mathematical formulae explaining the two important iterative steps. Till then, see you :)

This is an unsupervised learning technique where there is no notion of labelled output data or target data. An unsupervised method is a machine learning technique in which the models are not trained on a labelled dataset but discover hidden patterns without the need for human intervention.

A few unsupervised learning techniques apart from Clustering are Association and Dimensionality reduction.

Let us look at a few examples in order to understand clustering better.

If you are given a news feed from various news channels or portals and if you had to categorise them as politics, sports, financial markets etc. without knowing upfront, what categories exist, then, this is a typical clustering application. It may turn out that there are very standard categories that appear over and over again. Though the algorithm cannot name them automatically as sports or politics, it can cluster all the sports articles into one cluster and the political articles as another.

However, in certain periods, totally new categories may turn up. During the Olympics, for example, a totally new category may be discovered, such as "Olympics news" or "Paralympics news". Being able to discover and identify newer clusters as they form, and categorise them as such, is also a part of clustering.

Another very common example is customer segmentation. If a large retailer wants to create promotions or marketing strategies based on customers' behaviour, it would need to categorise or segment its customers based on their behaviours, demographics and so on. One way of segmenting could be based on spends. Another could be based on age-group-related shopping habits. Yet another could be based on location and its influence, like beaches versus high altitudes. It could be based on loyalists versus coupon lovers. These should be gleaned from the customer data the retailer has. Then, their promotions can be very targeted and the conversion rate could improve immensely.

Clustering is also heavily used in the medical field like human gene clustering and clustering of organisms of different species or within a species too.

Note that in each of the cases there are no labels attached to the data. It is after the clusters are formed that you can get actionable insights from the clusters that are created.

There are many types of clustering algorithms of which here are the top 4 well-known ones:

Connectivity-based Clustering

Centroid-based Clustering

Distribution-based Clustering

Density-based Clustering

Each of them has its own characteristics with its own advantages and disadvantages. Today, I will provide a brief introduction to a couple of clustering algorithms.

The K-Means algorithm, one of the well-known clustering algorithms, is a centroid-based algorithm.

Hierarchical clustering is a connectivity-based clustering algorithm.

All clustering algorithms try to group data points based on similarities between the data. What does this actually mean?

It is often spoken of in terms of **inter-cluster heterogeneity** and **intra-cluster homogeneity**.

__Inter-cluster heterogeneity:__ This means that the clusters are as different from one another as possible; the characteristics of one cluster are very different from those of another. This makes the clusters very stable and reliable. For example, consider clusters of customers created based on highly populated areas versus thinly populated areas. If the difference in population is distinct, as in cities and villages, they turn out to be very stable and distinct clusters.

__Intra-cluster homogeneity:__ This talks about how similar the characteristics of all the data within a cluster are. The more similar, the more cohesive the cluster, and hence the more stable.

Hence the objective of clustering is to **maximise the inter-cluster distance** (inter-cluster heterogeneity) and **minimise the intra-cluster distance** (intra-cluster homogeneity).

This is one of the most popular clustering algorithms and one of the easiest as well. Here, we are looking to create a pre-determined "**K**" number of clusters.

Here, the similarities (or lack thereof) between data points are decided based on distances between points. The distance is measured from a **centroid** and hence this is called a **centroid-based clustering** algorithm. If we are starting with K clusters, we start with K centroids chosen randomly (or there is more science to it). Then, the distance of each point from these centroids is calculated and the points are associated with the centroid they are nearest to. *Note that it would be ideal to have the centroids placed as far away from each other as possible.*

Thus clusters of data points are formed. The centres/centroids are recalculated and again the steps are repeated till the points don't seem to be moving from one cluster to another.

The distance formula used here is the Euclidean distance between every point and the centroids. The closer the points are to each other, the greater the chance of belonging to the same cluster.

*Euclidean distance is very simple high school geometry. A very simple explanation of the same can be found **here**.*

We will look at a practical example with data in a subsequent article, but for now we can summarise our understanding as follows: in K-Means, the clusters are formed based on the distances between points, where K stands for the number of clusters we have decided to create or glean from the data.

Here is a graph showing how shoppers have been clustered based on the amount they spend at the shop and the frequency of orders they place.

Notice, 3 clusters have been formed with the red showing customers who spend small amounts and shop very frequently. The black cluster shows the group of customers who shop less frequently but spend large amounts. The blue cluster shows the customers who spend small amounts and are not so frequent shoppers either, the least profitable customers.
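A hypothetical sketch of this shopper segmentation with scikit-learn (all the numbers below are invented for illustration; the columns stand for average spend and orders per month):

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical shopper data: columns are [average spend, orders per month].
customers = np.array([
    [20, 15], [25, 12], [18, 14],   # small spend, frequent shoppers
    [500, 2], [505, 3], [498, 2],   # large spend, infrequent shoppers
    [30, 1],  [22, 2],  [28, 1],    # small spend, infrequent shoppers
], dtype=float)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(customers)
print(km.labels_)  # three groups of three customers each
```

In practice, the two features should first be brought to a comparable scale before clustering, since spend is on a much larger scale than order frequency.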

One of the biggest disadvantages of K-Means clustering is that you have to decide or choose the K value, i.e. the number of clusters, upfront. This is overcome in hierarchical clustering.

In Hierarchical clustering, you either start with all data as one cluster and iteratively break down to many clusters depending on similarity criteria or you start with as many clusters as your data points and keep merging them till you get one large cluster of all data points. This leads to hierarchical clusters of 2 types:

Divisive and

Agglomerative

The positive point of hierarchical clustering is that you do not have to specify upfront how many clusters you want.

It creates an inverted tree-shaped structure called the **dendrogram**. A sample dendrogram is shown here:

Because of this tree structure and the way the clusters are formed, it is called hierarchical clustering. *This figure shows the clusters created for the same customer data that was used to derive the 3 clusters of customers in the above graph under K-Means.* Here too, you can see it suggests 3 clusters through the green, red and turquoise clusters.

Interpreting the dendrogram is the most important part of the hierarchical clustering. Typically one looks for natural grouping defined by long stems.

The height of the dendrogram at which the different clusters are fused together represents the dissimilarity measure. Often it is the Euclidean distance again that is calculated to understand how similar or dissimilar two points are, and that is represented by the height in the dendrogram. Here the distance is calculated between the points themselves and not any centroid. Hence it is called a connectivity-based clustering algorithm.

The clusters that have merged at the bottom are very similar to each other, and those that merge later towards the top are the most dissimilar. We will talk about linkage and its types in a later article that goes into more detail about hierarchical clustering.

Here is a brief description of the two types of hierarchical clustering and how they differ from each other though both end up creating dendrograms.

This starts with all the data in one large cluster, from where it is divided into two clusters based on the least similarity between them and further divided into smaller clusters until a termination criterion is met. As explained earlier, this is based on connectivity which is essentially saying that all the points close to each other will belong to one cluster.

In this, the clusters are formed the other way round. Every data point is taken as its own cluster to begin with, and then the algorithm starts aggregating the most similar clusters till it ends up with one single cluster of all the available data. Hence it is also known as the bottom-up method.

The difference between these two methods is pictorially represented here.
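As a sketch, SciPy can build the dendrogram structure (the linkage matrix) bottom-up and cut it into flat clusters (synthetic data; agglomerative clustering with Ward linkage is an illustrative choice):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Two obvious groups of three points each (illustrative data).
X = np.array([[1.0, 1.0], [1.2, 1.1], [0.9, 0.8],
              [8.0, 8.0], [8.2, 8.1], [7.9, 7.8]])

Z = linkage(X, method="ward")  # bottom-up (agglomerative) merges
labels = fcluster(Z, t=2, criterion="maxclust")  # cut into 2 flat clusters
print(labels)  # the first three points fall in one cluster, the rest in another
```

Plotting `Z` with `scipy.cluster.hierarchy.dendrogram` would produce the inverted tree discussed above.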

Clustering algorithms are unsupervised algorithms. There are many types of clustering algorithms, each with its own advantages and disadvantages. Inter-cluster heterogeneity and intra-cluster homogeneity play a huge role in the creation of clusters.

Out of the four categories of clustering algorithms, we have looked at examples of two common types of algorithms - the K-Means and the Hierarchical clustering, at a very high conceptual level. The mathematics and the intricacies will be looked at, in subsequent articles.

However, coming back to the topic at hand, almost everyone on the Machine Learning journey starts with learning about Linear regression (at least all who are serious learners :)). Initially, it seems too simple to be of use for predictions. But as you learn more, you realise that it can be a solution for a good set of problems. However, can you use Linear regression for any problem at hand? Or are there constraints that you need to be aware of, so that you use it only in the correct scenarios?

If you have read my articles so far on Linear regression, starting from

Going all the way through the various concepts used in the above articles, individually as part of these articles

there are hardly any assumptions mentioned about linear regression per se.

The only assumption, if at all, and a very implicit one at that, is that there must be a linear relationship between the target and the independent variables. That is the reason we are able to express the relationship as in the equation given here:

where the Xs are the independent variables and the Y is the dependent variable. The betas are what the model comes up with for the given data, of course with the epsilon as the mean zero error or the residual term.

However, this is not the only assumption in the case of linear regression; there are other assumptions too, needed to make the inferences of a model reliable. This is because we are still creating a model from a sample and then trying to use that model for the general population. This implies that we are uncertain about the characteristics of the larger population, and that uncertainty needs to be quantified. Hence we need a few assumptions about the data distribution itself.

If any of these assumptions do not turn out to be true with the data that you are working on, then the predictions from the same would also be less reliable or even completely wrong.

There are 4 assumptions that need to hold good including the one already stated. They are

A Linear relationship exists between Xs and Y

The error terms are normally distributed

The error terms have a constant variance (or standard deviation). This is known as homoscedasticity

Error terms are independent of each other

Clearly, there are no assumptions about the individual distributions of X and Y themselves. They do not have to be normal or Gaussian distributions at all.
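A small sketch of checking two of these assumptions numerically, on synthetic data of my own (a truly linear relationship plus constant-variance noise, so the checks should pass):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 200)
y = 3.0 + 2.0 * x + rng.normal(0, 1.0, x.size)  # linear + homoscedastic noise

b1, b0 = np.polyfit(x, y, 1)       # least-squares slope and intercept
residuals = y - (b0 + b1 * x)

# Assumption check 1: the residuals have (essentially) zero mean.
print(round(residuals.mean(), 6))

# Assumption check 3 (roughly): residual spread is similar in the lower
# and upper halves of the X range, suggesting constant variance.
lo, hi = residuals[x < 5].std(), residuals[x >= 5].std()
print(round(lo, 2), round(hi, 2))
```

In practice one would also plot the residuals against the fitted values (and a Q-Q plot for normality) rather than rely on these two numbers alone.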

Let us understand the assumptions. The first one is obvious.

What does the second assumption mean?

When we are fitting a straight line for Y vs X, there can be a whole host of Y values for every X. However, we take the one that best fits the line. The actual point may not be on the line and that gives us the residual or error.

In the figure below, the e65 is the error at x = 65. e90 is the error at x=90.

Therefore, the Y at x = 65 would be Y = beta-0 + beta-1(65) + e65, i.e. the value predicted by the line plus the error e65.

This error itself can be anything. But considering that we want e (epsilon) to be a mean zero error, we will be fitting the line in such a way that the errors are equally distributed either positively or negatively around the line. That is what would be deemed the best fit line.

Since in linear regression, the data points should ideally be equally distributed around the best fit line, to ensure that the mean residual is zero, this makes the distribution of errors a normal distribution.

If you plot the residuals of your sample data, this is the kind of graph you should get.

This is the second assumption.

This is the 3rd assumption. This is also known as homoscedasticity. The errors have a constant variance (sigma-squared) or a constant standard deviation (sigma) across the entire range of the X values.

This is to say that the error terms are distributed with the same normal distribution characteristics (defined by mean, standard deviation and variance) through the data range.

See the patterns of residual plots in the above figure, the plot (a) shows no specific pattern in the residuals implying that the variance is constant. In such a case, linear regression is the right model to use. In other words, it means that all the possible relationships have been captured by the linear model and only the randomness is left behind.

In plot (b) you see that the variance is increasing as the samples progress, violating the assumption that the variance is a constant. Then, linear regression is not suitable in this case. In other words, this means that the linear model has not been able to explain some pattern that is still evident in the data.

If the data is heteroscedastic, it means that the variance of the error terms changes across the range of X, and hence that the linear model has left some pattern in the data unexplained.

This is the 4th assumption: the error terms are not dependent on each other and show no pattern among themselves if plotted. If there is a dependency, it would mean that you have not been able to capture the complete relationship between the Xs and Y through a linear equation; some pattern is still visible in the error.

Getting a residual plot like this shows that the variance is constant as well as the fact that the error terms are independent of each other.

These assumptions are necessary to be tested against the predicted values by any linear model if you want to ensure reliable inferences.

The meaning of these assumptions is - what is left behind (epsilon/error) that is not explained by the model is just white noise. No matter what value of Xs you fit in, the error's variance (sigma-square) remains the same. For this to be true, the errors should be normally distributed and have a constant variance, with non-dependence on each other.

This also implies that the data on hand is IID data or Independent and Identically distributed data, that is suitable for linear regression.

In layman's terms, all these assumptions say that the dependent data has a truly linear relationship with the independent variable(s) and hence is explainable with a linear model. We are ensuring that we are not force-fitting a linear model on something that is not linearly related, where there probably exists a relationship that is exponential, logarithmic or explained by higher-order equations.

Hence, you need to test for each of these assumptions when you build your linear models to use the inferences with confidence.


Let us understand the nuances of each of these today.

**Prediction**, as the word says, is about estimating the outcome for unseen data. For this, you fit a model on a training data set and use that model to predict the outcome for any unseen data.

In prediction, we do not make any assumptions about the shape of the data, except that there is a linear dependency of the target variable on the independent variables, and that this is explained by a model **f(x)** (where x could be a set of predictors x1, x2, x3, ..., xn).

Once the model is known, the model is used to **interpolate** a target variable based on a new set of unseen independent variables.

Though it may not always be true, most often, we use predictive models for understanding the impact of the independent variables on the target variable. Hence you want to keep the model as simple as possible.

For example, you have a use case where the number of viewers of a TV show is reducing. You want to fit a model to understand this behaviour, based on various factors like the actors, the plot of the show, the days of the week the show airs, the competing shows that have come at the same hour etc. You do a "multiple linear regression" and get a model. As soon as you have the model, you can see which predictors are more influential and with that insight, you can take corrective actions. You can even predict based on a few tweaks, what is the impact on viewership.

Here you are not very keen on high prediction accuracy; you are keen on knowing the cause of the change in the outcome. Actionable insights are usually expected from these predictive models.

**Forecasting** is a sub-discipline of Prediction where you are predicting for a future point in time.

For example, weather forecasting. We would not say weather prediction. Similarly, we say, sales forecasting. Given a lot of historical sales, you come up with a model, using which you forecast for a future date.

This kind of sales forecasting is valid provided the conditions remain the same as the conditions of the training data. If the training data is for a non-festive period, then, the forecast will also work for the same. But it will certainly not work for a festive period.

This implies that we are making an assumption that the conditions remain the same for the forecast to hold true. If this assumption breaks, the forecast could go completely wrong.

In fact, the language used when you give a forecast is different. You often say that, conditions remaining the same, the forecast is a specific value. This is seen as an **extrapolation** of the data from the existing time frame to a future time frame.

Also, regarding the outcome, here most often, you are looking for higher accuracy and not really for understanding the impact of the independent variables on the target variable. Hence there is a tendency to make the model complex, as the goal is different.

Let us look at a very simplistic example. You have data that shows how salary varies with years of experience. Clearly, there is a linear dependence between these two variables as shown in the diagram below

Note that if you want to "predict" the salary for someone with 6.5 years of experience, you can interpolate, as shown by the green dot and line, and derive that the salary is probably around 90000 units. Using this linear equation, you can predict for any number of years of experience in the given range of 1.1 to 10.3 years, for which we have data. You could go beyond 10.3 years too, but with no guarantees, as you have no data to substantiate that the relationship remains linear after 10.3 years.
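To make the interpolation concrete, here is a small sketch with made-up salary numbers (the actual values behind the figure are not available, so the fitted prediction only loosely mirrors the roughly 90000 read off the chart):

```python
# Simple linear regression on illustrative salary data
# (these values are assumed, not the actual dataset behind the figure)
experience = [1.1, 2.0, 3.2, 4.0, 5.1, 6.0, 7.1, 8.2, 9.0, 10.3]
salary = [39000, 43000, 57000, 63000, 67000,
          83000, 91000, 98000, 105000, 122000]

n = len(experience)
mean_x = sum(experience) / n
mean_y = sum(salary) / n
slope = sum((x - mean_x) * (y - mean_y)
            for x, y in zip(experience, salary)) / \
        sum((x - mean_x) ** 2 for x in experience)
intercept = mean_y - slope * mean_x

# Interpolation: 6.5 years lies inside the observed range of 1.1 to 10.3
predicted = slope * 6.5 + intercept
print(round(predicted))
```

The same equation applied outside the 1.1 to 10.3 range would be extrapolation, with no data to back it up.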

Let us look at an example that has the temporal component to it.

This is a graph that shows the increase in airline passenger traffic with time. It has a linear component and a seasonal component. For the sake of discussion here, kindly ignore the seasonal component. Assume you separate out the linear growth from the seasonal shape; you will see that over the years there is steady linear growth. You can use this data to come up with the linear growth expected, say, in 1962.

You are using all of the historical data to come up with what might be a future passenger load in 1962, which is nothing but an extrapolation. This is called forecasting.
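A linear-trend forecast of this kind can be sketched as follows. The yearly totals below are illustrative stand-ins for the airline data, not the exact figures:

```python
# Yearly passenger totals (in thousands); illustrative numbers only
years = list(range(1949, 1961))          # training window: 1949-1960
traffic = [1520, 1676, 2042, 2364, 2700, 2867,
           3408, 3939, 4421, 4572, 5140, 5714]

n = len(years)
mean_x = sum(years) / n
mean_y = sum(traffic) / n
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, traffic)) / \
        sum((x - mean_x) ** 2 for x in years)
intercept = mean_y - slope * mean_x

# Extrapolation: 1962 lies outside the training window, so this is a
# forecast, valid only if conditions remain the same as in the history
forecast_1962 = slope * 1962 + intercept
print(round(forecast_1962))
```

Note the contrast with the salary example: there we asked about a value inside the observed range (interpolation); here we step beyond it (extrapolation), which is why the "conditions remaining the same" caveat matters.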

It must strike you here that the forecasting will be accurate only if the conditions of the historical data and the future remain the same.

Most often, non-temporal use cases with interpolation are termed predictions and temporal based extrapolation is called forecasting.


It is, of course, a science, for all the mathematical rigour a feature goes through to be selected. Today, we will first understand why feature selection is an important aspect of Machine Learning, and then how we go about selecting the right features.

Suppose we have 100 variables in our data as potential features. Among them, we want to know the best set of features that will give the highest accuracy, precision or whatever metric you are looking for. We also want to know which of these even contribute towards predicting the target variable. It takes a lot of trial and error to find that out.

One way would be to use a brute-force method: try every combination of the variables available and check which predicts best. That means trying one variable at a time for all hundred variables, two at a time for all combinations within the 100 variables, three at a time, and so on. This leads to on the order of 2 to the power of 100 combinations.

Therefore, even with just 10 independent variables, there would be 1024 combinations. And if the number of variables increases to 20, the combinations snowball to 1048576. Hence this does not sound like an option at all.
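The arithmetic is easy to verify: the number of subsets of n features is 2^n, of which 2^n - 1 are non-empty (the figures quoted above include the empty set when stated as exactly a power of two):

```python
from math import comb

def n_subsets(n_features: int) -> int:
    # Non-empty feature subsets: C(n,1) + C(n,2) + ... + C(n,n) = 2^n - 1
    return sum(comb(n_features, k) for k in range(1, n_features + 1))

print(n_subsets(10))   # 1023 (1024 if the empty set is also counted)
print(n_subsets(20))   # 1048575
```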

Then, how do we go about selecting the features? There are two ways of dealing with it - Manual or automated feature elimination.

As you can guess, manual feature elimination is possible, and done, when the number of variables is very small (say 10 to a maximum of 15), as it becomes prohibitive when the number of features grows large. Beyond that, you have no choice but to go for automated feature elimination.

Let us look at both of them.

As already mentioned, this is possible only when you have fewer variables.

The steps involved are:

Build a model

Check for redundant or insignificant variables

Remove the most insignificant one and go back to step 1

Right. You build the model and then try to drop the features that are least helpful in the prediction. How do you know that a variable is least helpful? Two factors can be looked at: either the **p-value** of each variable or the **VIF (Variance Inflation Factor)** of each variable.

P-value is a concept from hypothesis testing and inference, which I do not plan to explain today; hopefully you are aware of it if you have done any regression modelling. In a nutshell, you need to know the following about the p-value:

P-value is a measure of the probability that an observed difference could have occurred just by random chance.

The lower the p-value, the greater the statistical significance of the observed difference

Therefore, if any variable exhibits a high p-value (typically, greater than 0.05), you can remove that feature.

To know more about VIF, please refer to my article on __Multicollinearity__. To summarise here, VIF gives a measure of how well one of the predictors can be predicted by one or more of the other predictors, which implies that this predictor is redundant. If the VIF is high, there is a strong association between them, and hence that predictor can be removed.

Similarly, if a feature has a VIF greater than 5 (just a heuristic), it can be eliminated, as it has strong collinearity with some of the other features and is hence redundant.
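The formula behind VIF is VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on the remaining predictors. With just two predictors, R_j^2 is simply their squared correlation, which allows a tiny self-contained sketch (the numbers below are hypothetical):

```python
def pearson_r(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = sum((x - ma) ** 2 for x in a) ** 0.5
    sb = sum((y - mb) ** 2 for y in b) ** 0.5
    return cov / (sa * sb)

# Two illustrative predictors: x2 is almost an exact rescaling of x1,
# so each should show a high VIF with respect to the other
x1 = [1, 2, 3, 4, 5, 6, 7, 8]
x2 = [2.1, 3.9, 6.2, 8.0, 10.1, 11.8, 14.2, 15.9]

r_squared = pearson_r(x1, x2) ** 2
vif = 1 / (1 - r_squared)     # VIF = 1 / (1 - R^2)
```

In practice, with more than two predictors, `variance_inflation_factor` from `statsmodels.stats.outliers_influence` computes this for each column of the design matrix.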

This process is repeated one variable at a time and the model is rebuilt again. Then, similar checks are made and any other insignificant or redundant variables are removed one by one, till you have only significant variables contributing to the model.

As you can see this is a tedious process. Let us see an example before we go to Automated feature elimination

This example predicts house prices given 13 features. A heatmap of the features is shown here:

I start with the one feature that seems most highly correlated with the price, i.e. '*Area*'.

When I create the linear regression model with just this variable, I get the summary as this:

(This is marked as **Step 1** in the code provided in the Jupyter notebook later)

The R-squared value obtained is 0.283. We should certainly improve on that, so we add the second most highly correlated variable, '*bathrooms*'. (This is **Step 2** in the code)

Then the R-squared value improves to 0.480. Adding a third variable, '*bedrooms*' (**Step 3**), improves it to 0.505. Then I add all 13 variables (**Step 4**), and the R-squared changes to 0.681. So, clearly, not all variables contribute in a big way.

Then, I use VIF to check the redundant variables. (code snippet here)

The result I get is:

Clearly, we see that '*bedrooms*' has a high VIF, implying that it is largely explainable by the other variables here. However, I also check the p-values.

In the p-values, I see that '*semi-furnished*' has a very high p-value of 0.938. I drop this and rebuild the model (**Step 5**).

When I check the p-values and VIFs again, I notice that one variable, '*bedrooms*', has both a high VIF of 6.6 and a high p-value of 0.206. I choose to drop it next. (**Step 6**)

Finally, in **Step 7**, I note that all VIFs are below 5 but '*basement*' has a high p-value of 0.03. This is dropped and the model is rebuilt.

This leads us to a place where all the remaining features show a significant p-value (< 0.05) and VIFs < 5.

These remaining 10 features are taken as the selected features for model development.

Here is the Jupyter notebook showing all the steps explained above.

Now, let us see how we can improve on this with the help of automated feature elimination.

There are multiple ways of automating the feature selection or elimination. Some of the often used methods are:

Recursive Feature Elimination (RFE) - Top n features

Forward, Backward or Stepwise selection - based on selection criteria like AIC, BIC

Lasso Regularization

Here, AIC is the __Akaike Information Criterion__ and BIC is the __Bayesian Information Criterion__ - different criteria used for model comparison.

We will theoretically look at each of these before I share a code based example for one of these methods.

This is where we give a criterion to select the top '**n**' features, where n is based on your experience of the domain. It could be the top 15 or 20, depending entirely on how many features you think truly influence your problem statement. This is clearly an arbitrary number.

Upon giving the features and the 'n' value to the RFE module, the algorithm goes back and forth over all the given features and comes up with the **top n features** that have the maximum influence on the target.

Forward Selection is where you pick a variable and build a model. Then you keep adding one variable at a time and, based on a criterion like AIC, you continue until you see no further benefit in adding.

Backward Selection is where you start with all the features and keep removing one variable at a time until the metric improves no more.

Stepwise is where you keep adding or removing variables until you arrive at a good subset of features that contribute to your metric.

In practice, Stepwise is the popular approach, though Backward and Stepwise tend to give very similar results. This is all done automatically by libraries that have already implemented these methods.
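To make the forward-selection idea concrete, here is a sketch that greedily adds the feature that lowers AIC the most and stops when nothing improves. The data is synthetic (only the first two of five candidates actually drive the target), and the AIC used is the common n·ln(RSS/n) + 2k form:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y truly depends only on the first two of five candidates
n = 80
X = rng.normal(size=(n, 5))
y = 2.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(scale=0.5, size=n)

def aic(cols):
    """AIC = n*ln(RSS/n) + 2k for an OLS fit on the chosen columns."""
    A = np.column_stack([np.ones(n)] + [X[:, c] for c in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    rss = float(np.sum((y - A @ beta) ** 2))
    return n * np.log(rss / n) + 2 * (len(cols) + 1)

selected, remaining = [], list(range(5))
current = aic(selected)
while remaining:
    # Try each remaining feature; keep the one that lowers AIC the most
    candidates = {c: aic(selected + [c]) for c in remaining}
    best = min(candidates, key=candidates.get)
    if candidates[best] >= current:
        break                      # no feature improves AIC: stop
    selected.append(best)
    remaining.remove(best)
    current = candidates[best]

print(sorted(selected))
```

Backward selection is the mirror image (start with all columns, drop the one whose removal lowers AIC the most), and stepwise interleaves the two moves.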

This form of regularization makes the coefficients of the redundant features zero. Regularization is a topic that can be looked at in-depth in another article.

Here is the Jupyter Notebook with the same house price prediction example with recursive feature elimination:

**A brief explanation here:**

I am using the **RFE** module provided by **scikit-learn**. Hence I use the LinearRegression module from the same library too, as that is a prerequisite for RFE to work. Follow through from Step 1, as all steps before that are preliminary data preparation steps.

In Step 1, RFE() is passed the model already created as '*lm*' and *10*, to say I want the *top 10 features*. It immediately marks the top 10 features as rank 1.

This line helps us see what the top 10 features are.
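For readers without the notebook handy, the RFE call looks roughly like this. The sketch below uses synthetic data and selects the top 2 of 5 features instead of the notebook's top 10 of 13 (the data and selection target here are placeholders):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import RFE

rng = np.random.default_rng(42)

# Stand-in for the housing features: only the first two drive the target
X = rng.normal(size=(100, 5))
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

lm = LinearRegression()
rfe = RFE(estimator=lm, n_features_to_select=2)   # keep the top 2 features
rfe.fit(X, y)

# support_ flags the kept features; ranking_ is 1 for kept, >1 for dropped
print(rfe.support_)
print(rfe.ranking_)
```

RFE repeatedly fits the estimator and eliminates the feature with the smallest coefficient magnitude until only the requested number remain, which is why the kept features all get rank 1.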

I take only these features to start building my model now.

The rest of the steps, from Step 2 onwards, show how to use the Manual Elimination method after RFE; this combination is called the **Balanced Approach**. I will explain this approach before we go back to understanding the code.

This is the most pragmatic approach that employs both types of feature elimination - a combination of Automated and Manual. When you have a large number of features, say 50, 100 or more, you use automated elimination to reduce the total number of features to the top 15 or 20 features and then you use manual elimination to further reduce it to select the truly important features.

The automated method helps in coarse tuning while the manual method helps in fine-tuning the features selected.

In the code, I used RFE to come to the top 10 features. Once I have got the top 10 features, I go back to building the Linear regression model, checking for p-values and VIF values and then deciding what more needs elimination.

For that, I build a Linear regression model using the **statsmodel** library as I can get the p-value from the summary provided by this model. (*I do not have this option in the LinearRegression module of SKLearn.)*

I see that the R-squared value is pretty good at 0.669. However, I see that '*bedrooms*' is still insignificant and hence drop that variable in Step 3.

Upon rebuilding that model, I see that there are no high p-values. I check VIF and notice all are below 5. Hence these 9 features are shortlisted as the final set of features for the model.

It is important to use only the features that contribute towards predicting the target variable and hence feature selection or elimination is important. There are many ways of doing it and Recursive feature elimination is one of the automated ways.

Manual feature elimination has been discussed to appreciate the concept of feature elimination but in practical circumstances, it will be rarely used. It is useful only if we have very few features.

There are more advanced techniques of feature elimination like Lasso regularization too.

P-Value definition:

__https://www.investopedia.com/terms/p/p-value.asp__

The ability to run Challenger and Champion models together on all data is a genuine need in Machine Learning, where model performance can drift over time and where you always want to be improving your models to something better.

So, before I delve deeper into this architecture, I would like to clarify some of the jargon I have used above. What is a Champion model? What is a Challenger model? What is model drift and why does it occur? Then, we can look at the rendezvous architecture itself and the problems it solves.

Once you put your model into production, assuming it will always perform well is a mistake. In fact, it is said: "**The moment you put a model into production it starts degrading**". (*Note: most often, 'performance' in ML means statistical performance - be it accuracy, precision, recall, sensitivity, specificity or whatever the appropriate metric is for your use case.*)

Why does this happen? The model is trained on some past data. It performs excellently for any data with the same characteristics. However, as time progresses, the actual data characteristics can keep changing and the model is not aware of these changes at all. This causes model drift i.e. degradation in model performance.

For example, you trained a model to detect spam mail versus ham mail. The model performs well when deployed. Over time, the types of spam keep morphing and hence the accuracy of the prediction comes down. This is called **model drift**.

The model drift could happen because of **concept drift** or **data drift**; I am not getting into these today. It suffices to understand that the performance of a model does not remain constant, and hence we need to monitor it continuously. Most often, it is best to retrain the model with fresher data frequently, or based on a threshold level of performance degradation.

Sometimes, even retraining the model does not improve the performance further. This would imply that you might have to understand the changes in the characteristics of the problem and go through the whole process of data analysis, feature creation and model building with more appropriate models.

This cycle can be shortened if you can work with Challenger models even while we have one model in production currently. This is a continuous improvement process of Machine Learning and very much required.

Typically, the model in production is called the **Champion** model. And any other model that seems to work well in your smaller trials and is ready for going into production is a **Challenger** model. These Challenger models have been proposed because we assume there is a chance that they perform better than the Champion model. But how do we prove it?

A Champion model typically runs on all the incoming data to provide the predictions. However, on what data does the Challenger model run?

There are two ways the Challenger models can be tested. The ideal would be to run the Challenger model in parallel with the Champion model on all the data and compare the results; this would truly prove whether the Challenger model performs better. However, this is prohibitive, especially in the big data world, and hence the Challenger is usually trialled on a subset of the incoming data. Once it seems to perform well, it is gradually rolled out to more and more data, almost like alpha-beta (A/B) testing.

As you might be aware, in alpha-beta testing, a small percentage of users (or, in this case, of the incoming data) is sent through a new test or Challenger pipeline, while the rest goes through the original Champion pipeline. This kind of testing is good for some applications but is clearly not very impressive in the world of machine learning: you are not comparing the models on the same data, and hence can rarely say with confidence that one is better than the other for the whole data. There could be lurking surprises once you roll it out for all the data, and the model drift can start sooner than expected.

A typical alpha-beta pipeline would look like this.

The data is split between the two pipelines based on some criteria like the category of a product. This data split keeps increasing towards Challenger as the confidence in the performance of the Challenger model grows.

From a data scientist's perspective, this is not ideal. The ideal would be to run the Challenger model in parallel on **all the data** along with the Champion model. But as I said earlier, this is very expensive.

Consider the worst-case scenario. If you want them to run in parallel, you have to set up two data pipelines that run through all the steps independently.

It would look something like this:

This has huge engineering implications and hence time to market implications too. The cost of this can get prohibitive over time.

A few of the top implications are the time and effort in building these pipelines over and over again, without being sure the Challenger model will indeed perform as expected. The CI/CD processes, deployments, monitoring, authentication mechanisms etc. are a few to mention. In addition, there is the cost of infrastructure that has to be doubly provisioned.

If these pipelines are big data pipelines, this becomes all the more significant. Very soon you realise that this is not a scalable approach. We certainly have to move away from parallel pipelines, or even from the alpha-beta testing method.

As a corollary, the best-case scenario would be when we can reuse much of the data pipelines. The idea is to minimize the amount one has to develop and deploy again into production. This would also ensure optimization of infrastructure usage. This is one line of thinking about how to optimize.

Even better would be to be able to just **plug in the Challenger model** and the rest of the pipeline plays as if nothing has changed. Wouldn't that be fantastic? And this is what is made possible by the **Rendezvous architecture.**

The Rendezvous architecture as written in the book is tilted towards ML with smaller data. I have tweaked it to meet the needs of the big data world and associated pipelines as shown in the diagram below: *(References to the book and another article are given below in the references section)*

Let me now explain section by section of this architecture:

This consists of the standard data pipeline for receiving incoming data, cleansing it, preparing it and creating the required features. This should be just one pipeline for every model that is to be deployed. The prepared data should maintain a standard interface that has all the features that may be required in that domain irrespective of the model on hand. (*I understand this is not always possible and may need tweaking piecemeal over time. But we can deal with that piece in isolation when required*)

This is a messaging infrastructure, like Kafka, that brings in a sense of asynchronicity. The data prepared as features is published onto the message bus. Every model listens to this bus and triggers off, executing itself on the prepared data. This message bus is what enables a plug-and-play architecture here.

This is the part where all models are deployed one by one. A new Challenger model can be deployed and made to listen to the message bus and as data flows in, it can execute. Any number of models can be deployed here and not just one Challenger model! Also, the infra requirement is only for the extra model to run. Neither the pre-model pipelines nor the post model pipelines need to be separately developed or deployed.

As you can see in the figure, you can have many challenger models as long as the data scientist sees them mature enough to be tested against real data.

Also, there is a special model called the decoy model. To ensure that the model processes are not burdened with persistence, the prepared data is also read by this **decoy model**, whose only job is to read the prepared data and persist it. This helps with audits, tracing and debugging when required.

All these models again output their predictions or scores into another message bus thus not bringing any dependency between themselves. Also, again this plays an important role in ensuring the pluggability of a model without disrupting anything else in the pipeline.
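The plug-and-play property can be illustrated with a toy in-memory bus (a stand-in for a system like Kafka; the class, topic and model names below are all made up). Each model, including the decoy, is just another subscriber, so adding a Challenger touches nothing upstream:

```python
from collections import defaultdict

class MessageBus:
    """A minimal in-memory stand-in for a message bus like Kafka."""
    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self.subscribers[topic].append(handler)

    def publish(self, topic, message):
        for handler in self.subscribers[topic]:
            handler(message)

bus = MessageBus()
outputs = []   # stand-in for the downstream score store

# Each model is just a listener on the 'features' topic; deploying a new
# challenger is one more subscribe() call, nothing upstream changes
def champion(features):
    bus.publish("scores", {"model": "champion", "score": sum(features)})

def challenger(features):
    bus.publish("scores", {"model": "challenger", "score": max(features)})

def decoy(features):
    outputs.append({"model": "decoy", "persisted": features})  # persistence only

for model in (champion, challenger, decoy):
    bus.subscribe("features", model)
bus.subscribe("scores", outputs.append)

bus.publish("features", [0.2, 0.5, 0.3])
print(len(outputs))
```

One published feature message fans out to every subscribed model, and all their scores land on the second bus without the models knowing about each other.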

From there the rendezvous process picks up the scores and decides what needs to be done, as described in Part 5.

This is where the new concept of a **Rendezvous process** is introduced, which has two important sub-processes. One sub-process immediately streams out the correct output from the pipeline for a consumer, chosen from among the many scores it has received; the other persists the outputs from all models for further comparison and analysis.

So, we have achieved two things here:

The best output is provided to the consumer

All the data has gone through all the models, and hence their performance is fully comparable under like circumstances.

How does it decide which model's output should be sent out? This can be based on multiple criteria like a subset of data should always be from Challenger and another subset should always be from Champion. This is almost like achieving the alpha-beta testing. However, the advantage here is that while it sounds like alpha-beta testing for a consumer, for the data scientist, all data has been through both the models and so they can compare the two outputs and understand which is performing better.

Another criterion could be that the output should be based on model performance. In this case, the rendezvous process waits for all models to complete and publish to the message bus. Then, it seeks the best performance metric and sends out that as the result.

Another criterion can be that of time, or latency. If we need the result in, say, less than 5 seconds, the process waits up to 5 seconds for results from the models, compares only those that arrived, and returns the best. Even if another model that performed much better comes back in the 6th second, it is ignored, as it does not meet the latency criterion.
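A sketch of that latency criterion (the arrival times and scores below are simulated rather than measured, to keep the example self-contained):

```python
# Simulated model results with arrival times in seconds (illustrative only)
results = [
    {"model": "champion",     "arrived_at": 1.2, "confidence": 0.80},
    {"model": "challenger_1", "arrived_at": 3.9, "confidence": 0.91},
    {"model": "challenger_2", "arrived_at": 6.0, "confidence": 0.97},  # too late
]

DEADLINE_SECONDS = 5.0

def rendezvous(results, deadline):
    """Pick the best-scoring result among those that met the latency budget."""
    on_time = [r for r in results if r["arrived_at"] <= deadline]
    best = max(on_time, key=lambda r: r["confidence"])
    # In a real pipeline, *all* results (even the late ones) would still be
    # persisted for later offline comparison between models
    return best

print(rendezvous(results, DEADLINE_SECONDS)["model"])   # challenger_1
```

Here the strongest model misses the 5-second budget, so the best on-time result wins, while the late result is still available for offline analysis.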

But how does this process know which criteria to follow for which data or which model? This can be included as part of the input data on the message bus in Part 2. Note that the Rendezvous process also listens to these messages, and thus knows what to do with the output that corresponds to an input. There could be other clever ways too, but this is one of the ways proposed.

By introducing asynchronicity through message buses, a level of decoupling has been introduced, bringing the ability to plug and play models into an otherwise rigid data pipeline.

By introducing the rendezvous process, the abilities to select between various model outputs, to persist them, and to compare them were all introduced. With this, it no longer seems a herculean task to introduce or support any number of new models for the same data set.

The rendezvous architecture gives great flexibility at various levels.

A variety of criteria can be used to decide which score, output or prediction is sent out of the prediction process. These criteria could be latency-based, model-performance-based or simply time-based.

It provides the ability to define and change these criteria dynamically through the rendezvous process. You can take it to another level by introducing a rule engine here.

It provides the ability to send all the data through all the pipelines, or only a subset through many pipelines. For example, if you are forecasting for both grocery and general merchandising products, groceries could go through their own Champion and Challenger models, while general merchandise, which typically sells slowly, can have its own pipelines.

It also provides the ability to run many models at a time without redeveloping or redeploying a large part of a big data pipeline. Apart from the effort and time savings, infrastructure costs are also optimised.

References:

__Machine Learning Logistics__ - the book by Ted Dunning and Ellen Friedman

An article on __towardsdatascience.com__: "__Rendezvous Architecture for Data Science in Production__" by Jan Teichmann

**Big data architectures **provide the logical and physical capability to enable high volumes, large variety and high-velocity data to be **ingested, processed, stored, managed** and **accessed**.

The marriage of these two opens up immense possibilities and large enterprises are already leveraging the benefits. To understand how to bring the two together, we would first need to understand them individually.

The Machine learning architecture is closely tied to the process of ML as described in my earlier article: "__Machine Learning Process - A Success Recipe__"

As a quick recap, a typical ML process would involve the steps depicted here:

It has a two-phased process of learning and predicting, the former feeding into the latter.

However, when the Machine Learning model has to be put into production, a few more aspects have to be taken care of, as shown here:

The aspects added are cross-cutting **concerns**, shown as the four layers at the bottom of the diagram.

**Task Orchestration:** the ability to orchestrate tasks like feature engineering, model training and evaluation on computing infrastructures like AWS or Azure. Dependency management is an important aspect here and is most often non-trivial.

**Infrastructure:** Provisioning of infrastructure and providing elasticity through options like containerization are essential.

**Security:** Along with that, an additional layer of security through authentication and authorization needs to be added.

**Monitoring:** Continuous monitoring of the infrastructure, the jobs and the performance are all non-trivial aspects to be taken care of in production.

The final aspect, providing **feedback about the statistical performance** of the model itself and thus opportunities to auto-tune it, would be a great value add (shown by the dotted line from Adaptation to Data Collection).

It goes without saying that the code written, should follow best practices of modularity and __SOLID__ principles leading to maintainability and extensibility.

All of this is good as long as the scale of data does not cross what can be handled by single large machines. In that realm, all of this would ideally be deployed as containerized applications or traditional n-tier architectures with their own data stores and processing capabilities, and would expose the models through APIs.

But the moment the scale of data crosses such a boundary, the only way to handle it is to use **distributed architectures**. The Big data stack provides one such ecosystem, whose functioning is primarily based on distributed computing and storage principles. Let us understand Big data architecture and its capabilities.

Let us now understand a typical application architecture on a big data platform. This includes building data lakes and serving their various customers, typically data analysts, business analysts and data engineers.

*And when ML and Big data come together, the customers include data scientists and ML engineers too. (which I will address in the next section)*

This is a generic architecture that should serve most enterprise data lakes - both from the perspective of building the lakes and of serving data from them for various use cases and stakeholders. No technology stack other than Hadoop is mentioned here, as each of the components has multiple options that should be evaluated based on the use cases of the organization.

This architecture has multiple elements: the ingress pipeline, the various data zones, the data processing pipelines, the streaming layer, and the egress and serving layers. Each of these components has to be well thought through to ensure it serves almost all the use cases of an organization.

**The Ingress pipeline:** All data coming into your data lake should come through a common mechanism, so that data governance, data lineage management and data security can all be centralised and governed well. This part can grow into an unmanageable nightmare if you allow multiple ad hoc ETL (Extract, Transform, Load) or ELT (Extract, Load, Transform) pipelines.

**The Landing Zone:** All data that comes in and does not need near-real-time processing lands here and is maintained for a pre-defined duration in the original raw format, for audit and traceability purposes. Practices of regular clean up have to be put into place.

**Data Validation:** Here is where all the types of data validation are done. Where possible, you can validate the data by comparing with the source and where not possible, validate the data for its own semantics as described in detail in my article on "__Data Validation - During ingestion into data lake__". As there are no ready-made tools for this, building a framework will take you a long way.
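As an illustration of the kind of semantic checks such a framework performs, here is a toy row validator (the column names and rules below are hypothetical):

```python
# A minimal flavour of ingestion-time validation: semantic checks on rows
# (the expected columns and rules here are hypothetical examples)
EXPECTED_COLUMNS = {"order_id", "amount", "country"}

def validate_row(row: dict) -> list:
    """Return a list of validation errors for one incoming record."""
    errors = []
    missing = EXPECTED_COLUMNS - row.keys()
    if missing:
        errors.append(f"missing columns: {sorted(missing)}")
    if "amount" in row:
        if not isinstance(row["amount"], (int, float)):
            errors.append("amount is not numeric")
        elif row["amount"] < 0:
            errors.append("amount is negative")
    return errors

good = {"order_id": 1, "amount": 49.9, "country": "IN"}
bad = {"order_id": 2, "amount": -5}

print(validate_row(good))   # []
print(validate_row(bad))    # missing 'country' and negative amount
```

A production framework would apply rules like these at scale (e.g. as Spark jobs), quarantine failing records, and publish validation metrics, but the per-record logic has this shape.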

**Data Lake:** The data lake holds data that is trustworthy and ready to be served to all of its consumers. However, this data is still in its original form, albeit clean. Even as it is, this data is very useful for deriving insights. Since this is a big data platform, you can let years of historical data accumulate - immensely valuable for an organization that believes "data is the new gold". Data can be read from here directly, but most often requires further transformations.

**De-normalized Layer and Data Cubes**: As the data is huge, joining data sets to derive insights becomes a highly expensive process. Hence, one of the best practices is to create a de-normalized layer of data for each domain in the organization, so that all users of that domain can get what they are looking for without repeating expensive processing over and over again. The de-normalized layer is almost the equivalent of the facade design pattern: while the sources of data may change, as long as the domain reads from the de-normalized layer, it is protected from those changes.
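The de-normalization itself is just "join once, read many times". A toy sketch with an invented sales domain (in practice this would be a distributed join in Spark or Hive, not plain Python):

```python
def denormalize(orders, customers, products):
    """Pre-join fact rows (orders) with their dimensions (customers,
    products) once, so downstream users of the sales domain read a
    flat table instead of re-running the joins on every query."""
    cust = {c["id"]: c for c in customers}   # dimension lookup tables
    prod = {p["id"]: p for p in products}
    return [
        {**order,
         "customer_name": cust[order["customer_id"]]["name"],
         "product_name": prod[order["product_id"]]["name"],
         "unit_price": prod[order["product_id"]]["price"]}
        for order in orders
    ]
```

If the upstream `customers` source is later replaced, only this join changes; every consumer of the flat output is untouched - which is exactly the facade-like protection described above.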

Also, if very similar aggregations are required repeatedly, building cubes of data with pre-aggregation could be a good idea. You could even introduce big data OLAP capabilities here so that the data can be served to reporting tools more natively. Some of the big data OLAP tools have been discussed in my article "__Hadoop for Analysts__".
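Conceptually, a pre-aggregated cube is nothing more than a measure rolled up over every combination of chosen dimensions, computed once so repeated queries become lookups. A minimal sketch (dimension and measure names are illustrative):

```python
from collections import defaultdict

def build_cube(rows, dims, measure):
    """Pre-aggregate `measure` over each combination of `dims`.
    Repeated aggregate queries then become dictionary lookups
    instead of full scans."""
    cube = defaultdict(float)
    for row in rows:
        key = tuple(row[d] for d in dims)   # one cell per dimension combo
        cube[key] += row[measure]
    return dict(cube)
```

Tools like Kylin or Druid do this at scale with incremental refresh; the sketch only shows the shape of the idea.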

**Egress or Serving Layer:** Once the data is processed, transformed and available, you have to be able to serve it through a serving layer. This could mean providing APIs through various technologies. An API can serve data right out of the Hadoop platform, or you could publish the data out of Hadoop. An egress framework here ensures that data produced within the data lake can be made available for all types of consumers, in batches or even near real-time.

If all the above aspects are taken care of, you have a working architecture for building data lakes and using them on a big data platform.

Having understood both the architectures independently, we need to see how they can work together and allow for new possibilities.

Since Machine Learning is all about "**Learning from Data**" and since Big data platforms have data lakes consisting of all the data one can have, it is but logical that they come together and provide even more insights and even better predictions opening up opportunities to businesses as never seen before.

Have a look at the amalgamated architecture. All you have to do is extend your data pipelines to now support machine learning too.

Most of the architecture looks very similar to the big data architecture, right? And yes, that's the point. Just extending it a little, as shown in the red dotted lines, gives your machine learning models the power of a big data platform.

Let us focus on the pipeline from feature engineering through to predictions. Now you can use the data from the data lake and transform it into the required features using the power of a distributed platform. The features can be stored in a feature repository that feeds the models being trained.
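A toy sketch of that step, with a plain dict standing in for a real feature repository and entirely invented feature names (a production version would run as a distributed job writing to a feature store):

```python
def build_features(raw_rows):
    """Turn cleaned data-lake records into model-ready feature rows.
    The features here (spend totals per customer) are illustrative only."""
    features = []
    for row in raw_rows:
        purchases = row["purchases"]
        features.append({
            "entity_id": row["id"],
            "total_spend": sum(purchases),
            "n_purchases": len(purchases),
            # guard against division by zero for customers with no purchases
            "avg_spend": sum(purchases) / max(len(purchases), 1),
        })
    return features

# A feature repository can be as simple as a keyed store of feature rows,
# so training and serving read identical features for each entity.
raw = [{"id": "c1", "purchases": [10.0, 20.0]},
       {"id": "c2", "purchases": []}]
feature_store = {f["entity_id"]: f for f in build_features(raw)}
```

Keying features by entity is what lets the same repository feed both batch training and online prediction.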

The output of parametric models (like logistic regression) can be stored in a model repository and either egressed out or served through APIs. For non-parametric models, where the whole training data set is required (as in K-Nearest-Neighbours-type algorithms), you can deploy the algorithm code as part of the pipeline itself.
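For the parametric case, "storing the model" really does mean storing just its parameters. A minimal sketch of a model repository serving logistic-regression predictions from stored weights (the repository here is an in-memory dict, purely for illustration):

```python
import math

# Parametric models: only the learned parameters go into the repository,
# not the training data.
model_repo = {}

def publish_model(name, weights, bias):
    """Store a trained logistic-regression model as its parameters."""
    model_repo[name] = {"weights": weights, "bias": bias}

def predict(name, x):
    """Serve a probability straight from the stored parameters."""
    m = model_repo[name]
    z = sum(w * xi for w, xi in zip(m["weights"], x)) + m["bias"]
    return 1.0 / (1.0 + math.exp(-z))   # sigmoid of the linear score
```

Contrast this with K-Nearest Neighbours, where there are no parameters to publish: the serving layer would need the training data itself, which is why the article suggests deploying that algorithm inside the data pipeline instead.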

This shows that extending the data pipelines that already exist into algorithms and models is the only extra work to be done!

The rest of the aspects of production-ready machine learning - authentication, monitoring, task orchestration and infrastructure provisioning - are all available out of the box from the stack here. None of these is explicitly depicted in the diagram because they come as a given on this stack.

You no longer have to work with small data sets, only to find that when you deploy with larger data, the statistical performance has degraded! Power unto you, power unto the data scientists and ML engineers - with all of the data, the processing power and the large memory.

Doesn't this sound liberating? It is indeed, though there are a few challenges and nuances one has to understand to make this work for your organization.

Machine Learning in a containerised world is itself a very empowering paradigm. Unearthing unforeseen insights and predictions has become a reality with the ushering in of ML.

Big data platforms like Hadoop have brought the parallel processing capability of a distributed architecture to every enterprise - big or small - with the help of affordable commodity hardware. Combining the two opens up new vistas for any organization.

However, tread carefully in how you set up the two aspects of ML and big data together. The skill sets needed should not be underestimated, and upfront architectural thinking is a must. Based on your company's use cases and risk appetite, you would have to run a series of POCs to finalize your custom set-up. The above article should, however, give you a jump start on that thinking.