The idea of the split is to ensure that the subset is more pure or more homogeneous after the split.

There are two aspects we need to understand here:

- The concept of homogeneity or purity - what does it mean?
- How do we measure purity or impurity?

Only then can we use this to split the nodes correctly.

Let us take an example to understand this concept.

Take a look at these two sets of data:

In dataset A, we have 4 boys and 4 girls – an equal number of both genders. This means that this dataset is a complete mix, or completely non-homogeneous, as there is a big ambiguity about which gender this dataset represents.

However, if you look at dataset B, you will see that all are girls. There is no ambiguity at all. This is, therefore, said to be a completely homogeneous dataset with 100% purity, or no impurity at all.

Any dataset lies somewhere between these two levels of impurity: totally impure to totally pure.

This concept, though it seems trivial or pretty obvious, becomes the foundation stone on which decision trees are built and hence I thought of calling this out explicitly.

Since decision trees can be used for both classification and regression problems, homogeneity has to be understood for both. *For classification purposes, a dataset is completely homogeneous if it contains only a single class label. For regression purposes, a dataset is completely homogeneous if its variance is as small as possible.*

- We want to be able to split in such a way that the new groups formed are as homogenous as possible. We are not mandating a 100% purity but intend to move towards more purity than the parent node.
- And we will see later (in another post) that the variable or feature we use for splitting also uses this concept of increasing homogeneity.
- Going further, this concept also helps in determining the importance of a feature to an outcome, a very useful aspect when we want to take actions at a feature level.

There are various methodologies that are used for measuring the homogeneity of a node.

The two commonly used ones are:

- Gini Index &
- Entropy

But to better understand this concept, I will also look at another method called "Classification Error" or the misclassification measure.

This is almost never used in real-life problems but helps clarify the concept of an impurity measure with ease.

This is the error we see when we assign the majority class label to a node. The fraction (or probability) of minority data points that get misclassified is called the classification error.

Therefore, the formula for the classification error is: Error = 1 − max(pₖ), where pₖ is the probability of class k in that node.

We will understand this with an example in the next post.

Similarly, I will share the formulas for the Gini Index and Entropy here as well, but get into the details with examples in the next post.

This is defined as the sum of p(1 − p), where p is the probability of each class. It is represented better as:

Gini = Σ pₖ(1 − pₖ) = 1 − Σ pₖ²

where the sum runs over the K classes.

This concept has come from physics – in particular thermodynamics, where entropy is defined as *a measure of the randomness or disorder of a system*. And in some sense, that is what we are also trying to measure when we look for impurity.

Here, it is defined as the negative sum of p·log₂(p), where p is the probability of each class in that node. It is represented as:

Entropy = −Σ pₖ log₂(pₖ)

What these formulas convey is something we can get into with some examples next week, to understand how each of these measures helps in measuring the purity or impurity of a node.
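To make these measures concrete, here is a small Python sketch (the function names are my own, not from any particular library) that evaluates each measure on class-probability vectors like the boys/girls example above:

```python
import numpy as np

def misclassification_error(p):
    # Error = 1 - max(p_k): the fraction of points outside the majority class
    return 1 - np.max(p)

def gini_index(p):
    # Gini = sum p_k * (1 - p_k) = 1 - sum p_k^2
    p = np.asarray(p)
    return np.sum(p * (1 - p))

def entropy(p):
    # Entropy = -sum p_k * log2(p_k), skipping zero probabilities
    p = np.asarray(p)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Dataset A: 4 boys, 4 girls -> completely impure
print(misclassification_error([0.5, 0.5]))  # 0.5
print(gini_index([0.5, 0.5]))               # 0.5
print(entropy([0.5, 0.5]))                  # 1.0

# Dataset B: all girls -> completely pure, every measure is zero
print(misclassification_error([1.0]))
print(gini_index([1.0]))
print(entropy([1.0]))
```

Notice that all three measures peak for the 50-50 mix and drop to zero for the pure node, which is exactly the behaviour we want from an impurity measure.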

There are a few characteristics of Decision Trees that make them stand out as useful algorithms in specific situations.

1. **They are highly interpretable.**

If a patient falls in the last red box on the right and is diagnosed as diabetic, you know why. You can explain that this is a male and that his fasting sugars are around 180 and hence a diabetic, with a high probability.

Interpretability is an important need for many organizations when adopting an algorithm. If something goes wrong with a business decision, you should at least be able to explain what went wrong and correct it. Without interpretability, many top stakeholders may be uneasy about accepting your algorithms.

**2. It is a versatile algorithm.** It can be used for classification or regression problems. We know that classification is used when the target variable is discrete and regression when it is continuous. So, in decision trees, we could check the purity or impurity based on the homogeneity of the class of the node for classification, but could use something like the sum of squared errors (SSE) to find the lowest-SSE point for a split in regression. In other words, by just changing the measure of purity, both classes of problems can be handled.

3. **Decision trees handle multicollinearity** better than linear regression does. In fact, multicollinearity does not matter here. We cannot interpret a linear regression well if multicollinearity is not handled. But that is not the case here: decision trees are unaffected by multicollinear data.

4. Building the tree with splits is **pretty fast** and works well on large datasets too.

5. **It is also scale-invariant.** Unlike in linear regression, the importance is not given based on the varying scales. The values are compared only within an attribute and hence without scaling, you can use data for decision trees. You can refresh your memory on __Feature Scaling in my earlier article__.

6. Another important advantage of decision trees is that they can **work with data that has a non-linear relationship between predictors and the target variable**. They partition the data into subsets that are approximately linear. Hence creating a sufficient number of splits helps in dealing with non-linear relationships.

So it has carved its own niche in solving problems because of these advantages.

However, they have certain **disadvantages** too, which need to be kept in mind when finally deciding whether to go with them or not:

- Decision Trees can create overly complex trees that lead to overfitting
- They are also said to be an unstable model as they can vary largely with even small variations in training data
- Also, they are not good for extrapolation. They work well only within the range of data used in training
- Decision trees can also create biased trees if the data has one class dominating. Hence the data set has to be balanced before fitting a decision tree

I plan to get into more details on building Decision Trees in the upcoming articles.

A decision tree algorithm is one that mimics the human decision-making process. It asks a question that can have more than one answer, branches off based on the answer, and then asks the next question. This continues till all questions are answered.

For example, take a very simplistic situation that we mentally sort out easily on a daily basis – deciding, based on the weather and the time available, what transport to choose to reach a venue. It would look like the decision tree below:

Is the weather cloudy, sunny or rainy? The next question answered is the amount of time available to reach the destination. Based on these two questions, a decision is made on the transport to take.

This is clearly an oversimplification of how the decision tree works. But this is just to explain what decision trees look like.

They have root nodes – the weather node here. Then, they have internal nodes like the cloudy and sunny nodes. And then you have the leaf nodes that give the final decision. There is a decision made at every node that decides the direction of the final decision.

Decision trees are clearly supervised models built based on already existing data that help create the internal and final leaf nodes based on various criteria.

As you can see it is a highly interpretable model. If you reach the decision to walk, you know it is because it is cloudy and you have more than 30 minutes to reach the venue.

Let us see a practical example, of a model that helps in predicting whether one is diabetic or not:

Here you can see that females aged less than 58 with fasting sugar less than 120 mostly do not have diabetes, and males above 58 with fasting sugar <= 180 mostly have diabetes. There are many intermediate stages where the decision might change to having or not having diabetes. Also, the majority class at a node decides the class of that node, bringing in only a particular level of accuracy in the prediction.

To get more and more accurate, you could go on splitting till every node has only one data point, each accurately predicted. This would be a complete case of overfitting: 100% accuracy on the training data but possibly very poor performance on any test data. This has to be avoided by fine-tuning what are called hyperparameters, discussed later.

1. It is a supervised algorithm – meaning that it learns from already classified data or data which already has the variable that needs to be predicted

2. It is also called a greedy algorithm. It maximises the immediate benefit rather than taking a holistic approach. The greedy approach makes it vary drastically with even small variations in the data set. Hence, it is called a high variance model.

3. It also works top-down: splitting starts at the root and proceeds downwards.

As you can see, we recursively split the data into smaller data sets. Based on what? Based on some features and the values of those features.

The data has to be split in such a way that the homogeneity or the purity of the subset created should be maximised. However, how do we decide which feature to split by, first and which feature goes next? How do we decide what is the value based on which the split threshold is decided? How long do we go on splitting i.e. what is the stopping criterion?

There are what are called Hyperparameters that help in making most of the decisions that need to be made.

Hyperparameters are simple parameters that are passed to a learning algorithm to control the training of the model. This is what helps in tuning the behaviour of the algorithm.

For example, in the Decision tree model, when the model is instantiated, a hyperparameter that can be passed is the max_depth – the number of levels that you want to split and train up to.

So, if you give max_depth as 5, the splits will happen only up to 5 levels, even though the accuracy may be questionable. Hence there is a lot of power in the hands of the model designer to tune the learning and get a better result.

Similarly, there are many more hyperparameters that we will see with more examples in later posts which will make this concept clearer.

Just to get a peek into the possible hyperparameters, look at this piece of code:
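The original snippet is not reproduced here, so below is a minimal sketch of what such an instantiation might look like with scikit-learn's DecisionTreeClassifier (the iris dataset is just a placeholder):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)  # placeholder data

# Only max_depth is passed explicitly; every other hyperparameter stays at its default
dt_model = DecisionTreeClassifier(max_depth=5)
dt_model.fit(X, y)

print(dt_model.get_depth())  # the fitted tree never exceeds 5 levels
```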

Here the decision tree classifier has been given one hyperparameter, i.e. max_depth; the others are left at their defaults. Other hyperparameters that can be given are max_features, max_leaf_nodes, min_impurity_decrease, min_impurity_split, etc.

Tuning the values of each of these can control overfitting and yet improve the accuracy of predictions on new data.

Hope this gave you a high-level introduction to decision trees.

Just to reiterate the problem statement, an NGO that is committed to fighting poverty in backward countries by providing basic amenities and relief during disasters and natural calamities has got a round of funding. It needs to utilise this money strategically to have the maximum impact. So, we need to be able to choose the countries that are in dire need of aid based on socio-economic factors and health factors.

I would highly recommend that you go through my __article on K-Means__ to understand the solution thinking, data cleansing, exploratory data analysis and data preparation steps.

Here I would like to just touch upon the Hierarchical modelling aspects instead of the K-Means algorithm used in the __previous article on K-Means__.

NOTE: The data and code for this article are fully available in my Github repo as a Jupyter notebook. You can look at it in detail and execute it to understand the complete flow.

In the Notebook, steps 1 to 4 are all around data understanding, cleaning and preparation, which remain the same irrespective of the type of clustering we are aiming to work with. These steps have all been detailed in the __K-Means Clustering article__ already mentioned.

Here I go directly to Step 5.

I use the scipy library here instead of scikit-learn, which was used in earlier examples.

So, the three imports I have done are:

```
from scipy.cluster.hierarchy import linkage
from scipy.cluster.hierarchy import dendrogram
from scipy.cluster.hierarchy import cut_tree
```

**linkage** is a routine that allows you to choose the type of linkage, as recently discussed in my article on __types of linkages for Hierarchical clustering__. One has to keep in mind the size of the data on hand and the order of complexity of computation required to arrive at the clusters while deciding the linkage type. Of course, you also want as distinct a set of clusters as possible. This is the balance that has to be ensured at this step.

The **dendrogram** routine in the scipy package helps you visualise the dendrogram created by the hierarchical model. The **cut_tree** routine helps in creating the clusters by cutting the dendrogram into the number of clusters you want to get.

As seen in the __article on linkages__, the single linkage model is created by just one line of code:

```
h_model_1 = linkage(country_scaled, method="single", metric='euclidean')
dendrogram(h_model_1)
plt.show()
```

Then the dendrogram obtained is:

Clearly, this is hardly interpretable or clean. This linkage relies on taking the smallest distance between clusters as the measure of dissimilarity.

We then try complete linkage to see if we get a better dendrogram:

```
h_model_2 = linkage(country_scaled, method="complete", metric='euclidean')
dendrogram(h_model_2)
plt.show()
```

This creates a much cleaner dendrogram, and you can see at what level you may have clearly distinct clusters formed. You could choose to have 2, 3, 5 or even 6 clusters depending on your business case.

In the jupyter notebook, I have first decided to go with 3 clusters and use the cut_tree routine to achieve the same:

```
h_cluster_id = cut_tree(h_model_2, n_clusters=3).reshape(-1, )
h_cluster_id
```

Then, I assign the cluster id so obtained, to the country dataframe as seen here:

```
country_hier3 = country_df.copy()
country_hier3['cluster_id'] = h_cluster_id
country_hier3.head()
```

And if I were to count how many countries are in each of the clusters, I see this:
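The count itself is a one-liner on the cluster-id column; a self-contained sketch with a tiny made-up stand-in for the `country_hier3` dataframe built above:

```python
import pandas as pd

# Tiny stand-in for the country_hier3 dataframe from the notebook
country_hier3 = pd.DataFrame({
    'country': ['A', 'B', 'C', 'D', 'E'],
    'cluster_id': [0, 0, 1, 2, 0],
})

# Number of countries in each cluster
counts = country_hier3['cluster_id'].value_counts().sort_index()
print(counts)  # cluster 0 -> 3, cluster 1 -> 1, cluster 2 -> 1
```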

When I profile these clusters, I realise that cluster 0 is the one containing poor nations that need aid and there are 50 countries here. That is not helpful for me to get back to the CEO saying the money on hand is needed for 50 countries. No one would benefit from this.

Hence I now move to cut the tree for 5 clusters. This improves the numbers for me. How do I understand this?
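Re-cutting the dendrogram does not require refitting the model; here is a self-contained sketch with random stand-in data in place of `country_scaled`:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cut_tree

rng = np.random.default_rng(0)
country_scaled = rng.normal(size=(30, 3))  # stand-in for the scaled country data

# Build the complete-linkage model once...
h_model_2 = linkage(country_scaled, method="complete", metric="euclidean")

# ...then cut the same tree into 5 clusters instead of 3
h_cluster_id = cut_tree(h_model_2, n_clusters=5).reshape(-1)
print(np.unique(h_cluster_id))  # labels 0..4
```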

Let's have a look at some of the profiling steps.

I plot a scatter plot of the 5 clusters as shown here:

```
plt.figure(figsize = [15,10])
plt.subplot(2,2,1)
sns.scatterplot(x = "gdpp" , y = "child_mort", hue = 'cluster_id', data = country_hier5, palette = "Set1", legend = "full")
plt.subplot(2,2,2)
sns.scatterplot(x = "income" , y = "child_mort", hue = 'cluster_id', data = country_hier5, palette = "Set1",legend = "full")
plt.subplot(2,2,3)
sns.scatterplot(x = "income" , y = "gdpp", hue = 'cluster_id', data = country_hier5, palette = "Set1", legend = "full")
plt.show()
```

There is a whole host of countries, represented by the red dots, that seem to have very low GDPP and high child mortality, similarly low income and high child mortality, and finally low income and low GDPP.

We can get another view by plotting a bar graph:

```
country_hier5.groupby('cluster_id')[['gdpp','child_mort','income']].mean().plot(kind = 'bar')
```

The scale of GDPP and income of the better-off countries is so large that the child mortality numbers are hardly visible. In spite of that, child mortality is clearly visible in cluster 0. Hence that seems to be the cluster with the poorest nations.

However, how many countries are part of cluster 0? Let us check.

There are 38 of them. The 50 earlier have been split into cluster 0 with 38 and cluster 3 with 12.

Let us also get an idea of the spread and the median of the 5 clusters around GDPP, income and child mortality by plotting box plots:

```
plt.figure(figsize = [15,10])
plt.subplot(2,2,1)
sns.boxplot(x='cluster_id', y = 'gdpp', data = country_hier5 )
plt.subplot(2,2,2)
sns.boxplot(x='cluster_id', y = 'child_mort', data = country_hier5 )
plt.subplot(2,2,3)
sns.boxplot(x='cluster_id', y = 'income', data = country_hier5 )
plt.show()
```

It is absolutely clear that the child mortality spread and median are high for the countries with the lowest income and GDPP.

Then, we can prioritise amongst these countries by sorting on child mortality, GDPP and income as they seem to be the indicators that we can choose for prioritisation:

```
country_hier5[country_hier5['cluster_id'] == 0]\
.sort_values(by = ['child_mort','gdpp', 'income'], ascending = [False,True,True])['country']
```

The top 10 list looks like this:

With this, we have some conclusions to represent the data and the suggestions to the CEO on the utilisation of funds.

Each of these pieces of code is simple and very easy to understand and execute in a Jupyter notebook. Do try it out yourself.

Wishing you a great time exploring the code and adding your own nuances to it.

Once again, the data and the code are available in my git repo at

__https://github.com/saigeethamn/DataScience-HierarchicalClustering__

There was a mention of "Single Linkage" too. The concept of linkage comes in when you have more than one point in a cluster, and the distance between this cluster and the remaining points/clusters has to be figured out to see where they belong. **Linkage is a measure of the dissimilarity between clusters having multiple observations.**

The types of linkages that are typically used are

- Single Linkage
- Complete Linkage
- Average Linkage
- Centroid Linkage

The type of linkage used determines the type of clusters formed and also the shape of the dendrogram.

Today we will look at what these linkages are and how they impact the clusters formed.

Single Linkage defines the distance between two clusters as the minimum distance between the members of the two clusters. If you calculate the pair-wise distance between every point in cluster one and every point in cluster two, the smallest distance is taken as the distance between the clusters, or the dissimilarity measure.
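The definition above can be sketched in a few lines; the two clusters here are made-up stand-in points:

```python
import numpy as np
from scipy.spatial.distance import cdist

# Two small stand-in clusters in 2-D
cluster_1 = np.array([[0.0, 0.0], [1.0, 0.0]])
cluster_2 = np.array([[4.0, 0.0], [5.0, 3.0]])

# All pair-wise distances between members of the two clusters
pairwise = cdist(cluster_1, cluster_2)

# Single linkage: the smallest pair-wise distance
single_link = pairwise.min()
print(single_link)  # 3.0, between [1, 0] and [4, 0]
```

Complete and average linkage differ only in taking `pairwise.max()` or `pairwise.mean()` instead of the minimum.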

This leads to the generation of very loose clusters which also means that the intra-cluster variance is very high. This does not give closely-knit clusters though this is used quite often.

If you take an example data set and plot the single linkage, most times you do not get a clear picture of clusters from the dendrogram.

From the plot here, you can see that the clusters don't seem so convincing though you can manage to create some clusters out of this. Only the orange cluster is quite far from the green cluster (as defined by the length of the blue line between them). Within the green cluster itself, you cannot find more clusters with a good distance between them. We will see if that is the same case using other linkages too.

As you can recollect, the greater the height (on the y-axis), the greater the distance between clusters. These heights are very high or very low between the points in the green cluster meaning that they are loosely together. Probably, they do not belong together at all.

*NOTE: Part of the code used to get the above dendrogram:*

```
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

mergings = linkage(sample_df, method='single', metric='euclidean')
dendrogram(mergings)
plt.show()
```

Can we do better than this with other linkages? Let us see.

In Complete Linkage, the distance between two clusters is defined by the maximum distance between the members of the two clusters. This leads to the generation of stable and close-knit clusters.

With the same data set as above, the dendrogram obtained would be like this:

Here you can see that the clusters seem more coherent and clear. The orange and green clusters are well separated. Even within the green cluster, you can create further clusters in case you want to. For example, you can cut the dendrogram at 5 to create 2 clusters within the green.

Here you can also note that the height between points in a cluster is low and between two clusters is high implying that the intra-cluster variance is low and inter-cluster variance is high, which is what we ideally want.

*Note: Code used for the above dendrogram:*

```
mergings = linkage(sample_df, method='complete', metric='euclidean')
dendrogram(mergings)
plt.show()
```

In Average linkage, the distance between two clusters is the average of all distances between members of the two clusters, i.e. the distance of a point from every point in the other cluster is calculated and the average of all these distances is taken.

Using the same data set, an average linkage creates the clusters as per the dendrogram here.

Here again, you can note that the points within a suggested cluster have a very small height between them implying that they are closely knit and hence form a coherent cluster.

*Note: Code used for this dendrogram:*

```
mergings = linkage(sample_df, method='average', metric='euclidean')
dendrogram(mergings)
plt.show()
```

One thing for sure is that the K-value need not be pre-defined in hierarchical clustering. However, once you start building the tree (dendrogram), a point cannot move out of a cluster. It stays in the cluster it already belongs to; more points can join it, but it cannot shift clusters. The process is one-directional, which is sometimes a disadvantage.

Also, note that each linkage calculation is pair-wise between clusters and hence requires a huge number of calculations. The larger the data, the more RAM and compute power required.

Among the three types of linkage mentioned above, the order of complexity of the calculations is lowest for single linkage and very similar for average and complete linkage.

References:

- Order of complexity for all linkages: __https://nlp.stanford.edu/IR-book/completelink.html__
- An online tool for trying different linkages with a small sample of data: __https://people.revoledu.com/kardi/tutorial/Clustering/Online-Hierarchical-Clustering.html__
- Single-link hierarchical clustering clearly explained: __https://www.analyticsvidhya.com/blog/2021/06/single-link-hierarchical-clustering-clearly-explained/__

Today we shall delve deeper into **Hierarchical clustering**.

In K-Means, when we looked at some of the practical considerations in __Part 3__, we saw that we have to start with a value for K, i.e. the number of clusters we want to create out of the available data. This is an explicit decision the modeller has to make. We also saw some methods to arrive at this K, like silhouette analysis and the elbow curve. However, this, in a way, forces the data into a pre-determined set of clusters. This limitation is overcome in hierarchical clustering.

This is one distinct feature that makes it more advantageous than K-Means clustering in certain cases, though this becomes a very expensive proposition if the data is very large.

Now, if we do not have an upfront K value, we have to have some measure to help us decide whether a point creates a cluster on its own or it is very similar to other points in an existing cluster and hence belongs there. This is what is called the **similarity or dissimilarity measure**. **Euclidean distance** is the most common measure used as a dissimilarity measure. If the points are very similar, their euclidean distance would be very small and that implies they are very close to each other. In other words, **the points with a very small dissimilarity measure are closer to each other**.
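The Euclidean distance between two points is just the square root of the sum of squared coordinate differences; a quick sketch with made-up points:

```python
import numpy as np

p1 = np.array([2.0, 3.0])
p2 = np.array([5.0, 7.0])

# Euclidean distance: sqrt of the sum of squared coordinate differences
dist = np.sqrt(np.sum((p1 - p2) ** 2))
print(dist)  # 5.0 -> the classic 3-4-5 right triangle
```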

Let us start with the sample data of 13 points given here and see how the clustering is done.

Let us plot a scatter plot of this data and see if there are any visible clusters. Since we have only two dimensions, we have the luxury of visual representation.

When we move to actual industry problems, we would be dealing with much higher dimension data and hence visually, we can only do exploratory analysis with pair-wise data to get a feel for the clusters, at the best.

Looking at the plot above, we do see two-three clusters at least. Let us find out through hierarchical clustering, how many clusters are suggested and how close they are to each other.

The first step would be to assume that every point is its own cluster, as shown in the diagram on the "Starting Step". Since we have 13 data points, we are starting with 13 clusters.

Then, we calculate the euclidean distance between every point and every other point.

This will help us create the first cluster between the two nearest points.

In this case, points 10 and 11 are the nearest and they form the first cluster as shown in iteration 1.

Now you treat these two points as one cluster and so you have 12 clusters at the end of the first iteration or at the beginning of the 2nd iteration. During the 2nd iteration, point 9 joins the first cluster containing points 10 and 11, as that is the next nearest, as shown in Iteration 2.

This process goes on iteratively and the clusters that are closest to each other keep getting fused till we get one large cluster containing all the points. This is called **Agglomerative Clustering or AGNES (Agglomerative Nesting).**

This iterative process leads to the formation of a tree-like structure called the **dendrogram**. You can see the dendrogram for the above data in the figure below:

The first step is indicated by the smallest orange line joining points 10 and 11. The second iteration is represented by the orange line that joins 9 into the same cluster. In the third iteration, points 1 and 2 form their own cluster. You go by the height to know which is the next data point that formed a cluster.

The **height of the dendrogram** is a measure of the dissimilarity between the clusters and in this case that is the euclidean distance between the points. Based on this you can see that the dissimilarity is very low between points 5 and 6, 0 and 8, 1 and 2 respectively and they quickly form their own clusters on each subsequent iteration. The height at which they join also is very small indicating that the euclidean distance is very small between them.

When you see long lines that create branches way below, you know that the dissimilarity measure or the euclidean distance is very high between them and hence they form very distinct clusters. In the above diagram, the blue line is clearly indicating the dissimilarity is very high between the orange and the green clusters.

But how do we calculate the distance between a cluster containing many points (more than one) and another data point outside the cluster? This leads us to the concept of linkage.

**Linkage is a measure of dissimilarity between clusters having multiple observations. **

In a multi-point cluster, the distance is calculated from every point in the cluster to the external point and, often, the minimum of these distances is taken. This way of calculating the distance is one type of linkage called **single linkage**. It is not the best in terms of quality of clustering but is often chosen for the compute efficiencies involved.

There are other types of linkage that we can see in a subsequent post.

Having now gone through the AGNES process, we finally have to decide how many clusters to create and what are they? How do we identify the clusters using the dendrogram?

You do a horizontal cut on the dendrogram and every group below becomes a distinct cluster. If you cut at a height of 7 along the Y-axis, you create two clusters as shown in orange and green.

If you cut at a height of 5, you get three clusters, consisting of data points 12-9-10-11 in one cluster, and 4-5-6-7 and 0-8-3-1-2 in the other two clusters.

Libraries that exist for hierarchical clustering allow you to "cut" the tree like the above and come up with a varying number of clusters.
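A horizontal cut at a chosen height can be done with scipy's fcluster; a sketch on made-up, well-separated stand-in data (the height of 5 here plays the same role as the cuts above):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(42)
# Two well-separated stand-in blobs instead of the 13-point sample
points = np.vstack([rng.normal(0, 0.5, size=(7, 2)),
                    rng.normal(8, 0.5, size=(6, 2))])

Z = linkage(points, method='single', metric='euclidean')

# Cut the dendrogram at height 5: every branch below the cut becomes a cluster
labels = fcluster(Z, t=5.0, criterion='distance')
print(np.unique(labels))  # the two blobs come out as two clusters
```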

Hierarchical clustering is used when we do not want to decide the number of clusters upfront. It is computationally intensive due to the way linkages are established between the points to create the clusters.

There are two ways of hierarchical clustering, **AGNES or Agglomerative Nesting** and **DIANA or Divisive Analysis**, both of which have been discussed in the __Introduction to Clustering Algorithms__. While the former starts with each point as an independent cluster and iteratively ends with all the points in one large cluster, the latter does the opposite.

In other words, bottom-up clustering is AGNES and top-down clustering is DIANA.

Most often the dissimilarity measure used is Euclidean distance. Linkages are the way this is calculated for multi-point clusters. This in summary is all about Hierarchical Clustering.

Let us look at one practical problem and its solution.

An NGO that is committed to fighting poverty in backward countries by providing basic amenities and relief during disasters and natural calamities has got a round of funding. It needs to utilise this money strategically to have the maximum impact. So, we need to be able to choose the countries that are in dire need of aid based on socio-economic factors and health factors.

Let us use K-Means Clustering to create clusters and figure out the countries that are in greatest need as per the data provided.

You may find the data and the entire code in this git repo:

__https://github.com/saigeethamn/DataScience-Clustering__

If we want only the top 5 or top 10 countries that deserve aid, then we could think of a regression model. But we could also use clustering as a way to find the cluster of the most needy countries. Once we get the clusters, within that cluster we could further analyse and decide where the aid goes.

I could have done K-Means Clustering or Hierarchical Clustering. I will go with K-Means for now, as that is what we have understood in theory so far.

So, how and where do I start? I will be following the preliminary steps outlined in my previous post on "__Steps towards Data Science or Machine Learning Models__"

In this post, I will not explain the code for data analysis or preparation. I will just explain the bare minimum through plots and insights, as this code is pretty repetitive for all analyses. However, the K-Means part alone, I will walk through the code too.

I have to load the data and understand it first. From the shape, I know that I have data of 167 countries and I have 10 columns of data including the country name. A brief description of the data is here:

Note that exports, health and imports columns are given as % of GDPP and hence they need to be converted back to absolute values for further analysis. You can refer to the notebook in the git repo to understand how that is done.

NOTE: The code for this article is fully available in this Github repo as a Jupyter notebook. You can look at it in detail and execute it to understand the complete flow.

When I do a null value check, I do not find any missing data. Hence there is no null value treatment required, and no columns or rows need to be dropped either. All the data is numerical, so no categorical data encoding is required either.

Here, I should also do outlier analysis and treatment. However, I am interested in exploring the original data before I treat the outliers if any. Hence I move on to Step 3 consisting of EDA.

The main steps here are univariate and bivariate analysis. I plot the distplot for all the data as shown here:

Most are right-skewed, implying that a large number of countries are bunched at the lower end, with a small number in a far-right cluster – the behaviour of these 6 features:

- Child Mortality
- Exports
- Health
- Imports
- Income
- Inflation

Life expectancy, total fertility, income and gdpp show visible clusters. For bivariate analysis, a heat map and a pair plot are sufficient as all the data is continuous.

From this, I see that

- There is a high positive correlation between GDPP and income, life expectancy, exports, health and imports.
- There is a negative correlation between GDPP with Total fertility, Child Mortality and inflation
- Exports, imports and health are highly correlated
- Health is negatively correlated with Total Fertility, Inflation and Child Mortality
- There is a strong positive correlation between Total fertility and Child mortality
- Also a positive correlation between income and life expectancy

Hence, we have a good chunk of correlated data that should help in creating clusters. A scatter plot also helps us see if there are any visible clusters, and hence we do a pair plot like this one:

Having understood the basic data, we move to the next step of data preparation.

Since all the data is continuous data, we can look at box plots and see if there are outliers.

From this, we see that child mortality, exports, imports, income and inflation have outliers on the higher end and life expectancy at the lower end. We need to be watchful about capping the high-end values of data like inflation and child mortality and the lower-end values of life expectancy, as the needy countries should not lose out on aid because of this.

However, it is safe to cap the higher end values of income, exports, imports, gdpp. Hence I have chosen to cap the higher end at 0.99 quantile and the lower end to 0.1 quantile.

We do not cap 'health' as it has almost continuous values right up to the highest percentiles, and that itself could contribute to a cluster. Again, refer to the notebook in git to view the code for this.

Next, we scale the variables using a StandardScaler from sklearn.preprocessing library. Here, we do not split data into train and test as we are finding clusters across all of the data. It is not supervised learning and we do not test predictions against any target variable.
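A minimal sketch of this scaling step (the tiny DataFrame and its column names here are made-up stand-ins for the actual country data in the notebook):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric rows standing in for the cleaned country dataset
country = pd.DataFrame({
    "gdpp": [550, 4100, 46000],
    "income": [1610, 9930, 45400],
    "child_mort": [90.2, 16.6, 4.3],
})

# Standardise each column to zero mean and unit variance
scaler = StandardScaler()
country_scaled = scaler.fit_transform(country)

print(country_scaled.mean(axis=0).round(6))  # each column now centred at ~0
```

Note that `fit_transform` returns a NumPy array, so the column names are lost; you can wrap it back into a DataFrame if you want labelled columns downstream.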

In __part 3__ of the theory on K-Means, I have spoken about having to check for the cluster-tendency of the given data. So, now we run the Hopkins test to check if this data shows a cluster-tendency.

A basic explanation of the Hopkins statistic is available on __Wikipedia__ and a more detailed discussion is available __here__. It compares the data on hand with random, almost uniform data and tells you whether the given data is almost as uniform or shows a clustering tendency. For this data, as seen in the code, we get a value anywhere between 0.83 and 0.95, indicating that there is a possibility of finding clusters, and hence we go ahead with K-Means Clustering.
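sklearn does not ship a Hopkins implementation, so here is a simplified sketch of the statistic (the function name and sampling details are my own and may differ from the notebook's exact implementation):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def hopkins(X, sample_frac=0.1, seed=42):
    """Hopkins statistic: ~0.5 for uniform data, near 1 for clusterable data."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X)
    n, d = X.shape
    m = max(1, int(sample_frac * n))

    nn = NearestNeighbors(n_neighbors=2).fit(X)

    # u: distances from m uniform random points (inside the data's bounding
    # box) to their nearest real data point
    uniform = rng.uniform(X.min(axis=0), X.max(axis=0), size=(m, d))
    u = nn.kneighbors(uniform, n_neighbors=1)[0].sum()

    # w: distances from m sampled real points to their nearest *other* point
    # (second neighbour, since the nearest is the point itself)
    sample = X[rng.choice(n, m, replace=False)]
    w = nn.kneighbors(sample, n_neighbors=2)[0][:, 1].sum()

    return u / (u + w)

# Two tight, well-separated blobs should score close to 1
blob1 = np.random.default_rng(0).normal(0, 0.1, (100, 2))
blob2 = blob1 + 10
print(hopkins(np.vstack([blob1, blob2])))
```

On genuinely uniform data the same function returns a value near 0.5, which is the "no cluster tendency" signal described above.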

The first step in modelling is to figure out what is correct K for our data since we want to initialise the model with K. Again as mentioned in __part 3__, this is done using the elbow method or the silhouette analysis.

First, let's see the code for KMeans clustering with a random k. The code for clustering itself is literally 2 lines.

We have to import the KMeans library from sklearn.

`from sklearn.cluster import KMeans`

If we choose to go with any arbitrary number for K and create the cluster, here's how the code would look:

```
kmeans = KMeans(n_clusters=4, max_iter=50)
kmeans.fit(country_scaled)
```

We instantiate an object of the KMeans class as kmeans. There are 2 arguments we pass: one is k, i.e. the number of clusters we want to create, here arbitrarily chosen as 4. The second is the maximum number of iterations the algorithm has to go through. Recollect that the two steps of calculating distances and reassigning points to a centroid happen iteratively. These two steps may not always converge; in such a case, max_iter tells the algorithm to stop after 50 iterations and return the clusters formed at the last iteration. There are a lot more arguments that you can look at in the help and understand, but this is the bare minimum for invoking the KMeans algorithm. Then you just take this kmeans instance and fit it against the scaled country data and four clusters are formed. It is as simple as that!!
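Once fitted, the instance exposes the results directly through its attributes. A self-contained sketch, with tiny toy data standing in for `country_scaled`:

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy stand-in for the scaled country data: two obvious groups
country_scaled = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])

kmeans = KMeans(n_clusters=2, max_iter=50, n_init=10, random_state=100)
kmeans.fit(country_scaled)

print(kmeans.labels_)           # cluster index assigned to every row
print(kmeans.cluster_centers_)  # one centroid per cluster
print(kmeans.inertia_)          # within-cluster sum of squared distances
```

`labels_` is what you would attach back to the country DataFrame for profiling, and `inertia_` is the quantity used in the elbow method below.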

However, deciding the value of K is a very important aspect and let us see how we decide the optimal number of clusters.

We create multiple clusterings, starting with k=2 and going on with 3, 4, 5 and so on. When adding more clusters is no longer beneficial, we stop at that point. We start by running K-Means clustering with K ranging from 2 to 9. Here's how the code looks.

K-Means algorithm used is from the library sklearn.cluster

```
from sklearn.cluster import KMeans
import pandas as pd
import matplotlib.pyplot as plt

ssd = []
for k in range(2, 10):
    model = KMeans(n_clusters=k, max_iter=50, random_state=100).fit(country_scaled)
    ssd.append([k, model.inertia_])

plt.plot(pd.DataFrame(ssd)[0], pd.DataFrame(ssd)[1])
plt.xlabel("Number of Clusters")
plt.ylabel("Total Within-SSD")
plt.show()
```

And the plot we get is:

Now, let us understand what we are doing in the code.

`model= KMeans(n_clusters = k, max_iter=50, random_state=100).fit(country_scaled)`

Here we call the fit method on KMeans for each k value ranging from 2 to 9 and create the model. Then we use an attribute of the model to understand which value of K gives good clusters.

KMeans has an attribute called inertia_, which you can see in the __sklearn documentation__ or by executing the command help(KMeans) in your Jupyter notebook. inertia_ is defined as the "sum of squared distances of samples to their closest cluster centre". So, if you have 3 cluster centres and each point is associated with one of them, then the sum of the squared distances of all the points to their respective centres is given by inertia_. In fact, this is the cost function that we want to minimise, as discussed in __part 2 __of my series on KMeans Theory.

So, we capture this for every k value in the range - in a list variable called ssd:

` ssd.append([k, model.inertia_])`

And the next set of statements plots the value of inertia_ against the k value. So, wherever we get a significant dip in inertia_, we take that as the k value of choice; after a particular k, inertia_ does not show any significant improvement.

So, we see that there is a sharp dip in ssd from K=2 to K=3. Then the rate of fall slows down from K=4. It further slows down with higher Ks. Because of the shape of the curve at K=3, it is called an elbow curve. Given this insight, we could choose K as 3.

Now let us look at Silhouette Score too.

Broadly speaking, it is a measure of the goodness of the clusters created. We have understood in __Part 1 __of the series on KMeans that we want the maximization of the inter-cluster distance and the minimization of the intra-cluster distance. This is what is encapsulated in the silhouette score.

In other words, a metric that measures the cohesiveness of a cluster and the dissimilarity between clusters is called the silhouette score.

It is represented as **s = (p − q) / max(p, q)**,

where

*p is the average distance to the points in the nearest cluster that the data point is not part of.*

*q is the average intra-cluster distance to all the points in its own cluster.*

Let us understand the intuition behind this. By definition of maximization of inter-cluster distance, p should be as large as possible and by minimization of intra-cluster distance q should be as small as possible. If q is very small, the ratio is almost p/p and hence 1. If q is very large, the ratio is -q/q and hence -1

Therefore, the silhouette score combines the two (p and q) and ranges from -1 to 1. A score closer to 1 indicates that the data point is very similar to other data points in the cluster and a score closer to -1 indicates that the data points are not similar to other data points in its cluster.
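We can sanity-check this definition on a tiny, made-up example by computing p and q by hand for one point and comparing with sklearn's `silhouette_samples`:

```python
import numpy as np
from sklearn.metrics import silhouette_samples

# Two tiny clusters; score the first point, (0, 0), by hand
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
labels = np.array([0, 0, 1, 1])

q = 1.0                             # avg distance to its own cluster mate
p = np.mean([10.0, np.sqrt(101)])   # avg distance to the nearest other cluster
s = (p - q) / max(p, q)

print(round(s, 4))
print(silhouette_samples(X, labels)[0].round(4))  # same value from sklearn
```

The two printed values agree, and the score is close to 1 because the point sits tightly in its own cluster and far from the other one.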

This is calculated for every point and for every K. Then the same is plotted on a graph for every k.

So, whichever K has the maximum silhouette score is the one with the best inter-cluster heterogeneity and intra-cluster homogeneity. The silhouette scores seem to be very similar for k = 3, 4 or 5.

Here is the code that is written to calculate and plot the silhouette score

```
from sklearn.metrics import silhouette_score

ss = []
for k in range(2, 10):
    kmeans = KMeans(n_clusters=k, max_iter=50, random_state=100)
    kmeans.fit(country_scaled)
    silhouette_avg = silhouette_score(country_scaled, kmeans.labels_)
    ss.append([k, silhouette_avg])

plt.plot(pd.DataFrame(ss)[0], pd.DataFrame(ss)[1])
plt.xlabel("Number of Clusters")
plt.ylabel("Silhouette Score")
plt.show()
```

Here we use the silhouette_score from sklearn's metrics library. We create the KMeans clusters for each K in the range 2 to 9. And to silhouette_score, we pass the scaled country data and the labels returned by KMeans, which are used to calculate the intra- and inter-cluster distance averages for every point in this line:

`silhouette_avg = silhouette_score(country_scaled, kmeans.labels_)`

Finally, we gather the score against each k in the list named ss to help in plotting the graph.

Based on both of these tests, it looks like 3 is the right number of clusters. So, we will go ahead with this value of K and create 3 clusters.

```
kmeans3 = KMeans(n_clusters=3, max_iter=50, random_state=100)
kmeans3.fit(country_scaled)
```

Then we do the cluster analysis to see what direction or insight we get out of it. This cluster Profiling or analysis can help us finally say which are the countries that are in direst need of aid.

Let's now start with understanding how many countries are in each cluster:
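One way such a count could be produced is by wrapping the fitted labels in a pandas Series (synthetic blob data stands in for the scaled country data here; the actual counts come from the notebook):

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Toy stand-in for the scaled country data: three well-separated blobs
rng = np.random.default_rng(100)
country_scaled = np.vstack([rng.normal(c, 0.3, (30, 2)) for c in (0, 5, 10)])

kmeans3 = KMeans(n_clusters=3, max_iter=50, n_init=10, random_state=100)
kmeans3.fit(country_scaled)

# Attach the labels and count members per cluster
clusters = pd.Series(kmeans3.labels_, name="cluster_id")
print(clusters.value_counts())
```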

From the above code, you can see that there are 90 in cluster 2, 29 in cluster 1 and 48 in cluster 0.

Let us now plot a scatter plot with the 3 most important variables: income, GDPP and Child Mortality, for each cluster:

You can see the countries represented in red dots have low GDPP, low income and high child mortality. They would be the countries that would best benefit from aid.

We can plot a bar graph and box plots to understand whether these clusters are truly distinct in their characteristics.

The bar graph shows that the gdpp and income are quite different for the 3 clusters. The box plots show how the mean of gdpp and income is very low for cluster 0 while the child mortality is very high. This makes the profile of the countries very clear.

Since we want only the top 10 countries, we can sort by gdpp, child mortality and income in the ascending, descending and ascending orders respectively and take the top 10 countries for providing aid:
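That multi-key sort could look like this in pandas (the rows here are made up purely for illustration; only the column names follow the dataset):

```python
import pandas as pd

# Hypothetical rows from the under-developed cluster
needy = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "gdpp": [330, 550, 410, 700],
    "child_mort": [116, 90, 130, 62],
    "income": [610, 1610, 900, 2200],
})

# Sort by gdpp (ascending), child mortality (descending), income (ascending)
top = needy.sort_values(
    by=["gdpp", "child_mort", "income"],
    ascending=[True, False, True],
).head(10)
print(top["country"].tolist())
```

The later keys only break ties on the earlier ones, so gdpp dominates the ordering here.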

If the budget could support it, the entire list of 48 countries in this cluster could have been considered, as each is only slightly better off than the next. However, with budget constraints, the top 5 to 10 needy nations could be considered so that the impact is at least felt and a difference is made to the people receiving the aid.

Finding K for K-Means is an important pre-step for clustering. There are multiple methods to find the appropriate K. Cluster Profiling helps us derive more insights. However, data understanding, data preparation and transformation before clustering are also important steps that cannot be overlooked.

Developing a model is mostly the easiest step. However, again deriving insights from the model to get actionable results requires a deep enough understanding of the problem on hand and the implications at the ground level.

The entire code for this is available in the git repo whose link is given above, with more details in comments and explanations at each step. This is a fairly simple problem that was addressed through KMeans clustering.

Hope this was a useful code walkthrough with a significant example.

So, today instead of going into K-Means modelling, I thought, why not look at the steps we necessarily (may not be sufficient though) indulge in, before modelling of any sort.

Many want to learn Data Science and Machine Learning. And there is enough and more material available on the internet to learn. And sometimes that becomes the problem.

Libraries out there like __scikit-learn__ make machine learning look like child's play until you start solving real-world problems. It typically consists of 2 steps - fit and predict. The fit() method fits against the available data creating a model. Then you use that model to predict against new or unseen data. Doesn't that look so simple?

In fact, here is a snippet of the code from the scikit-learn library which shows the simplicity of the exercise and the coding involved:

Four lines of code - in which 1 line creates the model, 1 line uses the model to predict. Then, is data science so simple and easy?

Much of the work is all before the model creation itself.

So, today I would like to list out a set of basic steps that have to be done before you get into modelling. This is just an indicative set of steps, nowhere exhaustive, but can serve as a starting point for modelling many algorithms that are fundamental to data science.

A birds-eye-view of the steps involved is provided in this mindmap:

The first step is to explore your data and understand it. Then you clean it and do some basic transformations. After this, you are in a position to do a detailed Exploratory Data Analysis that gives you deep insights into your data. And finally, you prepare the data as expected by the algorithm.

This involves understanding the size, shape, data type, column names, the multiple sources of data. Here we also take a look at which data is categorical in nature and which is continuous.

You could even get some basic statistics like the minimum, the maximum, the average, the 75th percentile data, to get a feel for the spread in the numerical data.

First, you check for null values and see if you can treat them meaningfully. Else you drop that data as it could create problems later on.

This means you drop columns that have a high percentage of null values. For other columns with nulls, based on the data and the meaning of the column, you can impute using various mechanisms, the simplest of them being to impute with 0, the mean or the median. There are advanced techniques too that can be employed to impute. Sometimes it may be good to leave them unimputed, as you do not intend to skew the data in favour of or against a value. You may choose to drop the specific rows instead.
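As a small illustration of these choices (the DataFrame, its columns and the 50% drop threshold are all hypothetical):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "income": [1610.0, np.nan, 9930.0, np.nan, 45400.0],
    "notes": ["a", None, "c", "d", None],
})

# Drop columns with too many nulls (the 0.5 threshold is a judgement call)
null_frac = df.isnull().mean()
df = df.drop(columns=null_frac[null_frac > 0.5].index)

# Impute a numeric column with its median
df["income"] = df["income"].fillna(df["income"].median())
print(df["income"].tolist())
```

The median is often preferred over the mean here because it is not pulled around by outliers.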

Coming to the transformations at this stage, it is to ensure your data is transformed to allow for a meaningful exploratory data analysis.

Firstly, you can plot graphs and check for outliers. If there are any, treat them as detailed in my article on __treating outliers__.

You may choose to create new variables through binning or through derivations from existing variables.

You can also transform categorical variables into numerical ones through techniques like one-hot encoding or label encoding.
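For instance, one-hot encoding with pandas might look like this (the `fuel` column is a made-up example):

```python
import pandas as pd

df = pd.DataFrame({"fuel": ["petrol", "diesel", "petrol", "cng"]})

# One-hot encoding: one indicator column per category
encoded = pd.get_dummies(df, columns=["fuel"])
print(encoded.columns.tolist())
```

Passing `drop_first=True` would drop one category per variable, which is common for linear models to avoid redundant columns.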

Now you are ready to start Exploratory Data Analysis

This is a very important step. Here is when you really get more familiar with your data through data visualization and analysis. You can see patterns and correlations between the predictors and the target variable or between the predictors themselves.

Broadly these are the steps involved in EDA.

- Univariate Analysis
- Bivariate Analysis
- Understanding correlation between data
- Plotting and visualising the data for any of the above steps
- Checking for imbalance in data

You do univariate analysis for categorical variables using bar charts and continuous variables using histograms as shown here:

You could even draw box plots to understand the spread of the data from a different view.

Then you do a Bivariate analysis that analyses the relationship between two variables. It can include heatmaps for correlation analysis, pair plots for continuous-continuous data relationships, box plots for categorical-continuous relationships and bar plots for categorical-categorical relationships as shown here.

You also check for imbalance in the target data so that you can model it through the correct means. At every stage, you draw some inferences about the data on hand.

You can see all of the above steps in detail, through data and code, in this GitHub repo on __Exploratory Data Analysis__. Sometimes the above steps may be slightly iterative.

Next, you start preparing the data for the model

If you have multi-level categorical variables, you create dummy variables to make them numerical columns.

You also need to scale numerical features. Typically, you scale features after splitting your data into train and test data so that there is no leakage of information from the test data into the scaling process. The various __scaling techniques__ have already been discussed.

If you have a huge number of features, you could also go through feature selection using the various techniques discussed earlier in __Feature Selection Techniques__, including Recursive Feature Elimination, manual elimination, a balanced approach and even Principal Component Analysis, which I have not discussed so far.

Now that you have selected the features of importance, split the data into train and test sets and further scaled the data, you are finally ready to get into modelling.

Without all of these steps, and maybe more, modelling would be a case of garbage in, garbage out. You may get very unstable models or actually run into errors thrown up by libraries as certain assumptions are violated.

The above process is often iterative in nature and goes on improving the data and the knowledge and insights from the data as you go through the same. It is only after this that you can start using the modelling techniques provided by various libraries and derive the benefits from the same.

It is often said that a data scientist spends a majority of his/her time on these steps more than in modelling itself. Without knowing any of the above, it would be futile to just learn modelling using libraries.

So, far we have just seen the basics of K-Means algorithms. They certainly help in the unsupervised clustering of data. However, we must realise that not all of this is completely foolproof.

There are a lot of decisions and conscious considerations to come up with useful clusters. Let us look at some of the important ones.

We have been saying that we will start with "K" clusters. But how do we decide what is the right value for K? Should we create 2, 3, 5, 10 clusters? What is the right number for K?

The first thing that comes to our mind is to look at scatter plots like the one here and it will be obvious as to how many clusters exist! Isn't it?

Yes, when we have a plot like the one here and we see the intuitive number of clusters, it is very easy to say that the data has 3 clusters and so K should be 3. In reality, however, data is rarely just 2-dimensional or even 3-dimensional. The moment we move to data with multiple dimensions, visual representation is not always possible and we need other means to decide the number of clusters we want to create.

So, what we are saying here is that **we want to find out the natural number of clusters** that already exist in my data without visually being able to see it. That is one criterion for sure, to decide on K.

For this, there may be many methods that aid you, but I would like to mention two of them here: **silhouette analysis** and the **elbow method**. They help in coming up with the right number of clusters using a quantitative method. They indicate the natural or intuitive number of clusters that exist in the data.

However, apart from the above-mentioned quantitative methods, business or domain knowledge would also have to be used to decide K, the number of clusters. Even if silhouette analysis says 3 clusters exist, you might believe that it makes sense to have 4 clusters based on your business experience and knowledge. So, you could go with 4 and see if you get the benefits of clustering from that.

Again, we have so far randomly chosen the initial cluster centres. If we choose the initial centroids randomly, we should be aware that we may not always end up with the same set of clusters!! Don't believe me?

Let us take an example data set and a random set of starting centroids. I chose completely different centroids to start with for the below data and the clusters I obtained were very different each time. The pink and blue are the clusters obtained each time I started with different random centroids.

*This was obtained using *__https://www.naftaliharris.com/blog/visualizing-k-means-clustering/,__* which provides a very good simulation for the k-means algorithm with various sample data.*

So you see, in certain types of data, the initial centres can have an impact on the clusters formed later. The clusters can keep varying. Hence, the initial centroids have to be chosen wisely.

So, what is the criterion or intelligence to be used to decide the right set of centroids to begin with? One of the standard ways of choosing the initial centroid points is an algorithm called the **K-Means++ algorithm** *(which we can probably address in some later blog)*.

At a very high level, what this algorithm does is help you pick the farthest points possible as the initial centroids, again through distance calculations. This is quite an intensive process if the data set is large.

From the above example scatter plot, you would have wondered if there are any natural clusters at all in the given data. In fact, that is uniformly spread, random data and hence is not suitable for forming clusters. So, before we jump into clustering any data, we have to check for what is called the "**Cluster tendency"** of the data. If the data shows no cluster tendency, there is no point in trying to cluster the data. It will be a futile effort. So, how do you check the **Cluster tendency**?

The Cluster tendency is given by a statistic known as the "**Hopkins statistic**". If the value of the Hopkins statistic is close to 1, the data is said to be clusterable, and if it is around 0.5, the data is uniformly distributed, as in the above example, with no cluster tendency. If the statistic is 0, then the data is in a grid format, again with no possibility of meaningful clusters. Hence you would look for a Hopkins statistic close to 1 before you embark on clustering.

For most algorithms, we know outliers play havoc unless treated. So it is in the case of K-Means clustering too. We must recollect that K-Means is heavily dependent on the calculation of means. And an average or mean is always impacted by outliers.

If there are outliers, they tend to pull the cluster away from its natural centre and sometimes bring along points that naturally belong to another cluster. This cannot be seen till we treat or remove the outliers. The farther the outlier, the more impact it has on the homogeneity of the cluster and hence on the formation of the right clusters.

Take a look at Figures 1 and 2. Figure 1 has no outliers. Figure 2 has one outlier at (60,2). You can notice that the natural clusters are so well-formed in the first case while the clusters are totally skewed in the second case. All of the data looks like 1 cluster and the outlier on its own as another, though it takes 2 more data points with it to form the 2nd cluster. Clearly, the intra-cluster homogeneity is very low here and the inter-cluster heterogeneity is also low, defeating the aim of clustering itself.

Hence it is quite imperative to treat outliers before clustering of data is undertaken. The various methods that can be employed to treat outliers are already discussed in an earlier blog of mine on the __Treatment of Outliers__.

Data on different scales is another aspect that impacts clustering considerably. Recollect that this algorithm works on distances. If some data is on a very large scale and some of the other data is on a much smaller scale, the distance calculation would be dominated by the dimension that is on a large scale and again impacts the cluster formation.

Though I am not getting into examples here, it is best to standardise all the data to a common scale. It means to have a 0 mean and 1 standard deviation to ensure that equal weightage is given to all variables in the cluster formation. Their spread and other characteristics remain the same but are scaled to a comparable scale. You can look up the various methods of feature scaling, as it is called in my earlier blog on the __importance and methods employable for feature scaling__.

The K-Means algorithm itself works on the concept of distance calculations and hence would not work with categorical variables. You might have to look at algorithms like the **K-Mode **algorithm for such data.

- K-Means works well only with spherical clusters and not with non-spherical ones. For example, if there are two cylindrical groups of points parallel to each other, the clusters would be ill-formed because the algorithm is based on distance minimisation.
- It does not have the ability to learn by itself what the right K value needs to be. This has to be given as input to it.
- It will go ahead and cluster any type of data even though there are no clusters naturally available. For example, even if we gave uniformly distributed data, it went ahead and created 2 clusters in the above example in Consideration 2.
- K-Means also gives more weightage to bigger clusters compared to smaller ones.

K-Means is an easy-to-understand algorithm and even easy to implement, as long as all the practical considerations discussed are kept in mind. It has two main steps of assignment and optimisation that are iteratively applied to the given data points till the centroids do not change anymore. A very simple and yet very powerful and useful algorithm in the unsupervised category.

We also looked at the mathematical representation of each of these steps and the cost function that we are minimising here. This should give a good theoretical understanding of the K-Means algorithm.

Also, it has a few drawbacks that need to be kept in mind before using it for clustering. I hope to take you through an example with code in some later posts.

References:

We know in Machine Learning, we need to understand what is the cost function that any algorithm is trying to work with, so that we can either minimise or maximise it. And further, automate it.

In this part 2, we would look at the mathematical representations of the cost function and the two steps of assignment and optimisation.

Straight away, I would start with the cost function, represented by **J**, and then explain it. The cost function is:

J = Σ_{k=1..K} Σ_{i ∈ cluster k} ‖ x_i − μ_k ‖²

where K represents the number of clusters and "k" is the cluster number to which data point "i" belongs; k ranges from 1 to K. Therefore, for every "i" in cluster "k", the squared Euclidean distance between the ith data point and the centroid (represented by the mean mu-k) is taken and summed over all the points in that cluster. This is repeated for every cluster in the K clusters and the summation is done over all the K clusters.

Therefore, this is represented as two summations. One summation is the sum of all distances of points within one cluster from the centre. The next level summation is the sum of all the distances across all clusters.

Hence, **J** stands for the total squared error across all the clusters w.r.t their own centroids and that is what we want to minimize. If the clusters are very tight and good, then the overall cost function value will be very low. And that is what we are aiming for. In layman terms, that would mean that all the data points of a cluster are very close to the centroid.
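To make J concrete, here is a small sketch (not part of the original post) that computes the double summation by hand on toy data and checks it against sklearn's `inertia_` attribute, which reports exactly this quantity:

```python
import numpy as np
from sklearn.cluster import KMeans

# Four points forming two obvious clusters
X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# J: sum over clusters of squared distances to each cluster's own centroid
J = sum(np.sum((X[km.labels_ == k] - mu) ** 2)
        for k, mu in enumerate(km.cluster_centers_))

print(J, km.inertia_)  # the two values agree
```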

Let us look at the similar mathematical representations of the 2 important steps: the **assignment** step and the **optimization** step.

This is the step where the distance between each centroid and every data point is calculated. For a data point x_i and a cluster mean μ_k, the squared Euclidean distance is given by

d(x_i, μ_k) = ‖ x_i − μ_k ‖²

This distance is calculated K times for each point if there are K clusters and hence K cluster centroids. So, if we have 10 data points and 3 cluster centroids, we start with data point 1 and calculate 3 distances of point 1 from the 3 cluster centres. And the minimum distance among them is taken and the point 'i' (which is 1 in this case) is assigned to that cluster, say cluster 2. Likewise, we take data point 2 and then calculate the 3 distances from the 3 centroid points and assign data point 2 to the cluster which has a minimum distance from this data point.

So, what we are trying to do here is calculate the minimum (argmin) of the distance of every point to every cluster centre, shown as

k* = argmin_k ‖ x_i − μ_k ‖²

This is repeated for every data point and the data point is assigned to the cluster whose centre it has a minimum distance with. In the above example, we would have calculated 10 points x 3 centroids = 30 distances before we have assigned all the 10 points to 3 clusters.
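The assignment step above can be sketched in a few lines of NumPy (toy points and centroids, purely for illustration):

```python
import numpy as np

points = np.array([[1.0, 1.0], [9.0, 9.0], [2.0, 1.0]])
centroids = np.array([[0.0, 0.0], [10.0, 10.0]])

# Squared Euclidean distance of every point to every centroid: shape (3, 2)
dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)

# Assignment step: each point joins its nearest centroid (the argmin)
assignment = dists.argmin(axis=1)
print(assignment)  # points 0 and 2 -> centroid 0, point 1 -> centroid 1
```

With 3 points and 2 centroids this computes 3 × 2 = 6 distances, exactly as counted in the example above.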

Now, having created the clusters based on the distance calculations, you recompute the cluster centre mu-k for the K clusters. You calculate the mean or the cluster centre as shown here:

μ_k = (1 / n_k) Σ_{x_i ∈ cluster k} x_i

where n-k is the number of data points that belong to the kth cluster and the x-i are all the points that belong to the kth cluster. The mean is taken only over the data points that belong to that cluster. So, if cluster 1 had 5 points, the mean of all these 5 points is taken to come up with the new centroid. Note that the mean is taken for each dimension of the data point. If the data point is represented by x, y, z, then the mean is individually taken for x, y and z respectively to calculate the new centroid, represented by its own x, y, z values.

This formula must look very familiar. It is nothing but the formula for the mean.

This is what is repeatedly done for all data points of every cluster to get the new cluster centres of each cluster. In the example taken above, the means are found for all 3 clusters and 3 new cluster centres are calculated.
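The optimisation step can likewise be sketched in NumPy (continuing the toy illustration; the arrays are made up):

```python
import numpy as np

points = np.array([[1.0, 1.0], [2.0, 1.0], [9.0, 9.0], [9.0, 11.0]])
assignment = np.array([0, 0, 1, 1])  # labels produced by the assignment step

# Optimisation step: the new centroid of each cluster is the mean of its
# members, taken dimension by dimension
new_centroids = np.array(
    [points[assignment == k].mean(axis=0) for k in range(2)]
)
print(new_centroids)
```

Note how the x and y coordinates are averaged independently, exactly as the formula prescribes.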

**Stopping Point:**

The above two steps are repeated till we reach a point where the centroids do not move any further. That determines our stopping point of the algorithm and the clusters so formed are the most stable clusters.

These formulae can be used in an excel sheet, to begin with, along with the data points shared in Part 1 of this article and you can try it out on your own.

In Part 2, we have understood the mathematics that is used in K-Means. Nothing too difficult. However, there are quite a few practical considerations that impact the usefulness of the K-Means clustering. Each of those will be talked about in Part 3 of this series.

We also need to understand its limitations, how to overcome them, its advantages and disadvantages.

I plan to explain the basics of K-Means clustering in a 3 part series. The first part, that is, this post, will take an example of very few data points and show how the clusters are formed.

The 2nd part to follow will talk about the cost function that is minimised in the K-Means algorithm with simple mathematical representations for the various steps.

The final 3rd part will talk about the practical considerations for the K-Means algorithm.

In K-Means, the similarities between data points are decided based on the distances between the points and a centroid; hence this is called a **centroid-based clustering** algorithm. If we want K clusters, we start with K centroids chosen randomly. Then, the distance of each point to these centroids is calculated and the points are associated with the centroid they are nearest to.

To recollect, the **centroid** of a set of points is another point having its own x and y coordinates that is the geometric centre of all the given points. This is calculated by taking the mean of all 'x' points to give the 'x' of the centroid and similarly average of all the 'y' points to get the 'y' point for the centroid. This is true, assuming that the set of given points have only two dimensions x,y. The same can be extended to n dimensions.

K-Means is an iterative algorithm that keeps creating new clusters with some adjustments till it finds a stable set of clusters and the points do not move from one to another.

The steps of this algorithm can be detailed as follows:

- We start with 'K' random points as initial cluster centres.
- Then, each point in the data set is assigned to one of these centres based on the minimum distance to the centre (most often the __Euclidean distance__)
- Once a cluster is formed, a new centroid is calculated for the cluster, i.e. the mean of all the cluster members
- Then, again distances are calculated for all the data points to the new centroids and the re-assignment happens based on minimum distance.
- Steps 3 and 4 are repeated till the points do not move any further from one cluster to another nor do the centroids move too much.
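The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation; the six data points and K = 2 are made up:

```python
import numpy as np

def kmeans(points, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: pick K random data points as the initial cluster centres
    centroids = points[rng.choice(len(points), k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign every point to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its members
        new_centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])
        # Steps 4-5: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Two well-separated blobs; K = 2 should recover them
pts = np.array([[1, 1], [1, 2], [2, 1], [9, 9], [9, 10], [10, 9]], dtype=float)
labels, centres = kmeans(pts, k=2)
print(labels, centres)
```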

Let us take an example set of data points and see how this is done.

**Step 1**:

For the above data set, we take 2 centroid points at (10,8) and (18,6) randomly, to begin with, keeping in mind the range of the data points. These are represented by the yellow points in Figure 1.

__Step 2:__

Based on the Euclidean distance from all the points to these 2 centroids, two clusters have been formed in red and green. 7 points in the red cluster and 4 points in the green cluster. This is called the **assignment step**.

The formula for the Euclidean distance between two points (x1, y1) and (x2, y2) is given by d = √((x2 − x1)² + (y2 − y1)²)
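As a small sketch, here is how a hypothetical point would be compared against the two starting centroids of this example, (10, 8) and (18, 6):

```python
from math import sqrt

def euclidean(p, q):
    # sqrt((x1 - x2)^2 + (y1 - y2)^2)
    return sqrt((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)

point = (12.0, 7.0)                 # a made-up data point
d1 = euclidean(point, (10, 8))      # distance to centroid 1
d2 = euclidean(point, (18, 6))      # distance to centroid 2
print(d1, d2)                       # the point joins the nearer centroid's cluster
```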

__Step 3:__

Once these clusters were formed, we realise that the current centroids are not truly the geometric centres of their respective clusters. So, we find the new centroid by taking the mean of all the x and y values of the 7 points in cluster 1. That value turns out to be (6.7, 6.1) as shown in Figure 2.

Similarly, the new centroid is calculated for cluster 2 which shifts from (18,6) to (17.5,4.5), for the green points. This is called the **optimization step**.

__Step 4:__

Since the centroids have shifted, it is now worth recalculating the distance of all the points with respect to the new centroids. So, we calculate the Euclidean distance of all the points with the new centroids again. We see that one of the earlier red points is now closer to centroid 2, the green cluster's centroid, and hence has been reassigned at the end of this cycle to cluster 2.

We repeat steps 3 and 4 (the assignment and the optimisation steps) and we get the plot as shown in Figure 3.

Here, centroid 1 has slightly shifted from (6.7, 6.1) to (5.6, 5.4) and centroid 2 has shifted from (17.5, 4.5) to (16.6, 5.6).

However, on recalculating the distances of all points with the new centroids, no points have moved from one cluster to another.

We repeat the process one more time to see if the centroids move or the clusters change.

And voila, neither of them change as shown in Figure 4. It means that we have reached an equilibrium stage and that the clusters are stable.

This is how K-Means clustering works!

Just clustering around a centroid point through calculation of Euclidean distances of all data points to the centroid, and readjusting the centroid till it moves no further. Looks very simple, doesn't it? Very easily automatable and can be represented mathematically too.

K-means clustering is an unsupervised machine learning model that essentially works through 5 steps, of which the two steps of "**assignment**" of points to clusters and "**optimization**" of the cluster through calculating a new centroid are iteratively repeated till we reach stable clusters.

The stability of clusters is defined by the points not jumping across clusters and the centroids not changing significantly, in subsequent iterations. The core idea in clustering is to ensure **intra-cluster homogeneity** and **inter-cluster heterogeneity** to arrive at meaningful clusters. This has already been explained in my blog on "__Introduction to Clustering__".

In the next post, I will take you through the cost function for K-Means and a few mathematical formulae explaining the two important iterative steps. Till then, see you :)

This is an unsupervised learning technique where there is no notion of labelled output data or target data. An unsupervised method is a machine learning technique in which the models are not trained on a labelled dataset but discover hidden patterns without the need for human intervention.

A few unsupervised learning techniques apart from Clustering are Association and Dimensionality reduction.

Let us look at a few examples in order to understand clustering better.

If you are given a news feed from various news channels or portals and if you had to categorise them as politics, sports, financial markets etc. without knowing upfront, what categories exist, then, this is a typical clustering application. It may turn out that there are very standard categories that appear over and over again. Though the algorithm cannot name them automatically as sports or politics, it can cluster all the sports articles into one cluster and the political articles as another.

However, in certain periods, totally new categories may turn up. For example, during the Olympics, a totally new category may be discovered, such as "Olympics news" or "Paralympics news". Being able to discover and identify newer clusters as they form, and to categorise them as such, is also a part of Clustering.

Another very common example is that of customer segmentation. If a large retailer wants to create promotions or marketing strategies based on customer behaviour, it would need to be able to categorise or segment its customers based on their behaviours, demographics and so on. One way of segmenting could be based on spends. Another could be based on age-group-related shopping habits. Yet another could be based on location and its influence, like beaches versus high altitudes. It could be based on loyalists versus coupon lovers. These should be gleaned from the customer data the retailer has. Then, their promotions can be very targeted and the conversion rate can improve immensely.

Clustering is also heavily used in the medical field like human gene clustering and clustering of organisms of different species or within a species too.

Note that in each of the cases there are no labels attached to the data. It is after the clusters are formed that you can get actionable insights from the clusters that are created.

There are many types of clustering algorithms of which here are the top 4 well-known ones:

- Connectivity-based Clustering
- Centroid-based Clustering
- Distribution-based Clustering
- Density-based Clustering

Each of them has its own characteristics with its own advantages and disadvantages. Today, I will provide a brief introduction to a couple of clustering algorithms:

- K-Means, which is one of the well-known clustering algorithms, is a centroid-based algorithm.
- Hierarchical clustering is a connectivity-based clustering algorithm.

All clustering algorithms try to group data points based on similarities between the data. What does this actually mean?

It is often spoken of, in terms of **inter-cluster heterogeneity** and **intra-cluster homogeneity**.

**Inter-cluster heterogeneity:** This means that the clusters are as different from one another as possible: the characteristics of one cluster are very different from those of another. This makes the clusters very stable and reliable. For example, consider clusters of customers created based on highly populated areas versus thinly populated areas. If the difference in population is distinct, as with cities and villages, they turn out to be very stable and distinct clusters.

**Intra-cluster homogeneity:** This talks about how similar the characteristics of all the data within a cluster are. The more similar, the more cohesive the cluster, and hence the more stable.

Hence the objective of clustering is to **maximise the inter-cluster distance** (inter-cluster heterogeneity) and **minimise the intra-cluster distance** (intra-cluster homogeneity).

This is one of the most popular clustering algorithms and one of the easiest as well. Here, we are looking to create a pre-determined "**K**" number of clusters.

Here, the similarity or lack of similarity between data points is decided based on distances between points. The distance is measured from a **centroid** and hence this is called a **centroid-based clustering** algorithm. If we are starting with K clusters, we start with K centroids randomly chosen (though there is more science to it). Then, the distance of each point from these centroids is calculated and the points are associated with the centroid they are nearest to. *Note that it would be ideal to have the centroids placed as far away from each other as possible.*

Thus clusters of data points are formed. The centres/centroids are recalculated and again the steps are repeated till the points don't seem to be moving from one cluster to another.

The distance formula used here is the Euclidean distance between every point and the centroids. The closer the points are to each other, the greater the chance of belonging to the same cluster.

*Euclidean distance is very simple high school geometry. A very simple explanation of the same can be found *__here__*.*

We will look at a practical example with data in a subsequent article, but for now we can summarise our understanding thus: in K-Means, the clusters are formed based on the distances between points, where K stands for the number of clusters we have decided to create or glean from the data.

Here is a graph showing how shoppers have been clustered based on the amount they spend at the shop and the frequency of orders they place.

Notice, 3 clusters have been formed with the red showing customers who spend small amounts and shop very frequently. The black cluster shows the group of customers who shop less frequently but spend large amounts. The blue cluster shows the customers who spend small amounts and are not so frequent shoppers either, the least profitable customers.

One of the biggest disadvantages of K-Means clustering is that you have to choose the K value, i.e. the number of clusters, upfront. This is overcome in hierarchical clustering.

In Hierarchical clustering, you either start with all data as one cluster and iteratively break down to many clusters depending on similarity criteria or you start with as many clusters as your data points and keep merging them till you get one large cluster of all data points. This leads to hierarchical clusters of 2 types:

- Divisive and
- Agglomerative

The positive point of hierarchical clustering is that you do not have to specify upfront how many clusters you want.

It creates an inverted tree-shaped structure called the **dendrogram**. A sample dendrogram is shown here:

Because of this tree structure and the way the clusters are formed, it is called hierarchical clustering. *This figure shows the clusters created for the same customer data that was used to derive the 3 clusters of customers in the above graph under K-Means.* Here too you can see it suggests 3 clusters through the green, red and turquoise clusters.

Interpreting the dendrogram is the most important part of the hierarchical clustering. Typically one looks for natural grouping defined by long stems.

The height of the dendrogram at which the different clusters are fused together represents the dissimilarity measure. Often it is the Euclidean distance again that is calculated to understand how similar or dissimilar two points are to each other, and that is represented by the height in the dendrogram. Here the distance is calculated between the points themselves and not from any centroid. Hence it is called a connectivity-based clustering algorithm.

The clusters that have merged at the bottom are very similar to each other and those that merge later towards the top are the most dissimilar clusters. We will talk about linkages and its types in a later article that goes into more details about hierarchical clustering.

Here is a brief description of the two types of hierarchical clustering and how they differ from each other though both end up creating dendrograms.

This starts with all the data in one large cluster, from where it is divided into two clusters based on the least similarity between them and further divided into smaller clusters until a termination criterion is met. As explained earlier, this is based on connectivity which is essentially saying that all the points close to each other will belong to one cluster.

In this, the clusters are formed the other way round. Every data point is taken as its own cluster to begin with, and then the algorithm starts aggregating the most similar clusters till it ends up with one single cluster of all the data available. Hence it is also known as the bottom-up method.

The difference between these two methods is pictorially represented here.
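As a minimal sketch of the agglomerative (bottom-up) variant, SciPy's hierarchy module builds exactly this merge history, where the merge heights are the dissimilarities plotted in a dendrogram. The six customer points below are made up:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Six made-up customer points (spend, frequency); agglomerative clustering
# merges the closest pairs first, building the dendrogram bottom-up
X = np.array([[1, 1], [1.5, 1], [5, 5], [5.5, 5], [9, 1], [9.5, 1.5]])
Z = linkage(X, method='single')                   # merge history; heights = dissimilarity
labels = fcluster(Z, t=3, criterion='maxclust')   # cut the tree into 3 clusters
print(labels)
```

Calling `scipy.cluster.hierarchy.dendrogram(Z)` on the same `Z` draws the inverted-tree plot discussed above.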

Clustering algorithms are unsupervised algorithms. There are many types of clustering algorithms, each with its own advantages and disadvantages. Inter-cluster heterogeneity and intra-cluster homogeneity play a huge role in the creation of clusters.

Out of the four categories of clustering algorithms, we have looked at examples of two common types of algorithms - the K-Means and the Hierarchical clustering, at a very high conceptual level. The mathematics and the intricacies will be looked at, in subsequent articles.

However, coming back to the topic at hand: almost all on the Machine Learning journey start with learning about Linear regression, at least all who are serious learners :) Initially, it seems too simple to be of use for predictions. But as you learn more and more, you do realise that it can be a solution for a good set of problems. However, can you use Linear regression for any problem at hand? Or do you have a set of constraints that you need to be aware of, so that you use it only in the correct scenarios?

If you have read my articles so far on Linear regression, starting from

- __Regression Algorithms__
- __Linear Regression through Code - Part 1__
- __Linear Regression through Code - Part 2__

Going all the way through the various concepts used in the above articles, individually as part of these articles

- __Feature Scaling and its Importance__
- __Outliers and their treatment__
- __Multicollinearity__
- __Feature Selection in Machine Learning__
- __Prediction and Forecasting in Machine Learning__

there are hardly any assumptions mentioned about linear regression per se.

The only assumption, if at all, and that too a very implicit one, is that there must be a linear relationship between the target and the independent variables. And that is the reason we are able to express that relationship as:

Y = β0 + β1X1 + β2X2 + ... + βnXn + ε

where the Xs are the independent variables and Y is the dependent variable. The betas are what the model comes up with for the given data, with the epsilon (ε) as the mean-zero error or residual term.

However, this is not the only assumption that is true in the case of linear regression, there are other assumptions too to make the inferences of a model reliable. This is because we are still creating a model from a sample and then trying to use that model for a general population. This implies that we are uncertain about the characteristics of the larger population and that needs to be quantified. Hence we need to have a few assumptions about the data distribution itself.

If any of these assumptions do not turn out to be true with the data that you are working on, then the predictions from the same would also be less reliable or even completely wrong.

There are 4 assumptions that need to hold good including the one already stated. They are

- A Linear relationship exists between Xs and Y
- The error terms are normally distributed
- The error terms have a constant variance (or standard deviation); this is known as homoscedasticity
- Error terms are independent of each other

Clearly, there are no assumptions about the individual distributions of X and Y themselves. They do not have to be normal or Gaussian distributions at all.

Let us understand the assumptions. The first one is obvious.

What does the second assumption mean?

When we are fitting a straight line for Y vs X, there can be a whole host of Y values for every X. However, we take the one that best fits the line. The actual point may not be on the line and that gives us the residual or error.

In the figure below, e65 is the error at x = 65 and e90 is the error at x = 90.

Therefore, the Y at x = 65 would be

Y = β0 + (β1 × 65) + e65

that is, the fitted value plus an error of e65.

This error itself can be anything. But considering that we want e (epsilon) to be a mean zero error, we will be fitting the line in such a way that the errors are equally distributed either positively or negatively around the line. That is what would be deemed the best fit line.

Since in linear regression, the data points should ideally be equally distributed around the best fit line, to ensure that the mean residual is zero, this makes the distribution of errors a normal distribution.

If you plot the residuals of your sample data, this is the kind of graph you should get.

This is the second assumption.

This is the 3rd assumption. This is also known as homoscedasticity. The errors have a constant variance (sigma-squared) or a constant standard deviation (sigma) across the entire range of the X values.

This is to say that the error terms are distributed with the same normal distribution characteristics (defined by mean, standard deviation and variance) through the data range.

See the patterns of residual plots in the above figure, the plot (a) shows no specific pattern in the residuals implying that the variance is constant. In such a case, linear regression is the right model to use. In other words, it means that all the possible relationships have been captured by the linear model and only the randomness is left behind.

In plot (b) you see that the variance is increasing as the samples progress, violating the assumption that the variance is a constant. Then, linear regression is not suitable in this case. In other words, this means that the linear model has not been able to explain some pattern that is still evident in the data.

If the data is heteroscedastic, it means that the variance of the errors changes across the range of X, and the inferences drawn from such a model become unreliable.

This is the 4th assumption: the error terms are not dependent on each other and have no pattern in themselves if plotted out. If there is a dependency, it would mean that you have not been able to capture the complete relationship between the Xs and Y through a linear equation. There is some more pattern that is visible in the error.

Getting a residual plot like this shows that the variance is constant as well as the fact that the error terms are independent of each other.

These assumptions are necessary to be tested against the predicted values by any linear model if you want to ensure reliable inferences.

The meaning of these assumptions is - what is left behind (epsilon/error) that is not explained by the model is just white noise. No matter what value of Xs you fit in, the error's variance (sigma-square) remains the same. For this to be true, the errors should be normally distributed and have a constant variance, with non-dependence on each other.

This also implies that the data on hand is IID data or Independent and Identically distributed data, that is suitable for linear regression.

In layman's terms, all these assumptions go to say that the dependent data has a truly linear relationship with the independent variable(s) and hence is explainable with a linear model. We are ensuring that we are not force-fitting a linear model on something that is not linearly related, where there probably exists a relationship that is exponential, logarithmic or explained by higher-order equations.

Hence, you need to test for each of these assumptions when you build your linear models to use the inferences with confidence.


Let us understand the nuances of each of these today.

**Prediction**, as the word says, is about estimating the outcome for unseen data. For this, you fit a model on a training data set and use that model to predict the outcome for any unseen data.

In prediction, we do not make any assumptions about the shape of the data except that there is a linear dependency of the target variable on the independent variables and that is explained by a model** f(x)** (x could be a set of predictors, x1, x2, x3,..., xn)

Once the model is known, the model is used to **interpolate** a target variable based on a new set of unseen independent variables.

Though it may not always be true, most often, we use predictive models for understanding the impact of the independent variables on the target variable. Hence you want to keep the model as simple as possible.

For example, you have a use case where the number of viewers of a TV show is reducing. You want to fit a model to understand this behaviour, based on various factors like the actors, the plot of the show, the days of the week the show airs, the competing shows that have come at the same hour etc. You do a "multiple linear regression" and get a model. As soon as you have the model, you can see which predictors are more influential and with that insight, you can take corrective actions. You can even predict based on a few tweaks, what is the impact on viewership.

Here you are not very keen on high accuracy of the prediction but you are keen on knowing the cause for the change in the outcome. Actionables are expected usually, from these predictive models.

**Forecasting** is a sub-discipline of Prediction where you are predicting for a future point in time.

For example, weather forecasting. We would not say weather prediction. Similarly, we say, sales forecasting. Given a lot of historical sales, you come up with a model, using which you forecast for a future date.

This kind of sales forecasting is valid provided the conditions remain the same as the conditions of the training data. If the training data is for a non-festive period, then, the forecast will also work for the same. But it will certainly not work for a festive period.

This implies that we are making an assumption that the conditions remain the same for the forecast to hold true. If this assumption changes, the forecast could go completely wrong.

In fact, the language used when you give a forecast, is different. You often say that conditions remaining the same, the forecast is a specific value. This is seen as an **extrapolation** of the data from the existing time frame to a future time frame.

Also, regarding the outcome, here most often, you are looking for higher accuracy and not really for understanding the impact of the independent variables on the target variable. Hence there is a tendency to make the model complex, as the goal is different.

Let us look at a very simplistic example. You have data that shows how salary varies with years of experience. Clearly, there is a linear dependence between these two variables as shown in the diagram below

Note here, if you want to "predict" what might be the salary for someone with 6.5 years of experience, you can interpolate as shown by the green dot and line and you can derive that the salary is probably around 90000 units. Using this linear equation formula you can predict for any number of years of experience in the given range of 1.1 to 10.3 years of experience, for which we have data. You could go beyond 10.3 years too but no guarantees there as you have no data to substantiate that the prediction is still linear after 10.3 years.
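This interpolation can be sketched with a least-squares line fit. The salary figures below are made up to roughly match the described range; they are not the actual data behind the diagram:

```python
import numpy as np

# Hypothetical experience (years) -> salary data with a roughly linear trend
years = np.array([1.1, 2.0, 3.2, 4.0, 5.3, 6.8, 7.9, 9.0, 10.3])
salary = np.array([39000, 44000, 54000, 57000, 68000, 91000, 98000, 105000, 122000])

slope, intercept = np.polyfit(years, salary, deg=1)   # least-squares straight line
predicted = slope * 6.5 + intercept                   # interpolation at 6.5 years
print(round(predicted))
```

Feeding in a value beyond 10.3 years would still produce a number, but as noted above, there is no data to substantiate that the relationship stays linear there.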

Let us look at an example that has the temporal component to it.

This is a graph that shows the increase in Airline passenger traffic with time. This has a linear component and a seasonal component. For the discussion sake here, kindly ignore the seasonal component. Assume, you separate out the linear growth and the seasonal shape and you will see that over the years, there is steady linear growth. You can use this data to come up with the linear growth expected, say in 1962.

You are using all of the historical data to come up with what might be a future passenger load in 1962, which is nothing but an extrapolation. This is called forecasting.

It must strike you here that the forecasting will be accurate only if the conditions of the historical data and the future remain the same.

Most often, non-temporal use cases with interpolation are termed predictions, and temporal extrapolation is called forecasting.

Kindly share your thoughts below, it will be highly appreciated :)

It is, of course, science for all the mathematical rigour it goes through for being a selected feature. Today, we will first understand why feature selection is an important aspect of Machine Learning and then, how we go about selecting the right features.

Suppose we have 100 variables in our data, as potential features. In that, we want to know what are the best set of features that will give the highest accuracy or precision or any metric that you are looking for. We also want to know which of these even contribute towards predicting the target variable. It will be a lot of trial and error before you can find that out.

One way would be to use a brute force method. Try every combination of variables available and check which predicts best. Implying that we try one variable at a time for all hundred variables, two at a time for all combinations within the 100 variables, three at a time with all combinations and so on. This would lead to 2 to the power of 100 combinations.

Therefore, even if we have just 10 independent variables, it would be 1,024 combinations. And if the number of variables increases to 20, the combinations snowball to 1,048,576. Hence this does not sound like an option at all.
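A quick sanity check of these numbers, counting the possible feature subsets as a power of two:

```python
# Number of possible feature subsets grows as 2**n (including the empty set)
for n in (10, 20, 100):
    print(n, 2 ** n)
```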

Then, how do we go about selecting the features? There are two ways of dealing with it - Manual or automated feature elimination.

As you can guess, Manual Feature elimination is possible and done when the number of variables is very small (say 10 to a max of 15) as it becomes prohibitive to do that when the number of features becomes large. Then you have no choice but to go for automated feature elimination.

Let us look at both of them.

As already mentioned, this is possible only when you have fewer variables.

The steps involved are:

- Build a model
- Check for redundant or insignificant variables
- Remove the most insignificant one and go back to step 1

Right. You build the model and then you try to drop those features that are least helpful in the prediction. How do you know that a variable is least helpful? Two factors can be looked at: either the **p-value** of all the variables or the **VIF (Variance Inflation Factor)** of all the variables.

P-value is a concept that is part of hypothesis testing and inferences. I do not plan to explain that today. Hoping you are aware of it if you have done any regression modelling. But, in a nutshell, you need to know the following about p-value:

- P-value is a measure of the probability that an observed difference could have occurred just by random chance.
- The lower the p-value, the greater the statistical significance of the observed difference

Therefore, if any variable exhibits a high p-value, i.e. greater than 0.05, typically, you can remove that feature.

To know more about VIF, please refer to my article on __Multicollinearity__. Just to summarise here, VIF gives a measure of how well one of the predictors can be predicted by one or more of the other predictors, implying that that predictor is redundant. If the VIF is high, there is a strong association between them, and hence that predictor can be removed.

Similarly, if you get a VIF of greater than 5 (just a heuristic), that feature can be eliminated as it has great collinearity with some of the other features and hence is redundant.

This process is repeated one variable at a time and the model is rebuilt again. Then, similar checks are made and any other insignificant or redundant variables are removed one by one, till you have only significant variables contributing to the model.

As you can see, this is a tedious process. Let us see an example before we go to automated feature elimination.

This example is for predicting house prices with 13 features. A heatmap of the features is shown here:

I start with 1 feature which seems highly correlated with the price i.e. '*Area'*.

When I create the linear regression model with just this variable, I get the summary as this:

(This is marked as **Step 1 **in the code provided in the Jupyter notebook later)

```
OLS Regression Results
===================================================================
Dep. Variable: price R-squared: 0.283
Model: OLS Adj. R-squared: 0.281
Method: Least Squares F-statistic: 149.6
Date: Mon, 25 May 2020 Prob (F-statistic): 3.15e-29
Time: 09:43:04 Log-Likelihood: 227.23
No. Observations: 381 AIC: -450.5
Df Residuals: 379 BIC: -442.6
Df Model: 1
Covariance Type: nonrobust
===================================================================
coef std err t P>|t| [0.025 0.975]
-------------------------------------------------------------------
const 0.1269 0.013 9.853 0.000 0.102 0.152
area 0.4622 0.038 12.232 0.000 0.388 0.536
===================================================================
Omnibus: 67.313 Durbin-Watson: 2.018
Prob(Omnibus): 0.000 Jarque-Bera (JB): 143.063
Skew: 0.925 Prob(JB): 8.59e-32
Kurtosis: 5.365 Cond. No. 5.99
===================================================================
```

The R-squared value obtained is 0.283. We should certainly improve the value. So we add the second most highly correlated variable, i.e. '*bathrooms'*. (This is **Step 2** in the code)

Then, the R-squared value improves to 0.480. Adding a third variable '*bedrooms*' (**Step 3**) improves it to 0.505. Then, I have added all the 13 variables (**Step 4)**, the R-squared changes to 0.681. So, clearly, not all variables are contributing in a big way.

Then, I use VIF to check the redundant variables. (code snippet here)

```
from statsmodels.stats.outliers_influence import variance_inflation_factor
vif = pd.DataFrame()
vif['Features'] = X_train.columns
vif['VIF'] = [variance_inflation_factor(X_train.values, i) for i in range(X_train.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif
```

The result I get is:

Clearly, we see that the '*bedrooms*' has a high VIF, implying that it is explainable by many other variables here. However, I also check the p-values.

In the p-values, I see that '*semi-furnished*' has a very high p-value of 0.938. I drop this and rebuild the model (**Step 5**).

When I check the p-values and the VIF again, I notice that there is one variable, '*bedrooms*', that has both a high VIF of 6.6 and a high p-value of 0.206. I choose to drop this then. (**Step 6**)

Finally, in **Step 7**, I note that all VIFs are below 5 but '*basement*' has a high p-value of 0.03. This is dropped and the model is rebuilt.

This leads us to a place where all remaining features show a significant p-value of < 0.05 and VIFs of < 5.

These remaining 10 features are taken as the selected features for model development.

Here is the Jupyter notebook showing all the steps explained above.

Now, let us see how can we improve with the help of Automated feature elimination.

There are multiple ways of automating the feature selection or elimination. Some of the often used methods are:

- Recursive Feature Elimination (RFE) - Top n features
- Forward, Backward or Stepwise selection - based on selection criteria like AIC, BIC
- Lasso Regularization

Where AIC is the __Akaike Information Criterion__ and BIC is the __Bayesian Information Criterion__ - different criteria that are used for model comparison.

We will theoretically look at each of these before I share a code based example for one of these methods.

This is where we give a criterion to select the top '**n**' features, where n is based on your experience of the domain. It could be the top 15 or 20, depending entirely on how many features you think truly influence your problem statement. This is clearly an arbitrary number.

Upon giving the features and the 'n' value to the RFE module, the algorithm goes back and forth with all the given features and then comes up with the **top n features** that have the maximum influence, i.e. are the most significant.
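The RFE flow just described can be sketched with scikit-learn. This is a minimal illustration on synthetic data: the `make_regression` dataset stands in for the housing features, and `n_features_to_select=10` mirrors the top-10 choice above.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the 13 housing features
X, y = make_regression(n_samples=200, n_features=13, n_informative=10, random_state=42)

lm = LinearRegression()
rfe = RFE(estimator=lm, n_features_to_select=10)  # keep the top 10 features
rfe.fit(X, y)

# support_ is a boolean mask; ranking_ is 1 for kept features, >1 for eliminated ones
selected = [i for i, keep in enumerate(rfe.support_) if keep]
print(selected)
print(list(rfe.ranking_))
```

With named columns (a DataFrame), the same `support_`/`ranking_` attributes drive the `list(zip(...))` inspection shown later in this article.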

Forward Selection is where you pick a variable and build a model. Then, based on a criterion like AIC, you keep adding one variable at a time until you see no further benefit in adding.

Backward Selection is where you start with all features and keep removing one variable at a time until the metric no longer improves.

Stepwise is where you keep adding or removing variables until you get a good subset of features that contribute to your metric.

In reality, Stepwise is the popular way though Backward and Stepwise seem to give very similar results. This is all automatically done by libraries that have implemented these already.
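The forward-selection loop can be sketched generically. Here the `score` callback is a placeholder assumption: in practice it would fit a model on the chosen features and return its AIC (lower is better); the `toy_score` below is purely illustrative.

```python
def forward_selection(candidates, score):
    """Greedy forward selection: starting from an empty set, repeatedly add
    the candidate whose inclusion lowers the score (e.g. AIC) the most,
    and stop as soon as no single addition improves it."""
    selected, remaining = [], list(candidates)
    best = score(selected)
    while remaining:
        # Score the effect of adding each remaining candidate
        trials = [(score(selected + [f]), f) for f in remaining]
        trial_best, feature = min(trials)
        if trial_best >= best:   # no addition helps any more, so stop
            break
        best = trial_best
        selected.append(feature)
        remaining.remove(feature)
    return selected

# Toy score standing in for AIC: only 'area' and 'bathrooms' reduce it,
# every other feature adds a complexity penalty.
useful = {'area', 'bathrooms'}
toy_score = lambda sel: 10 - 4 * len(set(sel) & useful) + len(set(sel) - useful)

print(forward_selection(['area', 'bedrooms', 'bathrooms'], toy_score))  # → ['area', 'bathrooms']
```

Backward selection is the mirror image: start with all candidates and remove the one whose exclusion improves the score most.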

This form of regularization drives the coefficients of the redundant features to zero. Regularization is a topic that can be looked at in-depth in another article.
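As a rough sketch of how Lasso zeroes out redundant coefficients, here is scikit-learn's `Lasso` on synthetic data; the data and the `alpha` value are illustrative assumptions, not taken from the housing example.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
# Feature 2 contributes nothing to the target, so it is redundant
y = 3.0 * X[:, 0] + 1.5 * X[:, 1]

lasso = Lasso(alpha=0.5)
lasso.fit(X, y)
print(lasso.coef_)  # the coefficient of the irrelevant feature is typically driven to exactly 0
```

Larger `alpha` values shrink more coefficients to zero; the surviving non-zero coefficients identify the selected features.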

Here is the Jupyter Notebook with the same house price prediction example with recursive feature elimination:

**A brief explanation here:**

I am using the **RFE** module provided by **SciKit Learn**. Hence I use the Linear Regression module from the same library too, as that is a prerequisite for RFE to work. Follow through from Step 1; all steps before that are preliminary data-preparation steps.

In Step 1, RFE() is passed the model already created as '*lm*' and *10*, to say I want the *top 10 features*. It immediately marks the top 10 features as rank 1.

This line helps us see which are the top 10 features:

`list(zip(X_train.columns,rfe.support_,rfe.ranking_))`

I take only these features to start building my model now.

The rest of the steps from Step 2 show how to use the Manual Elimination method after RFE, and this is called the **Balanced Approach**. What is this approach? I will explain this approach before we go back to understanding the code.

This is the most pragmatic approach that employs both types of feature elimination - a combination of Automated and Manual. When you have a large number of features, say 50, 100 or more, you use automated elimination to reduce the total number of features to the top 15 or 20 features and then you use manual elimination to further reduce it to select the truly important features.

The automated method helps in coarse tuning while the manual method helps in fine-tuning the features selected.

In the code, I used RFE to come to the top 10 features. Once I have got the top 10 features, I go back to building the Linear regression model, checking for p-values and VIF values and then deciding what more needs elimination.

For that, I build a Linear regression model using the **statsmodel** library as I can get the p-value from the summary provided by this model. (*I do not have this option in the LinearRegression module of SKLearn.)*

I see that the R-squared value is pretty good at 0.669. However, I see that '*bedrooms*' is still insignificant and hence drop that variable in Step 3.

Upon rebuilding that model, I see that there are no high p-values. I check VIF and notice all are below 5. Hence these 9 features are shortlisted as the final set of features for the model.

It is important to use only the features that contribute towards predicting the target variable and hence feature selection or elimination is important. There are many ways of doing it and Recursive feature elimination is one of the automated ways.

Manual feature elimination has been discussed to appreciate the concept of feature elimination, but in practical circumstances it will rarely be used. It is useful only if we have very few features.

There are more advanced techniques of feature elimination like Lasso regularization too.

- P-Value definition:
__https://www.investopedia.com/terms/p/p-value.asp__

The ability to run Challenger and Champion models together on all data is a very genuine need in Machine Learning, where model performance can drift over time and where you always want to keep improving the performance of your models.

So, before I delve deeper into this architecture, I would like to clarify some of the jargon I have used above. What is a Champion model? What is a Challenger model? What is model drift and why does it occur? Then, we can look at the rendezvous architecture itself and the problems it solves.

Once you put your model into production, assuming it will always perform well is a mistake. In fact, it is said - "**The moment you put a model into production it starts degrading**". (*Note, most often '**performance**' in ML is used to mean statistical performance - be it accuracy, precision, recall, sensitivity, specificity or whatever the appropriate metric is for your use case*).

Why does this happen? The model is trained on some past data. It performs excellently for any data with the same characteristics. However, as time progresses, the actual data characteristics can keep changing and the model is not aware of these changes at all. This causes model drift i.e. degradation in model performance.

For example, you trained a model to detect spam mail versus ham mail. The model performs well when deployed. Over time, the types of spam keep morphing and hence the accuracy of the prediction comes down. This is called **model drift**.

The model drift could happen because of a **concept drift** or a **data drift**. I am not getting into these today. It suffices to understand that the performance of a model does not remain constant. Hence we need to continuously monitor the performance of a model. Most often, it is best to retrain the model with fresher data more frequently, or based on a threshold level of performance degradation.
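As an illustrative sketch (not from any particular tooling), such a threshold-based monitor could track rolling accuracy against a deployment baseline and flag when retraining is due:

```python
from collections import deque

class DriftMonitor:
    """Track a rolling window of per-batch accuracy and flag when the model
    has degraded past a tolerance relative to its deployment baseline."""
    def __init__(self, baseline: float, window: int = 10, tolerance: float = 0.05):
        self.baseline = baseline
        self.tolerance = tolerance
        self.recent = deque(maxlen=window)  # only the latest `window` batches count

    def record(self, batch_accuracy: float) -> bool:
        """Record one batch's accuracy; return True when retraining should be triggered."""
        self.recent.append(batch_accuracy)
        rolling = sum(self.recent) / len(self.recent)
        return rolling < self.baseline - self.tolerance

monitor = DriftMonitor(baseline=0.90, window=3)
print(monitor.record(0.91))  # → False: still near the baseline
print(monitor.record(0.84))  # → False: rolling average 0.875 is within tolerance
print(monitor.record(0.78))  # → True: rolling accuracy has drifted below 0.85
```

The window smooths out single noisy batches, so one bad batch does not trigger an unnecessary retrain.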

Sometimes, even retraining the model does not improve the performance further. This would imply that you might have to understand the changes in the characteristics of the problem and go through the whole process of data analysis, feature creation and model building with more appropriate models.

This cycle can be shortened if you can work with Challenger models even while one model is currently in production. This is a continuous improvement process of Machine Learning and is very much required.

Typically, the model in production is called the **Champion** model. And any other model that seems to work well in your smaller trials and is ready for going into production is a **Challenger** model. These Challenger models have been proposed because we assume there is a chance that they perform better than the Champion model. But how do we prove it?

A Champion model typically runs on all the incoming data to provide the predictions. However, on what data does the Challenger model run?

There are two ways that the Challenger models can be tested. The ideal case would be to run the Challenger model in parallel with the Champion model on all data and compare the results. This would truly prove whether the Challenger model performs better. However, this seems prohibitive, especially in the big data world, and hence the Challenger is always trialled on a subset of the incoming data. Once it seems to perform well, it is gradually rolled out to more and more data, almost like alpha-beta testing.

As you might be aware, in alpha-beta testing a small percentage of users (or of incoming data, in this case) is sent through a new test or Challenger pipeline, and the rest all go through the original Champion pipeline. This kind of alpha-beta testing is good for some applications but clearly not very impressive in the world of machine learning. You are not comparing the models on the same data and hence can rarely say with confidence that one is better than the other for the whole data. There could be lurking surprises once you roll it out for all data, and the model drift can start sooner than expected.

A typical alpha-beta pipeline would look like this.

The data is split between the two pipelines based on some criteria like the category of a product. This data split keeps increasing towards Challenger as the confidence in the performance of the Challenger model grows.
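One hedged sketch of such a split: routing deterministically on a hash of a record key (a hypothetical `route` helper, not from any specific library) keeps the assignment stable, while letting you dial the Challenger's share up over time.

```python
import hashlib

def route(record_key: str, challenger_pct: int) -> str:
    """Deterministically route a record to 'challenger' or 'champion'.
    Hashing the key (e.g. a product id) keeps the assignment stable,
    and raising challenger_pct gradually widens the Challenger's share."""
    bucket = int(hashlib.sha256(record_key.encode()).hexdigest(), 16) % 100
    return 'challenger' if bucket < challenger_pct else 'champion'

# The same product always lands in the same pipeline for a given percentage
print(route('product-123', challenger_pct=10))
print(route('product-123', challenger_pct=10))
```

Because the bucket is a pure function of the key, a product's history stays within one pipeline, which keeps comparisons over time coherent.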

From a data scientist perspective, this is not ideal. The ideal would be to be able to run the Challenger model in parallel for **all the data** along with the Champion model. As I earlier said, this is very expensive.

Consider the worst-case scenario. If you want them to run in parallel, you have to set up two data pipelines that run through all the steps independently.

It would look something like this:

This has huge engineering implications and hence time to market implications too. The cost of this can get prohibitive over time.

A few of the top implications are the time and effort in building these pipelines over and over again without being sure if the Challenger model is indeed going to perform as expected. The CICD process, the deployments, the monitoring, authentication mechanisms etc. are a few to mention. In addition, the other cost is around the infrastructure that has to be doubly provisioned.

Considering if these pipelines are big data pipelines, it becomes all the more significant. Very soon you realise that this is not a scalable model. We certainly have to see how we can move away from parallel pipelines or even from the alpha-beta testing method.

As a corollary, the best-case scenario would be when we can reuse much of the data pipelines. The idea is to minimize the amount one has to develop and deploy again into production. This would also ensure optimization of infrastructure usage. This is one line of thinking about how to optimize.

Even better would be to be able to just **plug in the Challenger model** and the rest of the pipeline plays as if nothing has changed. Wouldn't that be fantastic? And this is what is made possible by the **Rendezvous architecture.**

The Rendezvous architecture as written in the book is tilted towards ML with smaller data. I have tweaked it to meet the needs of the big data world and associated pipelines as shown in the diagram below: *(References to the book and another article are given below in the references section)*

Let me now explain section by section of this architecture:

This consists of the standard data pipeline for receiving incoming data, cleansing it, preparing it and creating the required features. This should be just one pipeline for every model that is to be deployed. The prepared data should maintain a standard interface that has all the features that may be required in that domain irrespective of the model on hand. (*I understand this is not always possible and may need tweaking piecemeal over time. But we can deal with that piece in isolation when required*)

This is a messaging infrastructure, like Kafka, that brings in a sense of asynchronicity. The data that is prepared as features is published onto the message bus. Now, every model listens to this message bus and triggers off, executing itself with the prepared data. This message bus is what enables a plug-and-play architecture here.

This is the part where all models are deployed one by one. A new Challenger model can be deployed and made to listen to the message bus and as data flows in, it can execute. Any number of models can be deployed here and not just one Challenger model! Also, the infra requirement is only for the extra model to run. Neither the pre-model pipelines nor the post model pipelines need to be separately developed or deployed.

As you can see in the figure, you can have many challenger models as long as the data scientist sees them mature enough to be tested against real data.

Also, there is a special model called the **decoy model**. In order to ensure that each model process is not burdened with persistence, the prepared data is also read by the decoy model, whose only job is to read the prepared data and persist it. This helps for audit purposes, tracing and debugging when required.

All these models again output their predictions or scores into another message bus thus not bringing any dependency between themselves. Also, again this plays an important role in ensuring the pluggability of a model without disrupting anything else in the pipeline.

From there the rendezvous process picks up the scores and decides what needs to be done, as described in Part 5.

This is where the new concept of a **Rendezvous process** is introduced, with two important sub-processes. One sub-process takes care of streaming out the correct output from the pipeline for a consumer, from among the many scores it has received; the other persists the output from all models for further comparison and analysis.

So, we have achieved two things here:

- The best output is provided to the consumer
- All the data has gone through all the models, and hence the models are totally comparable on their performance, under like circumstances

How does it decide which model's output should be sent out? This can be based on multiple criteria like a subset of data should always be from Challenger and another subset should always be from Champion. This is almost like achieving the alpha-beta testing. However, the advantage here is that while it sounds like alpha-beta testing for a consumer, for the data scientist, all data has been through both the models and so they can compare the two outputs and understand which is performing better.

Another criterion could be that the output should be based on model performance. In this case, the rendezvous process waits for all models to complete and publish to the message bus. Then, it seeks the best performance metric and sends out that as the result.

Yes, another criterion can be that of time or latency. If we need to have the result in say less than 5 seconds, for example, the process waits for all the results from models, up to 5 seconds, compares only those and returns the best data. Even though another model comes back in the 6th second that may have performed much better, that is ignored as it does not meet the latency criteria.
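These two criteria (best metric, within a latency budget) can be sketched together. This is an illustrative toy, with the message bus modelled as an in-process queue and hypothetical field names:

```python
import time
import queue

def rendezvous(scores: "queue.Queue", expected_models: int, timeout_s: float = 5.0):
    """Collect model outputs from the bus until either every model has replied
    or the latency budget runs out, then return the result with the best
    (highest) performance metric. Each item is assumed to be a dict like
    {'model': ..., 'metric': ..., 'prediction': ...}."""
    deadline = time.monotonic() + timeout_s
    received = []
    while len(received) < expected_models:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # budget exhausted; late results are ignored
        try:
            received.append(scores.get(timeout=remaining))
        except queue.Empty:
            break
    return max(received, key=lambda r: r['metric']) if received else None

# Two models reply in time; the better metric wins
bus = queue.Queue()
bus.put({'model': 'champion', 'metric': 0.82, 'prediction': 41})
bus.put({'model': 'challenger', 'metric': 0.88, 'prediction': 43})
print(rendezvous(bus, expected_models=2, timeout_s=0.5)['model'])  # → challenger
```

A model replying after the deadline is simply never read within the budget, which matches the behaviour described above.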

But how does this process know what is the criteria to follow for which data or which model? This can be put in as part of the input data into the message bus in Part 2. Note that the Rendezvous process is also listening to these messages and gets to know what to do with the output that corresponds to an input. There could be other clever ways too but this is one of the ways proposed.

By introducing asynchronicity through message buses, a level of decoupling has been introduced, bringing in the ability to plug and play models into an otherwise rigid data pipeline.

By introducing the rendezvous process, the ability to select between various model outputs, persist them and compare them was introduced. With this, it no longer seems a herculean task to introduce or support any number of new models for the same data set.

The rendezvous architecture gives great flexibility at various levels.

- A variety of criteria can be used to decide what score, output or prediction is sent out of the prediction process. These criteria could be data-subset based, model-performance based or latency based.
- It provides the ability to define and change these criteria dynamically through the rendezvous process. You can take it to another level by introducing a rule engine here.
- It provides the ability to make all the data go through all the pipelines, or choose only a subset to go through many pipelines. For example, if you are forecasting for grocery and general merchandising products, groceries could go through their own Champion and Challenger models, while general merchandise, typically slow sellers, can have its own pipelines.
- It also provides the ability to run many models at a time without redeveloping or redeploying a large part of a big data pipeline. Apart from effort and time savings, the infrastructure costs are also optimised.

- __Machine Learning Logistics__
- __towardsdatascience.com__: "__Rendezvous Architecture for Data Science in Production__" by Jan Teichmann

**Big data architectures** provide the logical and physical capability to enable high-volume, large-variety and high-velocity data to be **ingested, processed, stored, managed** and **accessed**.

The marriage of these two opens up immense possibilities and large enterprises are already leveraging the benefits. To understand how to bring the two together, we would first need to understand them individually.

The Machine learning architecture is closely tied to the process of ML as described in my earlier article: "__Machine Learning Process - A Success Recipe__"

As a quick recap, a typical ML process would involve the steps depicted here:

It has a two-phased process of learning and predicting, the former feeding into the latter.

However, when the Machine Learning model has to be put into production, a few more aspects have to be taken care of, as shown here:

The aspects added are **cross-cutting concerns** and are shown by the four layers at the bottom of the diagram.

**Task Orchestration:** the ability to orchestrate tasks like feature engineering, model training and evaluation on computing infrastructures like AWS or Azure. Dependency management would be an important aspect here and most often it is non-trivial.

**Infrastructure:** Provisioning of infrastructure, and providing elasticity for it through options like containerization, is essential.

**Security:** An additional layer of security through authentication and authorization needs to be added.

**Monitoring:** Continuous monitoring of the infrastructure, the jobs and the performance are all non-trivial aspects to be taken care of in production.

The final aspect of providing **feedback about the statistical performance** of the model itself giving opportunities to auto-tune the model would be a great value add. (shown by the dotted line from adaptation to Data collection).

It goes without saying that the code written should follow best practices of modularity and __SOLID__ principles, leading to maintainability and extensibility.

All of this is good as long as the scale of data does not cross what can be handled by single large machines. In that realm, all of this would ideally be deployed as containerized applications or traditional n-tier architectures with their own data stores and processing capabilities, and would expose the models through APIs.

But the moment the scale of data crosses such a boundary, the only way to handle it is to use **distributed architectures**. The Big data stack provides one such ecosystem whose primary functioning is based on distributed computing and storage principles. Let us understand Big data architecture and its capabilities.

Let us now understand a typical application architecture on a big data platform. This would include building data lakes and the ability to serve their various customers, typically consisting of data analysts, business analysts and data engineers.

*And when ML and Big data come together, the customers include data scientists and ML engineers too. (which I will address in the next section)*

This is a generic architecture that should serve most enterprise data lakes - both from the perspective of building the lakes as well as serving data from the lakes for various use cases and stakeholders. No technology stack other than Hadoop is mentioned here, as each of the components has multiple options that should be evaluated based on the use cases of the organization.

This architecture has multiple elements - the ingress pipeline, the various data zones, the data processing pipelines, the streaming layer, the egress and the serving layers. Each of these components has to be well thought through to ensure they serve almost all the use cases of an organization.

**The Ingress pipeline:** All data coming into your data lake should come through a common mechanism, so that data governance, data lineage management and aspects of data security can all be centralised and governed well. This part can grow into an unmanageable nightmare if you allow multiple ETL (Extract, Transform, Load) or ELT (Extract, Load, Transform) pipelines.

**The Landing Zone:** All data that comes in and does not need near-real-time processing lands here and is maintained for a pre-defined duration in the original raw format, for audit and traceability purposes. Practices of regular clean up have to be put into place.

**Data Validation:** Here is where all the types of data validation are done. Where possible, you can validate the data by comparing with the source and where not possible, validate the data for its own semantics as described in detail in my article on "__Data Validation - During ingestion into data lake__". As there are no ready-made tools for this, building a framework will take you a long way.

**Data Lake:** Data lake is where you have data that is trustworthy, to be served to all of the consumers. However, this data is still in its original form, albeit clean. This data as it is, is very useful for deriving insights. Considering that this is a big data platform, you can allow years of historical data to grow. This is immensely valuable for an organization that believes "Data is the new gold". Data can be read from here directly but most often requires further transformations.

**De-normalized layer and Data Cubes**: As the data is huge, joining data and deriving insights becomes a highly expensive process. Hence, one of the best practices is to be able to create a de-normalized layer of data for each domain in the organization. Then, all the users of that domain can get what they are looking for without expensive processing over and over again. The denormalized layer is almost equivalent to the facade design pattern. While the sources of data may change, as long as the domain picks up from the denormalized layer, it is protected from the change in sources.

Also, if very similar aggregations are required repeatedly, building cubes of data with pre-aggregation could be a good idea. You could even introduce big data OLAP capabilities here so that it can be served to reporting tools more natively. Some of the big data OLAP tools have been discussed in my article "__Hadoop for Analysts__".

**Egression or Serving Layer:** Once the data is processed, transformed and available, you have to be able to serve this data through a serving layer. This could be providing APIs through various technologies. An API can serve data right out of the Hadoop platform or you could publish this data out of Hadoop. An egress framework here would ensure that data produced within the data lake can be made available for all types of consumers to consume in batches or even near real-time.

If all the above aspects are taken care of, you have a working architecture for building data lakes and using them on a big data platform.

Having understood both the architectures independently, we need to see how they can work together and allow for new possibilities.

Since Machine Learning is all about "**Learning from Data**" and since Big data platforms have data lakes consisting of all the data one can have, it is but logical that they come together and provide even more insights and even better predictions opening up opportunities to businesses as never seen before.

Have a look at the amalgamated architecture. All you have to do is extend your data pipelines to now support machine learning too.

Most of the architecture looks very similar to the big data architecture, right? And yes, that's the point. Just extending it a little, as shown by the red dotted lines, gives your machine learning models the power of a big data platform.

Let us focus on the pipeline starting with feature engineering up to predictions. Now you can use the data from the data lake and transform it into required features using the power of a distributed platform. The features can be stored in a feature repository that feeds into the models that are being trained.

The output of parametric models (like logistic regression) can be stored in a models repository and either egressed out or served through APIs. In the non-parametric models where the whole data is required (as in K-Nearest Neighbours kind of algorithms), you can deploy the algorithm code as part of the pipeline itself.

This shows that continuing to extend the existing data pipelines into algorithms and models is the only extra part to be done!

The rest of the aspects of production-ready machine learning algorithms, consisting of authentication, monitoring, task orchestration, and infrastructure provisioning are all available out of the box from the stack here. None of these is explicitly depicted in the diagram because it is taken for granted on this stack.

You no longer have to work with small data sets, only to find that when you deploy with larger data, the statistical performance has degraded! Power unto you, power unto the data scientists and ML engineers - with all of the data, the processing power and the large memory.

Doesn't this sound liberating? It is indeed, though there are a few challenges and nuances one has to understand to make this work for your organization.

Machine Learning in a containerised world itself is a very empowering paradigm. Culling unforeseen insights and predictions have become a reality with the ushering in of ML.

Big data platforms like Hadoop have got the parallel processing capability of a distributed architecture to every enterprise - big or small, with the help of affordable commodity hardware. Combining the two opens up new vistas for any organization.

However, tread carefully on how you set up the two aspects of ML and big data together. Skillsets needed for the same cannot be underestimated. Upfront architectural thinking is a must. Understanding your company's use cases and the risk appetite, you would have to do a series of POCs for finalizing your custom set-up. However, the above article should give you a jump start on that thinking.

You have various flavours of the same job description, called by different names. There are business analysts, data analysts, data scientists, data engineers, machine learning engineers, software engineers, and management consultants all vying for parts of the new pie.

Even as enterprises are creating new roles and job descriptions, I thought I must share my take on what these roles most often mean, as they stand in the industry now, along with the kind of skillset each should possess to be successful at the job.

This can guide you in pursuing your career line, or help recruiters understand the positions they are trying to fill. The industry will also benefit by standardising on these roles as we mature.

I have shown the skillset required by each in a coloured heatmap with the intensity depicting how deeply that skill is required by that role.

__How to read this graph? The darker the colour, the more important that skillset is for that role.__

You can see that there is a lot of overlap between many roles and hence the confusion in the minds of many, as to what skill they have to acquire for a role. Key points to note across the roles are:

- Almost all roles need to be aware of programming tools relevant to their roles. The actual stack varies from role to role, nevertheless. All except a management consultant need to be familiar with programming tools and that cannot be done away with.
- As you move from the left to the right, the skills required move downwards, indicating the shift from being client-facing to software development and operations management with the entire gamut of data skills in between.
- While this is how the skills are split in many companies currently, a new role in the offing is called the **Full Stack Data Scientist**, which spans many of these roles. That could be quite a coveted role but comes with its stretch and stress as well.

Now to look at the skillsets for each role:

The figure shows that a Management Consultant works closely with the clients and needs to have a complete understanding of the business. From a data perspective, he/she has to have a cursory appreciation and intuition about the associated data. They may not have to go too far beyond this.

This role is no different from typical management consultants and hence continues to remain the same, even in the new data world.

On the other hand, the data analyst may have a little less client interaction but has to be very comfortable with understanding the business, to derive insights. An analyst has to be able to do data cleaning, data manipulation, transformation and analysis.

Knowing statistics would help in doing better analysis. The most important aspect of the analysts' job is to derive insights through **visualization techniques** and be an extremely **good storyteller**. While communication is a key skill required for any role, the ability to weave a story out of the data is of paramount importance for this role. It is this ability to convey insights in a very compelling manner that helps in making data-driven business decisions. This is what can help in opening up hitherto unseen business opportunities.

Moving on to the data scientist, there is a whole host of skills required. Considering we are targeting the 'Data World', the majority of skillsets are required here. Understanding of business and **data intuition, data wrangling, statistics, calculus and linear algebra** are essential. With these as the basis, **Machine learning models** need to be developed. Basic data visualisation techniques are almost a prerequisite for getting a deep understanding of the data on hand.

Unless you plan to be a Full Stack Data Scientist, the best practices of software engineering, API development and DevOps or MLOps are not must-haves, though knowing a bit about them would be empowering.

From a role perspective, it is the data scientist who can come up with the right machine learning algorithms as solutions to certain problems. They can coerce the data to give the required insights, predictions etc. They typically do the model creation as well as define the metrics that measure the success of the problem they are solving. They also need to keenly monitor the statistical performance of the model and keep readjusting it through retraining or newer models.

A data engineer would essentially help in ingesting all types of data into a central data lake or some similar location and would help in productionising the ML Models or statistical models that a data scientist has produced.

Since we are in the big data world, knowing how to work with the **big data stack** is a must. **Data wrangling, data ingestion, data transformation, data validation** and overall ownership of data pipelines, and sometimes even the ML pipelines, may be part of this role.

It is a must to know **software engineering practices** and DevOps too, in this role.

Traditionally, they would have built the data warehouses of the organization. However, now, they must have the skillset to deal with much beyond as the current data lakes can have structured, semi-structured and unstructured data. They collate all the data for the enterprise, clean, validate, deduplicate, do common transformations of the data and define standard practices of data ingestion and data egression from the data lake. They understand the nuances of various file formats, the storage and processing demands, the cataloguing of data, governance of data and maintaining the data lineage.

In some companies they don the hat of an ML Engineer too, where they productionise the machine learning algorithms, set up the data and ML pipelines and manage and monitor the performance of the algorithms and the models.

Some companies separate the data engineer role from the ML engineer role, which narrows down the portfolio of both. In such a case, the ML engineer takes over where the data engineer stops, in terms of productionising the models.

Once all the required data is made available, the pipelines for data preparation specific to the ML model are done by an ML Engineer. From there on, continuing to **productionise the models** and making the models available for predictions, insights, forecasts, real-time business actions like fraud prevention etc. are all owned by the ML Engineer. The **continuous deployment pipelines**, the building of the **feedback** mechanism, and the **monitoring of the performance of the models** as time progresses are all automated here.

So, strong knowledge of **software engineering, programming tools, API development **and DevOps is a must. In fact, the specialised skill of DevOps for ML called **MLOps** is something that is needed here.

It is good to know about all of the data skills like statistics, calculus, linear algebra, data wrangling, as you can play a good collaborative role with data scientists as you productionise and monitor models. But these are not a must-have set of skills.

This is a traditional role that the industry is aware of, nothing specific to the data world. Hence none of the data skills is mandatory here. However, strong **software engineering** and **DevOps skills**, along with a wide range of programming tools and API plus app development skills, are needed here. This is a well-known domain and hence I am not elaborating much here.

It is easy for a software engineer to transition into an ML engineer role with a few additional skillsets.

Another way of representing the same, for ease of reading, is shown here:

There is a lot of overlap between roles but there are also unique expectations of each. A very simple representation of the overlaps between the roles is shown in the figure below:

The amount of overlap is not to scale but it is just indicative of the overlapping in skillsets. I have called out DevOps as an independent role here that can be a specialisation in itself but is very much required by other roles too, as shown above.

*Hope this gives you a bit more insight into the skillsets associated with roles and you are more empowered to choose the right career path or to choose the right candidate for your project.*

*Any thoughts and comments would be very welcome, to know how these roles pan out in your organisations.*

A word of warning: this assumes you have working knowledge of HBase and are dealing with improving its performance.

Your design considerations are very specific to each use case. Therefore it is imperative that you have answers to most of these questions on your requirements:

- Is your use case a write-heavy use case or a read-heavy one?
- Is it dealing with a continuous inflow of data in near-real-time?
- Is it very large data that is gathering up very quickly?
- Is the data storage to be cleared automatically based on some time?
- Are you looking for a millisecond read response? or a write response?
- Do you have bulk writes happening at some times with strict SLAs on the write latency?
- Do you have point queries or range queries?
- Are you aware of all the access paths to your data?

Know that "Everything in Software Architecture is a tradeoff" and there is no right or wrong. It just works or doesn't as per the requirements of your stakeholders. That is all.

Having all of the above data (requirements) will help you deal with:

- Schema Design
  - Row key design
  - Column family design
- Regions design
  - Partition sizes
  - Salting
  - Number of regions per server
- Block cache versus MemStore ratio
- HBase parameter fine-tuning
- Caching Strategies
- Compaction Strategies
- Data Locality
- Deleting or clearing data
- And some guidelines to watch out for

Each of these is elaborated in the rest of this article.

As part of the Schema Design, the two very important aspects that are to be decided upfront while creating the table are the row key and the column families. Let us see the design considerations for the same.

*(Columns themselves are dynamic and do not require upfront design thinking.)*

Your access paths or query paths are key in defining row keys, i.e. the columns you would filter by in your 'where' clause if you were to write a SQL statement are the columns that need to be in your row key.

Row key can be a single column or a composite of multiple columns. For example, if you fetch your data by *product id*, then that should be your row key. If you fetch your data by a *product id* and a *store id*, then the row key has to be a composite of these two columns.
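To make the composite-key idea concrete, here is a minimal sketch (the `product_id`/`store_id` values and the `|` separator are illustrative choices, not an HBase API):

```python
def composite_row_key(*parts, sep="|"):
    """Build a composite row key by joining the queried columns
    in the order they appear in the query path."""
    return sep.join(str(p) for p in parts)

# Fetching by product id only: single-column row key.
print(composite_row_key("P100"))         # P100
# Fetching by product id and store id: composite row key.
print(composite_row_key("P100", "S42"))  # P100|S42
```

The separator matters in practice: pick one that cannot occur inside the individual column values, or the key boundaries become ambiguous.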

The order of querying also determines the order of the columns in the row key.

Note that whatever you query by or filter by should be part of the row key. Else HBase has to scan all the regions of the table leading to a full table scan. That is extremely inefficient and hence should be avoided at all costs.

However, also note that if you create a row key that is a composite key of many columns, the purpose will be defeated and may lead to full table scans. You have to find a trade-off between the number of columns that form the row key and yet not scan the whole table.

**Your row key design should finally ensure that you do not end up scanning the whole table for any query of yours.**

If you run an **'Explain plan'** for your queries, you will see whether it is scanning the whole table - all the regions - or doing a skip scan or range scan. This is possible when you use a SQL engine like Apache Phoenix over HBase.

An example shown here:

```
#For this query:
SELECT PGDC, SUM(SFC) FROM LRF_PROMO_YRWKBPN_SPLIT WHERE (ID LIKE '201701%' OR ID LIKE '201702%' OR ID LIKE '201703%' OR ID LIKE '201704%') AND PGDC = 'H' GROUP BY PGDC;
```

The explain plan shows this:

```
| CLIENT 10-CHUNK 3946787 ROWS 3145728221 BYTES PARALLEL 10-WAY RANGE SCAN OVER DEV_RDF:LRF_PROMO_CVDIDX_YRWKBPN [0,'201701'] - [9,'201702'] |
| SERVER FILTER BY FA.PGDC = 'H' |
| SERVER AGGREGATE INTO DISTINCT ROWS BY [FA.PGDC] |
| CLIENT MERGE SORT |
```

This gives an insight that a range scan is happening over 10 chunks of data (out of the 64 regions it has) and the entire table is not being scanned.

*Note that HBase does not support secondary indexes unless you use a SQL engine like Apache Phoenix over it. The above example is using Apache Phoenix and secondary indexes.*

The read and write patterns determine the column family design.

Remember that all the data in one column family in a region is stored together in one HFile. So, if data from one column family is to be read, then the entire block of data in that HFile is loaded into memory, which includes all the columns in that column family. Hence store all the data that is retrieved together in the same column family. A similar argument holds good for updates or writes too.

However, this does not mean you have a huge number of column families separating columns for a huge number of use cases. Use as few column families as possible, typically restricting it to 2 or at the most 3. Having too many column families can cause many files to stay open per region and can trigger '**compaction storms**'. *(Compaction is a process where small files merge into a larger one.)*

Since flushing to hard disk and compactions are done on a per region basis, even if one column family carries the bulk of the data, it causes flushes on the adjacent column families too. Keep this too in mind when you are designing column families. This means you should not have one column family with say, hundreds of columns while another column family with just 5 to 10 columns. Then, when it is forced to flush the large column family from memory to hard disk, it flushes the small column families too.

By default, each table has only a single region. Suppose we have 4 tables, each with one region, and 4 region servers: each server will then hold 1 region, or rather one table per server. This is the default behaviour.

This does not help in performance as every read and write to a table goes to the same single server. This is not utilizing the power of distributed processing. Therefore, based on the table profile and the data size, you would want to decide how many regions should be there per table.

HBase itself starts to split a table into multiple regions after a default size of a region is reached. Depending on the version of HBase you are using, this default size could vary. It is defined by *hbase.hregion.max.filesize* property in the *hbase-site.xml.* For 0.90.x version a default is 256 MB and max recommended is 4 GB. In the HBase 1.x version, this has increased to 10 GB.

But relying on auto-sharding leads to performance degradation during the auto-splitting and rebalancing period.

Hence the best practice here is to decide the partitions upfront if you are aware of the range of the row key used. Create the region splits upfront during the table creation itself.

`create 'tableName', 'cfName', SPLITS => ['10', '20', '30', '40']`

This gives the boundaries on the key where the regions have to be split.

Recollect that regions are equivalent to horizontal partitions or shards in HBase. Just as partitioning should ensure that data is not skewed in a single partition or a few partitions, the distribution of data between the regions should be even. This leads to a lot of efficiencies.

In the *create* statement above, if there is very little data between *10 and 20* and a lot more between *20 and 30*, then change the split boundaries to make these two regions evenly distributed. Maybe something like *10, 25, 30* might turn out to be the right split.

If the regions are evenly spread, the IO is balanced between the servers. The writes into MemStore is balanced and the flushes to the disk are also balanced. Similarly, the reads are also spread out across the region servers thus using the cluster capacity effectively.
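As a hedged illustration of choosing even split boundaries, the toy function below (not an HBase API) picks split points at quantiles of a sorted sample of row keys, so that splits crowd together wherever the keys are dense:

```python
def split_points(sample_keys, num_regions):
    """Pick region split boundaries at quantiles of a sorted key sample,
    so each region holds roughly the same number of rows."""
    keys = sorted(sample_keys)
    step = len(keys) / num_regions
    return [keys[int(i * step)] for i in range(1, num_regions)]

# Skewed sample: few rows with keys 10-19, four times as many with 20-29.
sample = [f"{i:02d}" for i in range(10, 20)] + [f"{i:02d}" for i in range(20, 30)] * 4
print(split_points(sample, 4))  # boundaries cluster in the dense 20-29 range
```

Run against a representative sample of your real keys, this gives a reasonable first guess for the `SPLITS` list at table-creation time.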

Note that if the reads are distributed, cache evictions will be optimal too.

HBase sequential writes may suffer from region server hot-spotting if the row key is a monotonically increasing value. This can be overcome by salting. A detailed description of this solution is given __here__

If you are using Apache Phoenix, the salted table would be created like this

`CREATE TABLE table (a_key VARCHAR PRIMARY KEY, a_col VARCHAR) SALT_BUCKETS = 20;`

This ensures that the sequences are not stored in the same region server and hence hot-spotting is avoided.

For example, if the current date is part of the row key and all the data coming in is with the current date, only one region with this row key would get all the traffic. To avoid this, salt your row key and then it gets distributed to different regions.
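A minimal sketch of the idea behind salting (the hash below is a stand-in; Phoenix uses its own hash of the key modulo `SALT_BUCKETS` and prepends a single salt byte):

```python
def salted_key(row_key: str, buckets: int) -> str:
    """Prepend a deterministic hash-based salt to the row key,
    spreading monotonically increasing keys across `buckets` regions."""
    salt = sum(row_key.encode()) % buckets  # illustrative stand-in for a real hash
    return f"{salt:02d}-{row_key}"

# Monotonically increasing date-based keys land in different buckets.
keys = [f"2024010{i}" for i in range(5)]
print([salted_key(k, 4) for k in keys])
```

Because the salt is deterministic, a point lookup can recompute it and go straight to the right region; it is range scans that pay the price, as discussed next.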

However, salting has its own disadvantages when you want to retrieve data, especially a data range based on a row key.

If your table has no hot-spotting issues during write, then avoid salting altogether.

If you are trying to retrieve a contiguous range, without salting, you would have hit just one or a couple of regions and retrieved all the required data. This is because the data remains in a sorted order across regions and within regions. But with salting, the sorted order is not maintained across regions and you end up scanning all the regions for getting a range of row keys.

So, use salting only if necessary to avoid hot-spotting during writes. Hotspotting would become a problem only if you have a large inflow of continuous data into your servers, which beats the writing capacity of your server.

Keeping the regions sorted by row key will ensure it avoids unnecessary scans across regions for a specific range of data.

How many regions can you have in one region server? This depends on multiple factors: how big your regions are, how fast you are writing into them, whether it is a continuous flow of data or bulk data writes, how much you are reading back from the regions and what memory is available on your region server.

There are many heuristics around this, all of which need to be taken with a pinch of salt and need to be tried out for your specific use case.

If yours is a write-heavy use case with data flowing continuously, then you probably want to minimize the number of regions to one or a few per region server so that you are not filling your MemStore too soon and flushing to the disk too often.

However, if yours is a bulk write use case, you want to increase the parallelism with which you write. More regions or partitions increase this parallelism, which may sometimes dictate that you have more regions per server. To explain this with an example: if your HBase cluster is a 16-node cluster, and you want to write, say, 500 GB to 1 TB of data in less than 10 minutes, you would want high levels of parallelism in writing. If you go with 1 region per server, you will have 16 regions for this table and only 16 parallel writes are possible. To increase the parallelism of writing, you would make the table have 32 or 64 regions, leading to 2 or 4 regions per server. The write time with 64 regions is lower than the write time with 16 regions, in the case of bulk writes.

However, a word of caution on making the regions too small such that you have very little data in each region. This will cause the reads to underperform as many regions may have to be scanned to get the data of interest.

The common heuristic that I have read is about 100 to 200 regions per server, but practically what I have seen is that the performance is manageable up to 100 regions and beyond that, it starts deteriorating.

We have seen that HBase caches read data in Block cache and write data in MemStore. Since HBase uses the JVM heap, by default, 40% of the heap memory is allotted to block cache and 40% to memstore while the rest is used for its own execution purposes.

This can be fine-tuned to improve either your read or write performance.

**Read Heavy use case:** To get very good read performance, we all know caching is the key. The more the read cache the better it is for read performance.

You would ideally want all the data in a region to be available in the memory of the region server, so that the region servers do as few disk reads as possible. This can be achieved by increasing the block cache ratio to 60% or 70% knowing that you are compromising the write cache and hence the write performance.

**Write heavy use case:** This is like a corollary to the previous use case. If you increase the ratio of memstore, you would improve write performance at the cost of read performance.

For example, if your JVM has a heap memory of 32 GB, by default you have 12.8 GB as block cache and 12.8 GB as MemStore. So, if the data that is retrieved often is within this size, it will be cached and efficiently served. Any data that is not in cache will have to be read from the disk leading to slightly slower responses. This too can be mitigated to an extent by using SSD hard drives to quicken the read time from disks.
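The split arithmetic can be sketched as a toy calculation (the 40/40 ratios are HBase's defaults quoted above; this is not an HBase configuration API):

```python
def heap_split(heap_gb, block_cache_ratio=0.4, memstore_ratio=0.4):
    """Compute block cache and MemStore allocations from the JVM heap,
    using HBase's default 40/40 split (the rest is working memory)."""
    return heap_gb * block_cache_ratio, heap_gb * memstore_ratio

block_cache, memstore = heap_split(32)
print(block_cache, memstore)  # 12.8 12.8

# Read-heavy tuning: grow the block cache at the MemStore's expense.
print(heap_split(32, block_cache_ratio=0.6, memstore_ratio=0.2))
```

The two ratios plus HBase's own working memory must stay under the full heap, which is why raising one cache generally means lowering the other.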

However, if you have many regions per server each with 10 GB of data, then this cache is too small to make a difference. Then, you have the bucket cache that can help, which is discussed later.

There are a whole host of parameters that can be tuned and can be a deep subject in itself. However, I would like to call out a couple of them that have helped in my explorations.

- While creating a table, you could make it an IN_MEMORY table by setting this parameter to TRUE. Then, the cache gives importance to keep this table's data in memory, on a priority basis. This, however, should not be used if bucket cache is enabled.
- To warm the block cache, set the below parameter:

`PREFETCH_BLOCKS_ON_OPEN => 'true' `

The purpose is to warm the BlockCache as rapidly as possible after the cache is opened, using in-memory table data, and not counting the prefetching as cache misses. This is great for fast reads, but is not a good idea if the data to be preloaded cannot fit into the BlockCache.

We saw that by default, data gets cached in block cache for improving read performance. Generally, the data in one region can be about 10 GB. And you have many regions per region server. So, if you have 10 regions in one region server, that would be 100 GB of data. However, due to GC limitations of the JVM heap, most often you are able to give a max of 32 GB to the heap memory and hence only 12.8 GB to block cache.

In this case, the block cache would not be sufficient for 100 GB of data. To overcome this limitation, the bucket cache could be used. The off-heap implementation of bucket cache allows for using memory off-heap.

If the BucketCache is enabled, it stores data blocks off-heap, leaving the on-heap cache free for storing indexes and Bloom filters. The physical location of the BucketCache storage can be either in memory (off-heap) or in a file stored in a fast disk.

Hence if you are using servers with large RAMs as the nodes of your HBase cluster, you could utilize the RAM beyond 32 GB as part of the bucket cache and could get your entire data almost into memory. This would enhance your read performance significantly.

Monitoring your cache hits and cache misses through tools like Grafana could go a long way in fine-tuning the cache usage.

In an ideal scenario, having a cache almost as large as the data you have on disk would give you the best performance. This is in case all your data could be queried equally and there is no small subset that is queried more often.

HBase has concepts of minor and major compactions. There are multiple configurations that define when they get triggered. Typically the number of small files defines the trigger point of compaction.

`hbase.hstore.compactionThreshold `

gives the number of files that should trigger minor compaction.

Major compaction is slightly more involved and requires more resources. It not only merges all the small files into the large file but also does version cleaning, deleted data cleaning etc. and hence is very resource-intensive. You need to carefully ensure it triggers in the non-peak hours of your cluster usage.

Data is spread across region servers. The regions served by a region server (as assigned by HMaster) should ideally be located locally in that server. This is the best-case scenario and provides the best performance. The percentage of data local to that server is termed **data locality**.

Strategies must be used to ensure close to 100% data locality. Whenever a large chunk of data is written, this locality percentage can go down drastically. Hence doing major compaction after a bulk write is almost always necessary to improve data locality.

If data is flowing in all through the day, then the locality could be slowly deteriorating and it should be fixed through major compactions at appropriate intervals.

The extent to which data locality improves read performance cannot be overstated.

It is imperative to keep only the data that you are going to need in HBase. Trying to archive all data in HBase for historical reasons is not advisable. The Hadoop file system can be used for that purpose.

Two simple ways of ensuring that data does not keep on growing in HBase are:

- Setting the time-to-live (TTL) on your table or column family
- Setting the number of versions of a data you want to store

For setting TTL at a column family level, you could do it this way at HBase shell:

`alter 'tableName', NAME => 'cfname', TTL => 3000`

The TTL value is in seconds, after which the data will be deleted automatically. This ensures that the data is no more available for querying, thus cleaning up the table of old data that you do not want to keep.
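Conceptually, TTL-based cleanup behaves like the toy filter below (illustrative only; real HBase drops expired cells during reads and compactions rather than eagerly):

```python
import time

def live_cells(cells, ttl_seconds, now=None):
    """Keep only cells younger than the column family's TTL.
    `cells` is a list of (timestamp, value) pairs, timestamps in seconds."""
    now = time.time() if now is None else now
    return [(ts, v) for ts, v in cells if now - ts < ttl_seconds]

cells = [(1000, "old"), (3500, "recent")]
# With TTL 3000s and the clock at t=4000, the old cell has expired.
print(live_cells(cells, ttl_seconds=3000, now=4000))
```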

The second way is to set the number of versions of a data you want to keep. This does not delete the data fully but ensures that the underneath versions are not growing indefinitely, making the data size larger and larger. This just ensures older updates to the data are cleaned up.

By default HBase maintains 3 versions of updates to a data. The way you can set the number of versions at a column family level is

`create 'tableName', {NAME =>'cfname', VERSIONS => 4}`

This ensures only the latest 4 versions of the data are kept in the column family *cfname*.
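The version-trimming behaviour can be sketched like this (a toy model, not HBase internals):

```python
def keep_versions(versions, max_versions):
    """Retain only the newest `max_versions` entries of a cell.
    `versions` is a list of (timestamp, value) pairs."""
    return sorted(versions, key=lambda tv: tv[0], reverse=True)[:max_versions]

history = [(1, "a"), (4, "d"), (2, "b"), (3, "c"), (5, "e")]
print(keep_versions(history, 4))  # the four newest updates survive
```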

- HBase does not support joins. So, you have to denormalize and store data. Else, you will be forced to fetch the data into the client and join it there, which is certainly something to be avoided at all costs.
- Keep your column family names as short as possible as they are prefixed to every column and, over a huge amount of data, can add humongous size to your data, reducing read and write efficiencies. Similarly, keep the column names as short as possible, though nothing limits you from keeping long names.
- Column Families have to be defined upfront. They cannot be added later. If that is required, an entire process of data migration has to be planned.
- Use "query-first" schema design, i.e. focus on how data is read. Data that is accessed together should be stored together in a single column family, just to reiterate.

- Aim to have regions between 10 GB and 50 GB in size
- Aim to have cells no larger than 10 MB. Otherwise, store the data on HDFS and a pointer to that data in HBase
- Use Bloom filters to improve read performance. These could be enabled for row or row+column reads.
- Use compression like Snappy on the column families to improve your data storage and performance. Keep in mind that when the available cores are very low, compression could slow down data retrieval, as decompression is a compute-intensive operation.

This article covers just a few perspectives of best practices and guidelines. I am sure there are a whole host of other ways of fine-tuning and getting the best out of HBase. Would like to hear if any of you have any more that can be added to this.

Wishing you a great time squeezing performance out of HBase!!

*We all agree that software architecture is a sort of plan of a system that helps in understanding the initial decisions taken and the trade-offs made, and gives an idea about how the systems were intended to be used, thus equipping the user of the software to use it appropriately. It makes it easier to understand the whole system and makes the decision-making process efficient.*

With this as the background, let us understand HBase architecture which will certainly equip us to use HBase correctly and for the use cases it was meant to be used for.

One of the important aspects of a distributed database is to provide high availability through robust failover mechanisms. They are designed to run on commodity hardware. So failures are expected and the architecture takes care of dealing with these failures at both the storage and the compute levels, almost seamlessly.

HBase uses HDFS as the storage layer.

HDFS itself is a distributed file system with a default replication factor of 3, thus providing storage level resilience.

The HBase data is stored in what is termed as **HFiles** on HDFS. Further, HFiles are split into blocks. So, if one node or disk that is serving some data goes down, there are 2 other copies of the same data and they become available. Hence data availability is assured.

Let us now see how compute-availability is assured in the HBase architecture

A typical Cluster Architecture will have 2 Master servers called **HMasters**, in an active-passive configuration. At any point only 1 is active.

You have a cluster of 3 zookeepers that help in coordinating between the various servers and provide centralized configuration management. They also maintain the health of the cluster. The zookeepers keep a session-based connection with each region server, which act as the data nodes. The servers send heartbeats so that the zookeeper knows that they are all still running. The master keeps listening to the zookeeper notifications to know which worker nodes are healthy. If one of the worker nodes goes down, the master will assign its job to another worker node. This way the clients see an almost seamless failover.

How is the availability of the master itself achieved? The passive master keeps listening to the zookeeper notifications and if the active master goes down, it will take over and become the active HMaster. Thus the master availability is also ensured.

The master's responsibility is to manage schema changes, cluster management and data administration.

Now let us delve a little deeper into the concept of Regions, Region Servers and how they manage HFiles.

HBase has a concept of Regions served by Region servers. Let us understand this a little more. By now we know that HBase stores its data in the HFile format. All these are closely related and let us see how.

HBase distributes the data horizontally partitioning it into what are called **regions**. It implies that every HBase table has one or more regions.

The nodes in an HBase cluster are called **region servers**. The partitioned data or '**regions**' are distributed onto these region servers. Note that one region server can have many regions on it.

Each table has one or more regions depending on how many horizontal partitions have been created. *There are various ways to decide on the partition strategy. However, if none is defined, HBase auto-shards the table as the data grows.*

Each of the regions is accessible only through one region server and there might be many regions on each region server. One of the region servers is made 'in charge' of a region and all the read and write is directed only through that region server as shown in the figure.

Since it is horizontal partitioning, the splits are done based on sorted row keys. Each region has a start key and an end key to define the regions' limits. The rows within a region too are ordered by row keys. This is shown in the example figure below.

Also, all the data in one region is not in one HFile. There is one HFile created for every column family. Therefore, if there are three regions and two column families, you will end up having 3 * 2 = 6 HFiles on the disk.
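The region/column-family to HFile mapping can be sketched as:

```python
def hfile_layout(regions, column_families):
    """One HFile exists per (region, column family) pair."""
    return [(r, cf) for r in regions for cf in column_families]

files = hfile_layout(["region1", "region2", "region3"], ["Input", "Forecast"])
print(len(files))  # 3 regions x 2 column families = 6 HFiles
```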

All of the above theory can be understood better with the example in the figure above.

Here you have two regions, region 1 and 2. The table has a rowkey of *year*, *a unique ID, probably a product id and a date*. It has two column families named '*Input*' and '*Forecast*'.

Note that Region 1 has start key *2005...* and end key *2006...*, while Region 2 has start key *2007...* and end key *2008...* Everything in between these two keys belongs to that region. Within this region, the *Input* column family is stored as one HFile and the *Forecast* column family is stored as another HFile.

This is how data is distributed in HBase using HFiles, regions and region servers.

We know by now that rows in an HFile are sorted by rowkey. The HFile has an index (which is B+ tree-based). This index helps in the efficient retrieval of the rows.

These indexes are loaded into memory by region servers as shown here. So, when you read data, the region server scans the index in memory to find which block to read. Knowing which block to read, HBase directly accesses only that block on the disc and retrieves the required data.

Therefore, the lookup happens in memory and the disk reads are minimised to that extent. Within the block, the row keys are sorted hence easily retrieved (*probably with the time complexity of a binary search)*.

The above is the case of a first time read. However, if data has already been read once, HBase has the concept of a **block cache** where it caches the data that has already been read. We will understand this a little more after we look at the HBase writes as well, since we would also want to know how HBase reads modified/updated data.

Consider that you are now inserting a record into HBase. The Region server that is in charge of the row key that you are inserting has a write buffer in memory called MemStore.

So, it first writes into the MemStore. While it writes into MemStore, it is sorted and written. When MemStore fills up, it flushes that sorted data onto the disk. This flush happens very fast as the write to disk now is just sequential writes.

In the example given in the figure, the new row key is *0011*, which gets inserted between *0000* and *0123*, in memory. The sorting and insertion happens in the MemStore of that region. Only when Memstore is full, it flushes it to the disk. This is again the happy path of how HBase writes data.
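The sorted-buffer-then-flush behaviour can be sketched with a toy MemStore (illustrative only; the capacity here is a row count, whereas real HBase flushes on memory size):

```python
import bisect

class MemStore:
    """Toy write buffer: keeps rows sorted on insert and flushes
    a sorted run to 'disk' when full, as HBase's MemStore does."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.rows = []      # kept sorted by row key
        self.flushed = []   # sorted runs written sequentially to "disk"

    def put(self, row_key, value):
        bisect.insort(self.rows, (row_key, value))  # sorted insert in memory
        if len(self.rows) >= self.capacity:
            self.flushed.append(self.rows)          # fast sequential flush
            self.rows = []

m = MemStore(capacity=3)
for k in ["0123", "0000", "0011"]:
    m.put(k, "v")
print(m.flushed)  # one sorted run: 0000, 0011, 0123
```

Because each flushed run is already sorted, the disk write is purely sequential, which is why flushes are fast.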

Now once this data is inserted in memory, how does the read work? In the earlier section, we said it will read from disk based on the block that the data exists in. Now, this data is not yet flushed to the disk. How will HBase read this?

HBase has a read cache called the **Block Cache** where frequently read data is kept, in memory. We just saw that HBase has a write cache too called the **MemStore**. There is data on the disk too.

So, how does HBase ensure that the data is read from all of these sources: block cache, MemStore and hard disk? HBase reads the Block cache first, then the MemStore to see if any updates have happened, and finally the disk. If it finds the row key in the Block cache, it checks the MemStore to see if there are any updates. If yes, it merges the two and returns the data. If the row key is found in neither the block cache nor the MemStore, it looks up the data on the disk.
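This layered lookup can be sketched as a toy model (real HBase merges at cell granularity and consults Bloom filters and HFile indexes along the way):

```python
def read(row_key, block_cache, memstore, disk):
    """Toy HBase read path: check the block cache, overlay any newer
    update from the MemStore, and fall back to disk last."""
    if row_key in block_cache:
        base = block_cache[row_key]
        return memstore.get(row_key, base)  # a MemStore update wins over cache
    if row_key in memstore:
        return memstore[row_key]
    return disk.get(row_key)

cache, mem, disk = {"r1": "old"}, {"r1": "new", "r2": "m"}, {"r3": "d"}
print(read("r1", cache, mem, disk))  # cached value overlaid by MemStore
print(read("r3", cache, mem, disk))  # neither cache nor MemStore: disk read
```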

We saw that HBase flushes the data to disk every time the MemStore is full. So, small files get written to the disk. HBase has compaction processes that are triggered to merge these small files into large files at regular intervals, based on certain configuration parameters. These are called **Minor** and **Major compactions**.

The minor compactions merge all small files into one without modifying the main HFile of the region that already existed. The Major compaction merges this one compacted file from minor compaction with the larger HFile that already exists.

These compactions help in read performance immensely.

HBase actually has the concept of a **Write Ahead Log (WAL)** into which it writes even as it writes into the **MemStore**. A write is not complete till it writes to both these destinations. Since the WAL is sequential, it is pretty fast.

Therefore, in the case of a region server going down before the data is flushed to the disk, the data is reconstructed from the WAL file. When a new region server comes up to take its place (as elected by HMaster), it will check if the regions that it has to take care of, are all up to date with the WAL file and if not, update its state before making itself available to clients.
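The recovery idea can be sketched as a toy WAL replay (the sequence numbers and the `last_flushed_seq` cutoff are illustrative assumptions, not the real WAL format):

```python
def recover_memstore(wal, last_flushed_seq):
    """Rebuild the unflushed MemStore state by replaying WAL entries
    written after the last successful flush.
    Each WAL entry is a (sequence_number, row_key, value) triple."""
    memstore = {}
    for seq, row_key, value in wal:
        if seq > last_flushed_seq:       # only replay what never reached disk
            memstore[row_key] = value    # later entries overwrite earlier ones
    return memstore

wal = [(1, "r1", "a"), (2, "r2", "b"), (3, "r1", "c")]
# Everything up to sequence 1 made it to disk before the crash.
print(recover_memstore(wal, last_flushed_seq=1))
```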

This takes a little time, during which these regions become temporarily unavailable. This is why HBase is said to give more importance to **Consistency over Availability**. Once the newly elected Region Server is ready to serve, that region is consistent and available. *At this time, the rest of the regions and region servers all continue to serve the clients, and hence the impact is only on those regions that were served by the node that went down.*

Having understood the HBase architecture of Partitioning or Sharding, regions, region servers, HFiles and the read and write path of HBase, you are in a much better position to work with HBase and make the right decisions on HBase design for your Use Case.

We still need to see some best practices for HBase table design and certain heuristics that can act as some guidance if you are starting on HBase. This will be covered in the next article.
