Updated: Jul 15, 2021
Multicollinearity is a concept relevant to all the input data that is used in a Machine learning Algorithm. This has to be understood well, to prepare the input data for many Machine Learning Algorithms.
Let us understand what it means, why should we be interested in it and what action do we need to take about it.
What is MultiCollinearity?
Starting from first principles, Wikipedia says "Collinearity is a linear association between two explanatory variables. Two variables are said to be perfectly collinear if there is an exact linear relationship between them"
Extending that, multicollinearity is said to exist between two or more independent variables when they are linearly related i.e. they are not completely independent. This implies that one of the independent variables can be predicted to some extent by another independent variable.
To visually understand this, let us look at how the linear relationship between two variables can be seen in a scatter plot:
Here you can see we have three variables (Views_show, Ad_impression, Views_platform) that are used to determine the popularity of a show. The number of views of the show (Views_show) are somewhat linearly related to the number of ads that have been clicked through (Ad_impression) and the number of views per platform (Views_platform).
In fact, if you plot a heatmap you will notice that the Views_show has a linear correlation coefficient of 0.79 with Ad_impression and 0.6 with Views_platform. In contrast, it has hardly any linear relationship with day (-0.039)
Hence we say there is collinearity between Views_shows and Ad-impression. Similarly between Views_show and Views_platform.
And since one variable has some linear relationship between more than one other variable, we call it MultiCollinearity.
Implications of MultiCollinearity & Does it need to be fixed?
When the independent variables are actually dependent on each other, it becomes very difficult to predict which has what kind of impact on the output variable. Let us understand what this statement means.
Say variables x and y are the same. Then, 3x + 7y can be taken as 10x or 10y, it does not matter. The equation can be, say, z = 3x +7y or z = 2x +8y, z value will remain the same. So, it becomes very difficult to attribute the right coefficient to x and y. You can play with a whole host of coefficients 3,7; 2,8; 1,9; 4,6; 5,5 and so on and z will remain the same.
Hence the coefficients of an equation can vary wildly if there are dependent variables within the equation. How does that matter? It matters when you want to use the strength of the coefficients to interpret the predictor equation.
Typically in linear regression, if you have an equation like this:
z = 3x +7y
(example can be sales = 3 * Newspaper ad spend + 7 * TV ad spend)
It means that z(sales) increases 3 times with every unit of x (spend on newspaper Ads) if y (spend on TV ads) remains constant. Similarly, z(sales) increases 7 times for every unit change in y (spend on TV ads), when x(spend on Newspaper Ads) remains constant. Now, this interpretability is lost if these coefficients are not stable and start varying wildly. You just cannot interpret the influence of x and y on z and may end up spending more of your ad spend on the wrong channel.
Therefore, whenever you want reliable inferences - which means understanding the strength of a variable based on its coefficients, you would have to remove inter-dependent predictor variables and keep only truly independent predictors.
Note that the z does not change and hence the predictions themselves do not have an impact if there are multicollinear variables. But it has a major impact on the coefficients and hence the inference drawn.
So multicollinearity needs to be removed only if you are looking for interpretability of the findings. If you are looking only for precise predictions, then there is no necessity to detect multicollinearity or take any action.
How do you Detect MultiCollinearity?
Collinearity between variables can be visualised using scatter plots, just as shown in the example above. But in case we have many variables and we want to find whether there is a relationship between every variable and every other variable, there should be a better way of detecting this relationship. The combinations can turn out to be overwhelming if we have to find out through visual plots or heatmaps.
One of the ways is to calculate the Variance Inflation Factor (VIF) for all predictors. What VIF does is that it tries to predict one of the predictors using the rest of the predictors. This is repeated for all the variables available.
So, the concept used here is that if one of the predictors can be predicted by one or more of the other predictors there is a strong association between them. Hence that predictor can be removed.
The formula for VIF is
Where R is the linear correlation coefficient or known as the goodness-of-fit measure in linear regression models. It basically measures the strength of the relationship between the model and the dependent variables on a 0 to 100% scale. So, every ith variable is taken and seen if it can be predicted using the rest of the predictors. If the VIF value is large, it means that the variable has a large correlation with the rest of the variables. So, now that you may have detected the existence of Multicollinearity, what do you do next?
What do you do if you detect MultiCollinearity?
Once we know that one predictor is correlated to many others, it may not add any value to the model itself. So, it would be good to remove that variable.
The heuristic typically followed here is that if the VIF value is great than 10, it means it has a huge correlation with other variables and can be certainly dropped. If it is greater than 5, you examine and decide with the help of other stats like p-value and then decide to drop.
Some consider anything above 2 is not acceptable and needs to be dropped, though this is considered too stringent a criteria.
This is an iterative process where you calculate VIF repeatedly, every time you drop a variable, as the VIF of the rest of the variables could change considerably after you drop one variable.
This iterative process can help you remove all MultiCollinear variables and that should make the coefficients of the equation stable.
MultiCollinearity is of concern when we want to be able to draw inferences. it does not impact the precision or accuracy of the predictions itself. Only you are looking for interpretability, you must eliminate multicollinearity by checking the VIF factors for each of the predictors iteratively and dropping those that have a high VIF value.