When can you use Linear Regression?
It's been a while since my last post, as I was caught up with a couple of speaking engagements - one at a university for engineering students and another at a data conference. Both were very enriching experiences.
Coming back to the topic at hand: almost everyone who is serious about the Machine Learning journey starts by learning about linear regression :) Initially, it seems too simple to be of use for predictions. But as you learn more, you realise that it can be a solution for a good set of problems. However, can you use linear regression for any problem at hand? Or is there a set of constraints that you need to be aware of, so that you use it in the correct scenarios?
If you have read my articles so far on linear regression, and gone through the various concepts covered individually in each of them, you will have noticed that hardly any assumptions are mentioned about linear regression per se.
The only assumption, if at all, and a very implicit one, is that there must be a linear relationship between the target and the independent variables. That is the reason we are able to express that relationship as in the equation here:
Y = β0 + β1X1 + β2X2 + ... + βnXn + ε
where the Xs are the independent variables and Y is the dependent variable. The betas are the coefficients the model comes up with for the given data, and epsilon is the mean-zero error or residual term.
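To make the equation concrete, here is a minimal sketch of fitting it for a single X with ordinary least squares, using only the Python standard library. The data is made up for illustration (a true intercept of 3, a true slope of 2 and Gaussian noise are my assumptions), and `fit_line` is a hypothetical helper name, not something from a library.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for a single X: returns (beta0, beta1)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    beta1 = sxy / sxx                  # slope
    beta0 = y_mean - beta1 * x_mean    # intercept
    return beta0, beta1

# Made-up data (an assumption for illustration): Y = 3 + 2*X + noise
random.seed(0)
xs = [float(x) for x in range(1, 101)]
ys = [3.0 + 2.0 * x + random.gauss(0.0, 5.0) for x in xs]

beta0, beta1 = fit_line(xs, ys)

# Residual = actual Y minus the Y on the fitted line
residuals = [y - (beta0 + beta1 * x) for x, y in zip(xs, ys)]
mean_residual = sum(residuals) / len(residuals)

print(f"beta0 = {beta0:.2f}, beta1 = {beta1:.2f}")
print(f"mean residual = {mean_residual:.6f}")
```

The fitted betas should land close to the values used to generate the data, and the mean of the residuals comes out as practically zero, which is what fitting a line with an intercept by least squares gives you.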
However, this is not the only assumption that has to hold in the case of linear regression. There are other assumptions too, and they are what make the inferences of a model reliable. This is because we are still creating a model from a sample and then trying to use that model for a general population. This implies that we are uncertain about the characteristics of the larger population, and that uncertainty needs to be quantified. Hence we need to have a few assumptions about the data distribution itself.
If any of these assumptions does not hold for the data that you are working on, then the predictions and inferences from the model would be less reliable or even completely wrong.
There are 4 assumptions that need to hold good, including the one already stated. They are:
A linear relationship exists between Xs and Y
The error terms are normally distributed
The error terms have a constant variance (or standard deviation). This is known as homoscedasticity
The error terms are independent of each other
Clearly, there are no assumptions about the individual distributions of X and Y themselves. They do not have to be normal or Gaussian distributions at all.
Let us understand the assumptions. The first one is obvious.
What does the second assumption mean?
Error terms are normally distributed
When we are fitting a straight line for Y vs X, there can be a whole host of Y values for every X. However, the line gives only one value for each X, the one that best fits the data. The actual point may not be on the line, and that gap gives us the residual or error.
In the figure below, e65 is the error at x = 65 and e90 is the error at x = 90.
Therefore, the Y at x = 65 would be
Y = β0 + β1(65) + e65
that is, the value on the line at x = 65 plus the error e65.
This error itself can be anything. But considering that we want e (epsilon) to be a mean-zero error, we fit the line in such a way that the errors balance out, positive and negative, around the line. That is what would be deemed the best fit line.
A mean residual of zero is not, on its own, enough to make the errors normal. The assumption goes a step further: around the best fit line, the errors are assumed to follow a normal distribution centred on zero, so that small errors on either side are common and large errors are rare.
If you plot the residuals of your sample data as a histogram, this is the kind of graph you should get: a roughly bell-shaped curve centred on zero.
This is the second assumption
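If you cannot eyeball a plot, a rough numeric check of the shape is possible too. The sketch below is only an informal check under my own assumptions, not a formal normality test: it computes the skewness, the excess kurtosis and the share of residuals within one standard deviation, all of which have known values for a normal distribution (0, 0 and about 68%). The two samples are simulated, and `shape_summary` is a hypothetical helper name.

```python
import math
import random

def shape_summary(residuals):
    """Return (skewness, excess kurtosis, share within 1 standard deviation)."""
    n = len(residuals)
    mean = sum(residuals) / n
    sd = math.sqrt(sum((r - mean) ** 2 for r in residuals) / n)
    z = [(r - mean) / sd for r in residuals]           # standardised residuals
    skewness = sum(v ** 3 for v in z) / n              # about 0 for a normal shape
    excess_kurtosis = sum(v ** 4 for v in z) / n - 3.0 # about 0 for a normal shape
    within_1sd = sum(1 for v in z if abs(v) <= 1.0) / n  # about 0.68 for a normal shape
    return skewness, excess_kurtosis, within_1sd

random.seed(1)
# Simulated residuals (assumptions for illustration only)
normal_like = [random.gauss(0.0, 2.0) for _ in range(2000)]
skewed = [random.expovariate(1.0) - 1.0 for _ in range(2000)]  # mean near 0, but lopsided

normal_summary = shape_summary(normal_like)
skewed_summary = shape_summary(skewed)

print("normal-like:", [round(v, 2) for v in normal_summary])
print("skewed:     ", [round(v, 2) for v in skewed_summary])
```

Both samples have a mean close to zero, yet only the first has the bell shape. For a proper test you would normally reach for a statistics library, for example the Shapiro-Wilk test in SciPy (`scipy.stats.shapiro`).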
Error terms have a constant variance (Homoscedasticity)
This is the 3rd assumption, also known as homoscedasticity. The errors have a constant variance (sigma-squared) or a constant standard deviation (sigma) across the entire range of the X values.
This is to say that the error terms follow the same normal distribution (defined by its mean and its standard deviation or variance) throughout the data range.
See the patterns of the residual plots in the above figure. Plot (a) shows no specific pattern in the residuals, implying that the variance is constant. In such a case, linear regression is the right model to use. In other words, it means that all the possible relationships have been captured by the linear model and only the randomness is left behind.
In plot (b) you see that the variance is increasing as the samples progress, violating the assumption that the variance is constant. Linear regression is not suitable in this case. In other words, this means that the linear model has not been able to explain some pattern that is still evident in the data.
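The difference between the two kinds of plots can also be put into a number. This is a rough sketch under my own assumptions: it orders the residuals by X, splits them into a lower and an upper half and compares the two variances. It is a simplified comparison, not a formal test of heteroscedasticity, and `variance_ratio` is a hypothetical helper name. The residuals here are simulated directly.

```python
import random
import statistics

def variance_ratio(xs, residuals):
    """Variance of the residuals in the upper half of X divided by the lower half."""
    pairs = sorted(zip(xs, residuals))   # order the residuals by their X value
    half = len(pairs) // 2
    lower = [r for _, r in pairs[:half]]
    upper = [r for _, r in pairs[half:]]
    return statistics.pvariance(upper) / statistics.pvariance(lower)

random.seed(2)
xs = [float(x) for x in range(1, 401)]

# Simulated residuals (assumptions for illustration only)
constant_spread = [random.gauss(0.0, 5.0) for _ in xs]       # like plot (a)
growing_spread = [random.gauss(0.0, 0.05 * x) for x in xs]   # like plot (b)

ratio_constant = variance_ratio(xs, constant_spread)
ratio_growing = variance_ratio(xs, growing_spread)

print(f"constant spread ratio: {ratio_constant:.2f}")
print(f"growing spread ratio:  {ratio_growing:.2f}")
```

A ratio near 1 is what you would hope for with constant variance, while a ratio well above 1 points to the spread growing with X. Where exactly to draw the line is a judgment call in this sketch; formal tests provide a p-value instead.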
If the data is heteroscedastic, then it means that
Error terms are independent of each other
This is the 4th assumption: the error terms are not dependent on each other and show no pattern among themselves when plotted out. If there is a dependency, it would mean that you have not been able to capture the complete relationship between Xs and Y through a linear equation. There is some more pattern that is visible in the error.
A residual plot like this shows both that the variance is constant and that the error terms are independent of each other.
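One common way to put a number on the independence of errors is the Durbin-Watson statistic, which looks at how much each residual differs from the one before it. The sketch below is a minimal version written by hand, assuming the residuals are in their natural order (for example, the order in which the observations were collected). Both series are simulated for illustration.

```python
import random

def durbin_watson(residuals):
    """Sum of squared successive differences divided by the sum of squared residuals."""
    num = sum((residuals[t] - residuals[t - 1]) ** 2
              for t in range(1, len(residuals)))
    den = sum(r ** 2 for r in residuals)
    return num / den

random.seed(3)

# Independent errors (simulated, an assumption for illustration)
independent = [random.gauss(0.0, 1.0) for _ in range(1000)]

# Dependent errors: each one carries over 80% of the previous one
correlated = [0.0]
for _ in range(999):
    correlated.append(0.8 * correlated[-1] + random.gauss(0.0, 1.0))

dw_independent = durbin_watson(independent)
dw_correlated = durbin_watson(correlated)

print(f"independent errors: {dw_independent:.2f}")
print(f"correlated errors:  {dw_correlated:.2f}")
```

The statistic ranges from 0 to 4. A value near 2 suggests no first-order correlation between neighbouring errors, values towards 0 suggest positive correlation and values towards 4 suggest negative correlation.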
These assumptions need to be tested on the residuals of any linear model, usually plotted against its predicted values, if you want to ensure reliable inferences.
The meaning of these assumptions is - what is left behind (epsilon/error) that is not explained by the model is just white noise. No matter what values of Xs you put in, the error's variance (sigma-squared) remains the same. For this to be true, the errors should be normally distributed, have a constant variance and be independent of each other.
This also implies that the error terms are IID, or Independent and Identically Distributed, which is what makes the data suitable for linear regression.
In layman's terms, all these assumptions go to say that the dependent variable has a truly linear relationship with the independent variable(s) and hence is explainable with a linear model. We are ensuring that we are not force-fitting a linear model on something that is not linearly related, where the real relationship is probably exponential, logarithmic or explained by a higher-order equation.
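Here is a small sketch of what force-fitting looks like in the residuals. The data is made up and deliberately noise-free (Y is exactly the square of X, which is my assumption for illustration), and `fit_line` is a hypothetical helper doing ordinary least squares for a single X.

```python
def fit_line(xs, ys):
    """Ordinary least squares for a single X: returns (beta0, beta1)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    beta1 = sxy / sxx
    beta0 = y_mean - beta1 * x_mean
    return beta0, beta1

# Made-up, noise-free data with a higher-order relationship: Y = X squared
xs = [float(x) for x in range(0, 21)]
ys = [x ** 2 for x in xs]

beta0, beta1 = fit_line(xs, ys)
residuals = [y - (beta0 + beta1 * x) for x, y in zip(xs, ys)]

# Look at the residuals at the start, middle and end of the X range
first, middle, last = residuals[0], residuals[10], residuals[-1]
print(f"start: {first:.1f}, middle: {middle:.1f}, end: {last:.1f}")
```

The line still gets fitted without any complaint, but the residuals are positive at both ends and negative in the middle. That U-shaped pattern is the part of the relationship the straight line could not explain.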
Hence, you need to test for each of these assumptions when you build your linear models, so that you can use the inferences with confidence.