Linear Regression Through Code - Part 2
Last week, in Part 1, I walked through all the preliminary steps to be done before you can build a Linear Regression model. This week we will build the model itself.
A brief refresher on the problem I had taken you through: we have data from a Bike Sharing System in a particular city, and we want to use Multiple Linear Regression to make predictions and also derive insights into the factors that impact the number of bikes hired.
The complete data and code are available here: https://github.com/saigeethamn/DataScience-LinearRegression
With this in the background, get ready for a bit of a roller-coaster ride: there are many new aspects to understand and deal with while building a model. I have tried to explain them briefly here; covering them in detail would take many more posts, and I will certainly do that as I keep posting.
In this post, we will look at the process of selecting the most significant features using techniques like Recursive Feature Elimination (RFE) and concepts like the Variance Inflation Factor (VIF) and the p-value. We will then look at the model building itself - which, you will realise, is a fairly simple step, considering we will use Python libraries that do it in a jiffy.
Once we have the model, we need to validate that the assumptions of Linear Regression are satisfied. Only then is it valid to use Linear Regression for the data on hand. If it is indeed a valid model, we apply it to the test data and make predictions.
The beauty of Linear Regression is that it lends itself to explainability. What are the main features that impact the target variable and by what amount? This helps businesses to concentrate efforts on maximising or minimising those effects to improve bike hiring over time. Investments can be made in the right areas to grow the business.
Build the Model
The data on hand has 30 features. We want only those features that impact the target variable significantly, which requires us to eliminate the less significant ones. There are multiple ways to do this, and it is a topic in itself.
For today it will suffice to say that the various methodologies used to eliminate features are:
Manual feature elimination - tedious and not practical most often
  - Forward selection
  - Backward selection
  - Stepwise selection
Automatic feature selection
  - Recursive Feature Elimination (RFE) - an automated way to get the top 'n' features
  - Lasso regularization
Balanced approach - coarse-tune using automatic feature selection, then fine-tune using manual feature elimination
And I go with a balanced approach - a combination of RFE and backward selection to narrow down the features of interest.
There are a few other concepts to understand but I will only briefly mention them here.
The need to reduce the number of features arises for several reasons. Too many features can lead to:
Overfitting - this happens when the algorithm almost memorises the training data, so the model performs very well on the train data but very poorly on the test data
Multicollinearity - this means a variable is related to one or more of the other variables in the data. The features are not completely independent of each other, which leads to problems with interpretation and inference.
There are many ways to detect this and deal with both these issues.
Some methods that help in dealing with multicollinearity are:
Dropping variables - dropping those that are highly correlated and keeping only those that are valuable for interpretation to your business
Creating a new variable - one that encompasses the many others which are removed
Variable transformations using Principal component analysis
Here I go with dropping the highly correlated variables.
But how do we first identify the variables that are highly correlated and probably redundant?
This brings in the concept of the Variance Inflation Factor (VIF). VIF helps you find the variables that are highly correlated; once you find them, you can drop them. (VIF builds a model that tries to predict one feature using all the other features. If that feature can be predicted well by the others, it is highly correlated with them, which gives it a high VIF score and indicates it can be dropped.)
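To make the idea concrete, here is a minimal sketch of what VIF measures, assuming X is a pandas DataFrame of candidate features; the helper vif_for_feature is purely illustrative - in the actual code further down we use statsmodels' built-in variance_inflation_factor:
from sklearn.linear_model import LinearRegression

def vif_for_feature(X, feature):
    # Regress the chosen feature on all the remaining features
    others = X.drop(columns=[feature])
    r2 = LinearRegression().fit(others, X[feature]).score(others, X[feature])
    # A feature that the others predict well (high R-squared) gets a high VIF
    return 1.0 / (1.0 - r2)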
With much of the theory (though at a very high level) out of the way, let's get into the implementation.
Feature Elimination through RFE and Model Building
Here, we use the LinearRegression model from sklearn.linear_model and RFE from the sklearn.feature_selection module, assuming we already have our train data split into the variables X_train and y_train.
RFE helps us get the top n features, and here I am narrowing it down to the top 15.
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import RFE

# Fit a plain linear model, then let RFE rank the features and keep the top 15
lm = LinearRegression()
lm.fit(X_train, y_train)
rfe = RFE(lm, n_features_to_select=15)
rfe.fit(X_train, y_train)
The top 15 features get a rank of 1, and the remaining features get increasing rank numbers (2, 3, ...). You can list the columns and their ranking as shown here:
list(zip(X_train.columns,rfe.support_,rfe.ranking_))
The output looks like this:
[('holiday', True, 1),
('workingday', True, 1),
('temp', True, 1),
('atemp', False, 6),
('hum', True, 1),
('windspeed', True, 1),
('spring', True, 1),
('summer', True, 1),
('winter', True, 1),
('aug', False, 8),
('dec', False, 3),
('feb', False, 4),
('jan', True, 1),
('jul', True, 1),
('jun', False, 13),
('mar', False, 15),
('may', False, 7),
('nov', False, 2),
('oct', False, 14),
('sep', True, 1),
('monday', False, 9),
('saturday', True, 1),
('sunday', False, 5),
('thursday', False, 11),
('tuesday', False, 10),
('wednesday', False, 12),
('light', True, 1),
('mist', True, 1),
('2019', True, 1)]
Now you use only these 15 features to build your next linear model using OLS (Ordinary Least Squares) from the statsmodels.api module.
You use OLS from statsmodels (instead of LinearRegression from sklearn) because it gives you a good summary of many measures, as well as the p-values that you need in subsequent steps for further manual elimination of features.
import statsmodels.api as sm

col = X_train.columns[rfe.support_]
X_train_rfe = X_train[col]
X_train_lm = sm.add_constant(X_train_rfe)
lm_1 = sm.OLS(y_train, X_train_lm).fit()
print(lm_1.summary())
And with this, you have created the first iteration of your OLS model with 15 features. First, I filter out only the 15 columns of interest and pick just those from X_train. Then I explicitly add a constant with sm.add_constant, because statsmodels' OLS otherwise fits a line that passes through the origin.
Next, I call sm.OLS and fit it on the y_train and X_train data. This is the magical line that generates the coefficients of the hyperplane that best fits these 15 dimensions of data. And there, lm_1 is my first Linear Regression model.
Next, you can call for the summary of the model and check statistics like R-squared and Adjusted R-squared to understand how good a model you have got, maybe along with the AIC and BIC scores too. (I am not explaining these here; look forward to more posts on them.) The scores are decent, with R-squared being 0.848, which implies that about 85% of the variance in the target is explained by this model.
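If you prefer to pull these numbers out programmatically instead of reading them off the summary, the fitted statsmodels results object exposes them as attributes (a small sketch on lm_1):
# Key fit statistics straight from the fitted results object
print(lm_1.rsquared)        # R-squared
print(lm_1.rsquared_adj)    # Adjusted R-squared
print(lm_1.aic, lm_1.bic)   # AIC and BIC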
Apart from this, you also want to check whether all the features are contributing significantly to the model. This is the summary displayed:
OLS Regression Results
=================================================================
Dep. Variable: cnt R-squared: 0.848
Model: OLS Adj. R-squared: 0.844
Method: Least Squares F-statistic: 184.0
Date: Mon, 08 Jun 2020 Prob (F-statistic): 4.22e-191
Time: 15:35:43 Log-Likelihood: 519.65
No. Observations: 510 AIC: -1007.
Df Residuals: 494 BIC: -939.5
Df Model: 15
Covariance Type: nonrobust
=================================================================
coef std err t P>|t| [0.025 0.975]
-----------------------------------------------------------------
const 0.2519 0.036 6.969 0.000 0.181 0.323
holiday -0.0582 0.027 -2.164 0.031 -0.111 -0.005
workingday 0.0433 0.012 3.762 0.000 0.021 0.066
temp 0.5096 0.034 14.837 0.000 0.442 0.577
hum -0.1563 0.037 -4.188 0.000 -0.230 -0.083
windspeed -0.1863 0.025 -7.310 0.000 -0.236 -0.136
spring -0.0509 0.021 -2.464 0.014 -0.091 -0.010
summer 0.0508 0.015 3.423 0.001 0.022 0.080
winter 0.0933 0.017 5.403 0.000 0.059 0.127
jan -0.0345 0.017 -1.989 0.047 -0.069 -0.000
jul -0.0529 0.018 -2.931 0.004 -0.088 -0.017
sep 0.0814 0.016 4.945 0.000 0.049 0.114
saturday 0.0536 0.014 3.694 0.000 0.025 0.082
light -0.2475 0.026 -9.507 0.000 -0.299 -0.196
mist -0.0563 0.010 -5.439 0.000 -0.077 -0.036
2019 0.2305 0.008 28.795 0.000 0.215 0.246
=================================================================
Omnibus: 66.260 Durbin-Watson: 2.080
Prob(Omnibus): 0.000 Jarque-Bera (JB): 159.826
Skew: -0.678 Prob(JB): 1.97e-35
Kurtosis: 5.383 Cond. No. 22.1
=================================================================
If the p-value of a variable is higher than your chosen significance level (0.05 is the common choice; I use a stricter cutoff here), that variable is insignificant and can be dropped.
Here I notice that the p-value of 'jan' is 0.047 - by far the highest in the table - so it is the weakest contributor and the first candidate to drop.
You always check both the p-value and the VIF. Even if the p-value is low, a VIF greater than 10 tells you the variable is highly correlated with the other variables, and hence you drop it.
So, let's check the VIF values too:
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Calculate the VIFs for the model
vif = pd.DataFrame()
X = X_train_rfe
vif['Features'] = X.columns
vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by="VIF", ascending=False)
vif
This is the output:
Features VIF
3 hum 29.37
2 temp 17.78
1 workingday 5.31
4 windspeed 4.73
5 spring 4.53
7 winter 3.46
6 summer 2.85
13 mist 2.29
14 2019 2.09
11 saturday 1.98
8 jan 1.67
9 jul 1.59
10 sep 1.39
12 light 1.24
0 holiday 1.18
Note here that even though humidity ('hum') has a very high VIF, I would first be comfortable eliminating 'jan', as it is not contributing significantly to the model, with a p-value of 0.047.
Once I drop 'jan', I repeat the above two steps and check the p-values and VIFs again, as shown in the sketch below. You can follow the iterations in the full code. Over the next few iterations, I eliminate 'holiday', 'spring' and 'jul' as they show up as insignificant features. Later, 'hum' is eliminated as well: despite a low p-value, its high VIF indicates multicollinear behaviour.
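Each iteration looks roughly like this - a sketch for the first drop (lm_2 is an assumed name for the second model, in line with the lm_6 naming used later; the column dropped at each step is whatever the p-values and VIFs point to):
# Drop the weakest feature, re-add the constant, refit and re-inspect
X_train_rfe = X_train_rfe.drop(columns=['jan'])
X_train_lm = sm.add_constant(X_train_rfe)
lm_2 = sm.OLS(y_train, X_train_lm).fit()
print(lm_2.summary())
After refitting, you rerun the VIF calculation from above on the reduced X_train_rfe and decide what, if anything, goes next.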
With this, we arrive at 10 features, all of which are significant and not correlated with each other. This last model with the 10 variables is the final model we will go with; in the code, it is the 'lm_6' model.
Is it really THE model? To answer this, we need to verify by validating whether the model lives up to all the assumptions of linear regression.
Validate the Assumptions of Linear Regression
What are the assumptions of LR?
There is a Linear relationship between target and predictors/features
Error terms are Independent of each other
Error terms are Normally distributed
Error terms have constant/Equal variance (a.k.a. homoscedasticity)
Note the acronym L.I.N.E from the above for easy recollection of these assumptions.
We can do Residual Analysis to validate some of these assumptions about the errors.
Residual Analysis
You use the model to get the predicted y for the train data, and the difference between the actual and predicted values gives you the errors/residuals.
import matplotlib.pyplot as plt

# Predict on the train data with the final model and compute the residuals
y_train_cnt = lm_6.predict(X_train_lm)
residual = y_train - y_train_cnt

# Scatter plot of the residuals against their index
res_df = pd.DataFrame({'index': residual.index, 'res': residual.values})
plt.plot(res_df.index, res_df.res, 'bo')
plt.xlabel('Residual Index')
plt.ylabel('Residual')
plt.show()
You then plot the residuals and see that they are randomly scattered, with no inherent pattern. This shows that the residuals are independent of each other.
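As an extra check on independence, you can also compute the Durbin-Watson statistic directly (the same statistic appears in the OLS summary, around 2.09 for the final model); values close to 2 suggest there is little autocorrelation in the residuals. A minimal sketch:
from statsmodels.stats.stattools import durbin_watson

# Values near 2 indicate the residuals are not autocorrelated
print(durbin_watson(residual))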
Now you plot the distribution of the errors:
import seaborn as sns

# Plot the histogram of the error terms
fig = plt.figure()
residual = y_train - y_train_cnt
sns.distplot(residual, bins=20)
fig.suptitle('Error Terms', fontsize=20)
plt.xlabel('Errors', fontsize=18)
plt.show()
You can see from this plot that the errors are normally distributed. Thus the third assumption of Linear Regression is satisfied.
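If you want a second opinion on normality, a Q-Q plot of the residuals is a common companion check (a small additional sketch using statsmodels; points hugging the 45-degree line indicate approximately normal errors):
# Q-Q plot of the residuals against a fitted normal distribution
sm.qqplot(residual, line='45', fit=True)
plt.show()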
These are the two assumptions that are easily verifiable, and they will do for today. The assumption of linearity shows up when the model predicts well on unseen data.
Having verified the assumptions, we move on to making predictions on the test data.
Predictions for Test Data
Assuming you have X_test and y_test split and ready, you scale the data and then jump into the prediction right away.
# Scale the numeric test columns with the scaler fitted on the train data (Part 1)
bikes_test[num_vars] = scaler.transform(bikes_test[num_vars])
# Creating X_test_new dataframe by keeping only the final model's columns
X_test_new = X_test[X_train_new.columns]
# Adding a constant variable
X_test_new = sm.add_constant(X_test_new)
# Making predictions
y_pred = lm_6.predict(X_test_new)
Voila, y_pred is the predicted output for the X_test data. Note that the same scaler fitted on the train data is used for scaling the test data, and the same linear model from our train set is used for prediction. Now, on to model evaluation.
Model Evaluation
One of the easiest metrics is the R-squared score; let's compute it:
from sklearn.metrics import r2_score
r2_score(y_test, y_pred)
We get a score of 0.7961390438459766 - close to 80%. The train data gave an R-squared of about 83% and the test data about 80%, which seems a reasonable model to go with. Done! We have our model ready.
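R-squared alone does not tell you the typical size of the prediction errors, so it can also help to look at a metric like RMSE. A minimal sketch (note that the target is still on the scaled range used during training, so the number is in scaled units):
import numpy as np
from sklearn.metrics import mean_squared_error

# Root mean squared error of the test predictions (in scaled target units)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(rmse)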
This would be enough if we are looking only to use the model for predictions. This model could be deployed for predictions on any new data.
Inference & Recommendations
However, if you recollect, I mentioned early on that the beauty of MLR is that it is interpretable and explainable, and can provide insights into what affects the predicted values. We can do this by looking at the coefficients of all the significant variables.
Let's have a look at the summary of the model:
OLS Regression Results
================================================================
Dep. Variable: cnt R-squared: 0.835
Model: OLS Adj. R-squared: 0.832
Method: Least Squares F-statistic: 253.0
Date: Mon, 08 Jun 2020 Prob (F-statistic):3.13e-188
Time: 15:35:45 Log-Likelihood: 498.79
No. Observations: 510 AIC: -975.6
Df Residuals: 499 BIC: -929.0
Df Model: 10
Covariance Type: nonrobust
================================================================
coef std err t P>|t| [0.025 0.975]
----------------------------------------------------------------
const 0.0750 0.019 4.031 0.000 0.038 0.112
workingday 0.0561 0.011 5.024 0.000 0.034 0.078
temp 0.5499 0.020 27.861 0.000 0.511 0.589
windspeed -0.1552 0.025 -6.195 0.000 -0.204 -0.106
summer 0.0886 0.010 8.608 0.000 0.068 0.109
winter 0.1307 0.010 12.600 0.000 0.110 0.151
sep 0.0974 0.016 6.184 0.000 0.066 0.128
saturday 0.0675 0.014 4.693 0.000 0.039 0.096
light -0.2871 0.025 -11.611 0.000 -0.336 -0.239
mist -0.0800 0.009 -9.143 0.000 -0.097 -0.063
2019 0.2331 0.008 28.370 0.000 0.217 0.249
================================================================
Omnibus: 68.639 Durbin-Watson: 2.089
Prob(Omnibus): 0.000 Jarque-Bera (JB): 151.839
Skew: -0.731 Prob(JB): 1.07e-33
Kurtosis: 5.238 Cond. No. 11.6
================================================================
Look at the coef column. Coefficients with large values, whether positive or negative, are the strong influencers. You can see that temperature has the highest coefficient, 0.55, and it is positive - this is the strongest influencer. It implies that for every unit increase in temperature, the count of bikes hired increases by 0.55 units.
You can also see that windspeed, light and mist have a negative influence on the bike counts. For every unit increase in each of these, the counts decrease by about 0.16, 0.29 and 0.08 units respectively.
The next strongest positive variable is the year indicator for 2019. So, year on year, the counts have increased by 0.23 units; if all other things remained constant, the next year the business would increase by another 0.23 units.
A working day causes a count increase, and so do summer and winter. Thus all 10 variables can be interpreted for their contribution to the bike count.
So, the equation would be something like:
cnt = 0.0750 + (0.0561 * workingday) + (0.5499 * temp) + (-0.1552 * windspeed) + (0.0886 * summer) + (0.1307 * winter) + (0.0974 * sep) + (0.0675 * saturday) + (-0.2871 * light) + (-0.0800 * mist) + (0.2331 * yr)
So, given a new set of data, the above equation is used to make the prediction. Now it looks logical and feasible, doesn't it? The model that gives this best-fit line equation is the MLR model we were looking for.
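To convince yourself that the equation and the model are one and the same, you can reconstruct a prediction by hand from the fitted coefficients. A small sketch: lm_6.params holds the intercept and the 10 coefficients, indexed by the same names as the columns of X_test_new, so an element-wise multiply and sum reproduces the model's prediction.
# Manual prediction for the first test row: intercept + sum(coefficient * feature value)
row = X_test_new.iloc[0]
manual_pred = (lm_6.params * row).sum()
print(manual_pred, y_pred.iloc[0])  # the two numbers should agree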
Not only can we predict, we also get a sense of what affects the business. From this, we see that external conditions like temperature and season have a major impact. Keeping that in mind, we can make the most of favourable days by ensuring good availability of bikes, or give incentives to increase the usage of bikes on misty, cold days and other not-so-favourable weather conditions. Many such insights can be obtained, and actions can be taken to improve the business.
Does it sound all that magical anymore? Not really. Once you know the trick, it is no longer magic, it is just logic!! You are now all the more empowered with data to make the right decisions, apart from the fact that you can predict your future sales!!
First Machine learning algorithm - done. Welcome on board to the ML world!!