• Sai Geetha M N

Linear Regression Through Code - Part 2

Last week, in Part 1, I walked through all the preliminary steps to be done before you can build a Linear Regression model. This week we will build the model itself.


A brief refresher on the problem from Part 1: we have data from a bike-sharing system in a particular city, and we want to use Multiple Linear Regression to make predictions and to derive insights into the factors that impact the number of bikes hired.


The complete data and code are available here: https://github.com/saigeethamn/DataScience-LinearRegression


With this background, get ready for a bit of a roller-coaster ride, as there are many new concepts to understand and deal with while building a model. I have explained them briefly here; covering them in detail will take many more posts, which I will certainly write as we go along.


In this post, we will look at the process of selecting the most significant features using techniques like Recursive Feature Elimination (RFE) and concepts like the Variance Inflation Factor (VIF) and the p-value. We will then look at the model building itself - which, you will realise, is a fairly simple step, considering we will use Python libraries that do it in a jiffy.


Once we have the model, we need to validate that the assumptions of Linear Regression are satisfied by the model. Only then is it valid to use Linear Regression for the data on hand. If it is indeed a valid model, we then apply it to the test data and make predictions.


The beauty of Linear Regression is that it lends itself to explainability: which features impact the target variable, and by how much? This helps businesses concentrate their efforts on maximising or minimising those effects to improve bike hiring over time. Investments can then be made in the right areas to grow the business.


Build the Model

The data on hand has 30 features. We want only those features that significantly impact the target variable, which requires us to eliminate the less significant ones. There are multiple ways to do this, and it is a topic in itself.

For today it will suffice to say that the various methodologies used to eliminate features are:

  • Manual feature elimination - tedious and often not practical

  • Forward selection

  • Backward selection

  • Stepwise selection

  • Automatic feature selection

  • Recursive feature elimination - an automated way to get top 'n' features

  • Lasso regularization

  • Balanced approach - coarse-tune using automatic feature selection, then fine-tune using manual feature elimination
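To make forward/backward selection concrete, here is a minimal sketch using sklearn's SequentialFeatureSelector on made-up toy data (the feature setup is purely hypothetical, not from the bike-sharing dataset):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import SequentialFeatureSelector

# toy data: y depends only on the first two of four features
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 3 * X[:, 0] + 2 * X[:, 1] + 0.1 * rng.normal(size=200)

# greedy forward selection: start empty, repeatedly add the feature
# that improves cross-validated fit the most
sfs = SequentialFeatureSelector(LinearRegression(),
                                n_features_to_select=2,
                                direction='forward')  # or 'backward'
sfs.fit(X, y)
print(sfs.get_support())  # boolean mask over the four columns
```

Switching direction='backward' starts from all features and removes the least useful one at each step, which is the other half of the stepwise family listed above.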

I go with the balanced approach - a combination of RFE and backward elimination to narrow down the features of interest.


There are a few other concepts to understand but I will only briefly mention them here.

The need to reduce the number of features arises for several reasons. Too many features can lead to:

  1. Overfitting - this happens when the algorithm almost memorises the training data, so the model performs very well on the train data but very poorly on the test data

  2. Multicollinearity - this means a variable is related to one or more of the other variables in the data. The features are not completely independent of each other, which leads to problems with interpretation and inference.
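Overfitting is easy to demonstrate. In this toy sketch (entirely made-up data, not the bike-sharing set), a linear model fitted on 60 samples with 50 pure-noise features looks excellent on the training data and collapses on unseen data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# 50 pure-noise features, only 60 samples: a recipe for overfitting
rng = np.random.default_rng(0)
X_tr, y_tr = rng.normal(size=(60, 50)), rng.normal(size=60)
X_te, y_te = rng.normal(size=(60, 50)), rng.normal(size=60)

lr = LinearRegression().fit(X_tr, y_tr)
# train R^2 looks great because the model memorises the noise;
# test R^2 collapses (it can even go negative)
print(lr.score(X_tr, y_tr), lr.score(X_te, y_te))
```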

There are many ways to detect and deal with both of these issues.

Some methods that help in dealing with multicollinearity are:

  • Dropping variables - drop those that are highly correlated and keep only those that are valuable for interpretation in your business

  • Creating a new variable - one that encompasses several others, which are then removed

  • Variable transformations using Principal Component Analysis (PCA)

Here I go with dropping the highly correlated variables.

But how do we first understand which are the variables that are highly correlated and probably redundant?

This brings in the concept of the Variance Inflation Factor (VIF). VIF helps identify the variables that are highly correlated; once you find them, you can drop them. (VIF tries to build a model predicting one feature using all the other features. If that feature can be predicted well by the others, it is highly correlated with them; this yields a high VIF score, indicating it can be dropped.)
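That auxiliary-regression idea can be written out by hand. A minimal numpy sketch on toy data (the 1/(1 - R²) formula below is exactly what VIF computes; the data here is invented for illustration):

```python
import numpy as np

def vif_by_hand(X, i):
    """VIF for column i: regress it on the remaining columns and
    return 1 / (1 - R^2) of that auxiliary regression."""
    y = X[:, i]
    others = np.delete(X, i, axis=1)
    # add an intercept column to the auxiliary regression
    A = np.column_stack([np.ones(len(others)), others])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2 = 1 - resid.var() / y.var()
    return 1.0 / (1.0 - r2)

# toy data: x2 is almost a copy of x0, so both get high VIFs
rng = np.random.default_rng(0)
x0 = rng.normal(size=200)
x1 = rng.normal(size=200)
x2 = x0 + 0.05 * rng.normal(size=200)
X = np.column_stack([x0, x1, x2])
print(vif_by_hand(X, 0))  # large: x0 is nearly predictable from x2
print(vif_by_hand(X, 1))  # near 1: x1 is independent of the rest
```

A common rule of thumb is that a VIF above 10 signals serious multicollinearity, which is the cutoff used later in this post.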


With much of the theory (though at a very high level) out of the way, let's get into the implementation.


Feature Elimination through RFE And Model Building


Here, we use the LinearRegression model from sklearn.linear_model and RFE from the sklearn.feature_selection module, assuming we already have our training data split into X_train and y_train.


RFE helps us get the top 'n' features, and here I am narrowing it down to the top 15 features.

from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import RFE

lm = LinearRegression()
lm.fit(X_train, y_train)
rfe = RFE(lm, n_features_to_select=15)
rfe.fit(X_train, y_train)

The top 15 features get a rank of 1, and the remaining features are ranked 2 onwards in the order they would be eliminated. You can list the columns along with their selection status and ranking as shown here:

list(zip(X_train.columns,rfe.support_,rfe.ranking_))

The output looks like this:

[('holiday', True, 1),
 ('workingday', True, 1),
 ('temp', True, 1),
 ('atemp', False, 6),
 ('hum', True, 1),
 ('windspeed', True, 1),
 ('spring', True, 1),
 ('summer', True, 1),
 ('winter', True, 1),
 ('aug', False, 8),
 ('dec', False, 3),
 ('feb', False, 4),
 ('jan', True, 1),
 ('jul', True, 1),
 ('jun', False, 13),
 ('mar', False, 15),
 ('may', False, 7),
 ('nov', False, 2),
 ('oct', False, 14),
 ('sep', True, 1),
 ('monday', False, 9),
 ('saturday', True, 1),
 ('sunday', False, 5),
 ('thursday', False, 11),
 ('tuesday', False, 10),
 ('wednesday', False, 12),
 ('light', True, 1),
 ('mist', True, 1),
 ('2019', True, 1)]

Now we use only these 15 features to build the next linear model, using OLS (Ordinary Least Squares) from the statsmodels.api module.


You use OLS from statsmodels (instead of LinearRegression from sklearn) because it gives you a good summary of many measures, including the p-values that you need in subsequent steps for further manual elimination of features.


import statsmodels.api as sm

col = X_train.columns[rfe.support_]
X_train_rfe = X_train[col]
X_train_lm = sm.add_constant(X_train_rfe)
lm_1 = sm.OLS(y_train, X_train_lm).fit()
print(lm_1.summary())

And with this, you have created the first iteration of your OLS model with 15 features. The first two statements filter out only the 15 columns of interest, ensuring I pick only those from X_train. I then explicitly add a constant, because statsmodels' OLS does not include an intercept by default; without it, the fitted line would be forced through the origin.

Next, I call sm.OLS and fit it on the y_train and X_train data. This is the magical line that generates the coefficients of the line that best fits these 15 dimensions of data. With that, lm_1 is my first Linear Regression model.


Next, you can call for the summary of the model and check statistics like R-squared and Adjusted R-squared to understand how good a model you have, perhaps along with the AIC and BIC scores too. (I am not explaining these here; look forward to more posts on them.) The scores are decent, with an R-squared of 0.848, which implies that about 85% of the variance in the target is explained by this model.


Apart from this, you also want to check whether all the features contribute significantly to the model. This is the summary displayed:

                       OLS Regression Results                            
=================================================================
Dep. Variable:             cnt   R-squared:                0.848
Model:                     OLS   Adj. R-squared:           0.844
Method:          Least Squares   F-statistic:              184.0
Date:         Mon, 08 Jun 2020   Prob (F-statistic):   4.22e-191
Time:                 15:35:43   Log-Likelihood:          519.65
No. Observations:          510   AIC:                     -1007.
Df Residuals:              494   BIC:                     -939.5
Df Model:                   15                                  
Covariance Type:     nonrobust                                  
=================================================================
             coef  std err        t    P>|t|    [0.025    0.975]
-----------------------------------------------------------------
const      0.2519    0.036    6.969    0.000     0.181     0.323
holiday   -0.0582    0.027   -2.164    0.031    -0.111    -0.005
workingday 0.0433    0.012    3.762    0.000     0.021     0.066
temp       0.5096    0.034   14.837    0.000     0.442     0.577
hum       -0.1563    0.037   -4.188    0.000    -0.230    -0.083
windspeed -0.1863    0.025   -7.310    0.000    -0.236    -0.136
spring    -0.0509    0.021   -2.464    0.014    -0.091    -0.010
summer     0.0508    0.015    3.423    0.001     0.022     0.080
winter     0.0933    0.017    5.403    0.000     0.059     0.127
jan       -0.0345    0.017   -1.989    0.047    -0.069    -0.000
jul       -0.0529    0.018   -2.931    0.004    -0.088    -0.017
sep        0.0814    0.016    4.945    0.000     0.049     0.114
saturday   0.0536    0.014    3.694    0.000     0.025     0.082
light     -0.2475    0.026   -9.507    0.000    -0.299    -0.196
mist      -0.0563    0.010   -5.439    0.000    -0.077    -0.036
2019       0.2305    0.008   28.795    0.000     0.215     0.246
=================================================================
Omnibus:                66.260   Durbin-Watson:            2.080
Prob(Omnibus):           0.000   Jarque-Bera (JB):       159.826
Skew:                   -0.678   Prob(JB):              1.97e-35
Kurtosis:                5.383   Cond. No.                  22.1
=================================================================

If the p-value of a coefficient is greater than your chosen significance level (I use a strict 0.001 here; 0.05 is the common convention), the variable is insignificant and can be dropped.

Here I notice that 'jan' has the highest p-value, 0.047 - insignificant by this cutoff - and hence it can be dropped.

You always check both the p-value and the VIF. Even if the p-value is below the cutoff, a VIF greater than 10 tells you the variable is highly correlated with the other variables, and hence you drop it.

So, lets check the VIF values too:


import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Calculate the VIFs for the model
vif = pd.DataFrame()
X = X_train_rfe
vif['Features'] = X.columns
vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by="VIF", ascending=False)
vif

This is the output:

    Features      VIF
3   hum         29.37
2   temp        17.78
1   workingday   5.31
4   windspeed    4.73
5   spring       4.53
7   winter       3.46
6   summer       2.85
13  mist         2.29
14  2019         2.09
11  saturday     1.98
8   jan          1.67
9   jul          1.59
10  sep          1.39
12  light        1.24
0   holiday      1.18

Note that even though humidity has a very high VIF value, I would first be comfortable eliminating 'jan', as it does not contribute significantly to the model, with a p-value of 0.047.


Once I drop 'jan', I repeat the above two steps and check the p-values and VIFs again. You can see all the iterations in the full code. Over the next few iterations, I eliminate 'holiday', 'spring' and 'jul', as they show up as insignificant features. Later, 'hum' is eliminated because it has a high VIF despite a low p-value, indicating multicollinear behaviour.


With this, we arrive at 10 features, all of which are significant and not correlated with each other. This last model, with the 10 variables, is the final model we will go with. The 'lm_6' model in the code is this final model.


Is it really THE model? To answer this, we need to verify that the model lives up to all the assumptions of linear regression.


Validate the Assumptions of Linear Regression

What are the assumptions of LR?

  1. There is a Linear relationship between target and predictors/features

  2. Error terms are Independent of each other