Sai Geetha M N

Linear Regression Through Code - Part 1

#Tutorial

In an earlier blog post, I discussed "What is Regression?" and the basic linear equation. Linear regression is one of the simplest algorithms, yet it has solved many problems historically and remains powerful for many use cases. Its predictions are highly explainable, which is why it is favoured in the data science community.


When you have one independent variable and one dependent variable, we call it simple linear regression. In practice this is rarely used, as few real problems are that one-dimensional. When multiple independent variables influence the dependent (target) variable, it is called Multiple Linear Regression (MLR).
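To make the distinction concrete, a multiple linear regression fits an equation of the form y = b0 + b1*x1 + b2*x2 + ... to the data. The following is a minimal sketch on synthetic data, not the approach used later in this series: the coefficients 2, 3 and -1 and all variable names are made up for illustration, and the fit uses NumPy's generic least-squares solver.

```python
import numpy as np

# Synthetic data: two independent variables and one target (illustrative only)
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(200, 2))
y = 2.0 + 3.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(0, 0.1, size=200)

# Add a column of ones so the first fitted coefficient is the intercept b0
A = np.column_stack([np.ones(len(X)), X])

# Ordinary least squares: find the coefficients that minimise ||A @ coef - y||
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
print(np.round(coef, 2))  # should be close to the true values 2, 3 and -1
```

With a single column in X this reduces to simple linear regression; each additional column adds one more coefficient to estimate.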


This post walks you through a problem statement that is suitable for Multiple Linear Regression and shows how it can be solved using MLR.


The Problem

This is the well-known problem of predicting the demand for a bike-sharing system in a particular city, one that many beginners work on.


The complete data and code are available here: https://github.com/saigeethamn/DataScience-LinearRegression


Solution:

The solution provided here takes you through the entire process, from understanding the data to validating the model on the test data. As this is one of my initial posts on an ML model, I plan to cover the whole model-building process.


In Part 1 (this post), I will only describe the preliminary data understanding, exploratory data analysis, and data preparation parts required for the building of the model. This part would be very similar for most algorithms.


In Part 2, I will walk you through the actual model development, validation against test data, and also the validation of the assumptions of Linear Regression.


The whole solution is in Python and should be easy to follow even if you are not familiar with the language.


Understanding Data

The first step is always to get familiar with your data before you decide what type of algorithm can help you. How do you get familiar with the data?


First, examine the sample data

# Read the data and do a preliminary inspection
import pandas as pd

bikes = pd.read_csv("day.csv")
bikes.head()

Next, see the summary of the data

You explore the data and its data types, along with a summary of all the numerical columns, using commands like:

bikes.shape
# (730,16)
bikes.info()
bikes.describe()

Finally, plot the target data

Here the target column you are trying to predict is the count of bikes hired on any given day. You can plot the target against the dates to see the general trend over time:


# View the general trend of the rental count over time
import matplotlib.pyplot as plt

plt.figure(figsize=[25, 5])
plt.plot(bikes['dteday'], bikes['cnt'])
plt.show()

You also need to understand the data through the descriptions in the data dictionary, which is provided in the Appendix at the end of this post.


Once we have a preliminary understanding, we move on to data cleansing.


Data Cleansing


What does data cleansing involve, and why is it necessary? No machine learning algorithm can help if the data is unclean: garbage in, garbage out holds for any model. Sometimes the model will simply fail to execute if there are missing or null values. At other times the whole model can become useless because outliers distort the learning process.


Hence this is a very important step in any model development.


Data cleansing typically involves these common steps:

  1. Drop unnecessary data

  2. Inspect for null values and take corrective measures

  3. Transform categorical variables

  4. Check for Outliers and take corrective measures again


Drop unnecessary data

Not all the data we have influences the target variable. Based on domain knowledge, we have to figure out which variables have no relation to the target and remove them.


In fact, we will see later that even variables with very little influence on the target are better removed. Ideally we want only the most significant variables so that we get actionable insights from the model.


There might be other use-cases where we do not want actionable insights but we would prefer highly accurate predictions and in such cases, we deal with unnecessary data differently and probably more liberally.


In this case, we drop the columns 'instant' and 'dteday'. The 'instant' variable is just a row index, and since we are not doing a time series analysis here, we do not need the date.

bikes.drop(['instant','dteday'], axis=1, inplace=True)

Inspect for null values and take corrective measures


This data does not have null values, so there is nothing to do here. You can check for null values with this simple statement:

bikes.isnull().sum()
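When a dataset does contain nulls, two common corrective measures are dropping the affected rows or imputing a replacement value. The sketch below uses a small made-up DataFrame (the values and the name `demo` are assumptions for illustration, not taken from day.csv) and shows both options; which one is appropriate depends on how much data is missing and why.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a couple of missing values
demo = pd.DataFrame({
    "temp": [14.1, np.nan, 8.2, 9.3],
    "hum": [80.5, 69.6, np.nan, 43.7],
    "cnt": [985, 801, 1349, 1600],
})

# Count the nulls per column, as in the statement above
null_counts = demo.isnull().sum()
print(null_counts)

# Option 1: drop every row that has at least one null
dropped = demo.dropna()

# Option 2: fill each null with the median of its column
imputed = demo.fillna(demo.median(numeric_only=True))
print(imputed)
```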

Transform categorical variables


The columns 'season', 'yr', 'mnth', 'weekday', 'weathersit' are all categorical variables. You check on the distinct values they have and convert them to meaningful category names.


For example, 'season' has the numbers 1 to 4 indicating the four seasons. Based on the data dictionary, these can be converted to spring, summer, fall, and winter. The following statement gives you the count of each season value:

bikes['season'].value_counts()

This shows that there are 180 days of spring, 184 days of summer, and so on in the two years of data we have.

You convert it to meaningful categories with:

bikes['season'] = bikes['season'].map({1:'spring',2:'summer',3:'fall',4:'winter'})

You repeat this for the rest of the categorical variables as well.
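As a sketch of what repeating the mapping might look like for 'mnth' and 'weathersit', the example below uses a tiny stand-in DataFrame. The label strings are illustrative assumptions; check the codes against the data dictionary before applying them to the real data.

```python
import pandas as pd

# Tiny stand-in for the bikes DataFrame (illustrative values only)
demo = pd.DataFrame({"mnth": [1, 6, 12], "weathersit": [1, 2, 3]})

# Month numbers 1..12 mapped to short month names
months = ["jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"]
month_names = dict(enumerate(months, start=1))

# Weather codes mapped to assumed labels; verify against the data dictionary
weather_names = {1: "clear", 2: "mist", 3: "light_rain", 4: "heavy_rain"}

demo["mnth"] = demo["mnth"].map(month_names)
demo["weathersit"] = demo["weathersit"].map(weather_names)
print(demo)
```

Note that `map` returns NaN for any value missing from the dictionary, so a quick `isnull().sum()` after mapping is a cheap way to catch an incomplete mapping.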


Note that using numbers for categories implies a sense of order or importance. The algorithm might treat 1 as smaller than 2, which is smaller than 3, and so on, or it may treat Spring as the most important category because it is 1. To avoid imposing this kind of order on nominal variables, you convert them into category names.
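Category names cannot be fed to a regression directly, so a common follow-up step is to turn them into indicator (dummy) columns, which carry no implied order. The following is a minimal sketch with made-up data, assuming pandas' `get_dummies` is used; it is not necessarily how the rest of this series handles the encoding.

```python
import pandas as pd

# Illustrative data with a named nominal column
demo = pd.DataFrame({
    "season": ["spring", "summer", "fall", "winter", "spring"],
    "cnt": [985, 4500, 5200, 3100, 1200],
})

# One 0/1 column per category; drop_first removes one level (here 'fall',
# the first alphabetically) because it is implied when all others are 0
dummies = pd.get_dummies(demo["season"], drop_first=True, dtype=int)

encoded = pd.concat([demo.drop(columns="season"), dummies], axis=1)
print(encoded)
```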


Check for Outliers

Finally, you check for outliers in the numerical data. A box plot is a very good way to check for outliers (points that lie too far away). Any points beyond the whiskers of the plot are outliers, i.e. an outlier is any value that lies more than one and a half times the length of the box (the interquartile range) beyond either end of the box.


You plot and visually inspect to see if there are outliers.


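import matplotlib.pyplot as plt
import seaborn as sns

cont_vars = ['temp','atemp','hum','windspeed','cnt']

plt.figure(figsize=[15, 8])
# One box plot per continuous variable, laid out on a 2x3 grid
for i, var in enumerate(cont_vars, start=1):
    plt.subplot(2, 3, i)
    # Keyword x= keeps the call unambiguous across seaborn versions
    sns.boxplot(x=bikes[var])

plt.show()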

You notice that there are hardly any outliers and hence no treatment is necessary.
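The whisker rule described above can also be checked numerically instead of visually. Here is a minimal sketch; the helper name `iqr_outliers` and the sample values are hypothetical, chosen only to illustrate the 1.5 x IQR rule.

```python
import pandas as pd

def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values lying more than k * IQR beyond the quartiles."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1  # the length of the box in a box plot
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return series[(series < lower) | (series > upper)]

# Made-up values with one obvious outlier
values = pd.Series([10, 12, 11, 13, 12, 11, 10, 95])
outliers = iqr_outliers(values)
print(outliers.tolist())
```

On the real data you could apply the same helper to each column in the list of continuous variables and count how many values it flags.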


Exploratory Data Analysis

This is the stage where you really get a proper understanding of the data, visualize it in various ways, and get a feel for which variables might affect the target most. Are there relationships that seem obvious, or is there some pattern that defies your domain understanding?

This helps you create your own hypothesis that you can validate later.


This part is done through univariate and bivariate analysis of all the data on hand.


Univariate Analysis

Here we look at all continuous and categorical variables to understand the spread and the behavior of the potential features independently. If you want to understand the types of variables, please read this post on the different types of variables.


A picture is worth a thousand words and hence we plot graphs for all the variables to understand their characteristics.


We typically use a count plot for categorical variables and a distribution plot (dist plot) for continuous variables.

cat_vars = ['season','yr','mnth','weekday','workingday','holiday','weathersit']

plt.figure(figsize=[15, 10])
plt.subplots_adjust(hspace=0.50,wspace=0.25)
for i,var in enumerate(cat_vars):
    plt.subplot(3,3,i+1)
    sns.countplot(x=bikes[var])
    plt.xticks(rotation=45)
plt.show()

What we notice here are some obvious things: the counts for seasons, months, and days of the week match the number of days in each, so there are no surprises, and the same holds for years. The working day and holiday distributions are also as expected, with a huge skew.


The only insight we get here is that there are 400+ clear days, 200+ misty days, a few days of light rain, and no days with heavy rain in the two years of data we have.


Similarly, you draw distribution plots for the continuous variables.

Here too you get a fair idea of the distribution of actual temperature, the temperature felt, humidity, wind speed, and the number of bikes rented. As expected, most of these are close to a normal distribution, with slight variations.


Bivariate Analysis

This is done to understand the relationship between two variables or features, and in particular the relationship of each feature with the target variable.
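One simple way to start a bivariate analysis is to compute the correlation of each numerical feature with the target. The sketch below is an illustration on synthetic data, where 'cnt' is deliberately constructed to rise with 'temp' and fall with 'windspeed'; the numbers are made up and say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic features and a target built from them (illustrative only)
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    "temp": rng.uniform(5, 35, 300),
    "hum": rng.uniform(30, 90, 300),
    "windspeed": rng.uniform(0, 30, 300),
})
demo["cnt"] = (100 * demo["temp"] - 40 * demo["windspeed"]
               + rng.normal(0, 300, 300))

# Pearson correlation of every feature with the target, strongest first
corr_with_target = (
    demo.corr()["cnt"].drop("cnt").sort_values(ascending=False)
)
print(corr_with_target)
```

For a visual version of the same idea, `sns.pairplot` draws scatter plots for every pair of columns and `sns.heatmap` can display the full correlation matrix.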