Steps towards Data Science or Machine Learning Models
Having completed the basics of K-Means clustering in the last 3 weeks, I was tempted to take you through an example problem through code. That is when I realised that we do a lot of pre-modelling activities before we can jump into the model itself.
So today, instead of going into K-Means modelling, I thought, why not look at the steps that are necessary (though perhaps not sufficient) before modelling of any sort?
Many want to learn Data Science and Machine Learning. And there is enough and more material available on the internet to learn. And sometimes that becomes the problem.
Libraries like scikit-learn make machine learning look like child's play until you start solving real-world problems. Modelling with them typically consists of 2 steps: fit and predict. The fit() method fits a model to the available data. You then use that model to predict against new or unseen data. Doesn't that look simple?
In fact, here is a snippet of the code from the scikit-learn library which shows the simplicity of the exercise and the coding involved:
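The snippet itself is not reproduced in this text. As a stand-in, here is a minimal sketch of the fit/predict pattern. The iris dataset and the LogisticRegression estimator are my assumptions for illustration, not necessarily what the original snippet used.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)          # load a small built-in dataset (assumed for this sketch)
model = LogisticRegression(max_iter=1000)  # create the model
model.fit(X, y)                            # fit the model to the available data
predictions = model.predict(X[:5])         # predict (here, simply on the first five rows)
```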
Four lines of code, of which one creates the model and one uses it to predict. So, is data science really that simple and easy?
Much of the work happens before the model is created.
So, today I would like to list a set of basic steps that have to be done before you get into modelling. This is just an indicative set of steps, nowhere near exhaustive, but it can serve as a starting point for modelling with many of the algorithms that are fundamental to data science.
A bird's-eye view of the steps involved is provided in this mindmap:
The first step is to explore your data and understand it. Then you clean it and do some basic transformations. After this, you are in a position to do a detailed Exploratory Data Analysis that gives you deep insights into your data. And finally, you prepare the data as expected by the algorithm.
Reading and understanding the data
This involves understanding the size, shape, data types and column names of the data, as well as the multiple sources it comes from. Here we also take a look at which data is categorical in nature and which is continuous.
You could even get some basic statistics like the minimum, the maximum, the average and the 75th percentile to get a feel for the spread in the numerical data.
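A first look at the data could be sketched with pandas as below. The DataFrame and its column names are hypothetical toy data that I made up for illustration; in practice you would load your own file.

```python
import pandas as pd

# Hypothetical toy data standing in for a real dataset.
df = pd.DataFrame({
    "age": [25, 32, 47, 51, 38],
    "income": [40000.0, 52000.0, 88000.0, 91000.0, 61000.0],
    "city": ["Pune", "Delhi", "Pune", "Chennai", "Delhi"],
})

print(df.shape)             # (number of rows, number of columns)
print(df.dtypes)            # data type of each column
print(df.columns.tolist())  # column names

# Separate categorical and continuous columns by their dtype.
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()
continuous_cols = df.select_dtypes(include="number").columns.tolist()

# Basic statistics: count, mean, std, min, 25th/50th/75th percentiles, max.
summary = df[continuous_cols].describe()
print(summary)
```

Splitting by dtype is only a first approximation: a numeric column such as a postal code may still be categorical in meaning.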
Data Cleansing and transformation
First, you check for null values and see if you can treat them meaningfully. Else you drop that data as it could create problems later on.
This means you drop columns that have a high percentage of null values. For other columns with nulls, based on the data and the meaning of the column, you can impute the missing values using various mechanisms, the simplest being to impute with 0, the mean or the median. There are advanced imputation techniques too. Sometimes it may be better to leave the values unimputed, as you do not want to skew the data in favour of or against a value. You may choose to drop the specific rows instead.
The transformations at this stage ensure that your data allows for a meaningful exploratory data analysis.
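The null-handling choices above could be sketched as follows. The toy data, the column names and the 50% threshold are assumptions of mine; what counts as a "high percentage" depends on your data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing values.
df = pd.DataFrame({
    "age": [25, np.nan, 47, 51, 38],
    "income": [40000.0, 52000.0, np.nan, 91000.0, 61000.0],
    "mostly_empty": [np.nan, np.nan, np.nan, np.nan, 1.0],
    "city": ["Pune", "Delhi", None, "Chennai", "Delhi"],
})

# Percentage of nulls in each column.
null_pct = df.isnull().mean() * 100

# Drop columns whose null percentage exceeds an assumed threshold of 50%.
threshold = 50
cleaned = df.drop(columns=null_pct[null_pct > threshold].index)

# Impute numeric columns: median for age, mean for income.
cleaned["age"] = cleaned["age"].fillna(cleaned["age"].median())
cleaned["income"] = cleaned["income"].fillna(cleaned["income"].mean())

# Drop the rows that still contain nulls (here, the row with a missing city).
cleaned = cleaned.dropna()
```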
Firstly, you can plot graphs and check for outliers. If there are any, treat them as detailed in my article on treating outliers.
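The referenced article is not reproduced here, so as a rough sketch I assume one common rule: flag values lying more than 1.5 times the interquartile range (IQR) outside the quartiles, and cap them at those limits. The numbers are made up.

```python
import pandas as pd

# Hypothetical numeric column with one obvious outlier.
values = pd.Series([12, 14, 15, 15, 16, 17, 18, 19, 95], name="value")

# Interquartile range and the usual 1.5 * IQR fences (an assumed rule).
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside the fences are flagged as outliers.
outliers = values[(values < lower) | (values > upper)]

# One possible treatment: cap the values at the fences instead of dropping them.
capped = values.clip(lower=lower, upper=upper)
```

Whether to cap, drop or keep an outlier depends on whether it is a data error or a genuine extreme value.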
You may choose to create new variables through binning or through derivations from existing variables.
You can also transform categorical variables into numerical ones through techniques like one-hot encoding or label encoding.
Now you are ready to start Exploratory Data Analysis.
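Binning, derived variables and the two encodings could look like this in pandas. The columns, bin edges and labels are hypothetical choices of mine.

```python
import pandas as pd

# Hypothetical toy data.
df = pd.DataFrame({
    "age": [22, 35, 47, 63, 29],
    "income": [30000, 52000, 88000, 45000, 61000],
    "dependents": [0, 2, 3, 1, 0],
    "city": ["Pune", "Delhi", "Pune", "Chennai", "Delhi"],
})

# Binning: turn a continuous variable into ordered groups (bin edges are assumed).
df["age_group"] = pd.cut(
    df["age"], bins=[0, 30, 50, 100], labels=["young", "middle", "senior"]
)

# Derivation: build a new variable from existing ones.
df["income_per_person"] = df["income"] / (df["dependents"] + 1)

# Label encoding: one integer code per category (codes follow alphabetical order).
df["city_label"] = df["city"].astype("category").cat.codes

# One-hot encoding: one indicator column per category.
city_onehot = pd.get_dummies(df["city"], prefix="city")
```

Label encoding implies an order between the categories, so it suits ordinal variables better; one-hot encoding avoids that implied order at the cost of more columns.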
Exploratory Data Analysis
This is a very important step. This is when you really get familiar with your data through data visualization and analysis. You can see patterns and correlations between the predictors and the target variable, or between the predictors themselves.
Broadly these are the steps involved in EDA.
Understanding correlation between data
Plotting and visualising the data for any of the above steps
Checking for imbalance in data
You do univariate analysis for categorical variables using bar charts and continuous variables using histograms as shown here:
Univariate analysis - Categorical Variables
Univariate Analysis - Continuous Variables
You could even draw box plots to understand the spread of the data from a different view.
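A minimal matplotlib sketch of these univariate plots is given below. The data is hypothetical, and the "Agg" backend is chosen only so that the code runs without a display; in a notebook you would drop that line and simply show the figure.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, assumed so the sketch runs anywhere
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy data.
df = pd.DataFrame({
    "city": ["Pune", "Delhi", "Pune", "Chennai", "Delhi", "Pune"],
    "income": [30, 52, 88, 95, 41, 79],
})

fig, axes = plt.subplots(1, 3, figsize=(12, 3))

# Categorical variable: bar chart of category counts.
city_counts = df["city"].value_counts()
city_counts.plot(kind="bar", ax=axes[0], title="city (bar chart)")

# Continuous variable: histogram of values.
df["income"].plot(kind="hist", bins=5, ax=axes[1], title="income (histogram)")

# Continuous variable: box plot showing median, quartiles and extremes.
df.boxplot(column="income", ax=axes[2])
axes[2].set_title("income (box plot)")

fig.tight_layout()
```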
Then you do a Bivariate analysis that analyses the relationship between two variables. It can include heatmaps for correlation analysis, pair plots for continuous-continuous data relationships, box plots for categorical-continuous relationships and bar plots for categorical-categorical relationships as shown here.
Heatmap - Correlation Analysis
Pair Plot: Continuous-Continuous Variables Relationships
Box Plots: Categorical-Continuous variable relationships
Bar Plots: Categorical-Categorical Relationships
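The four bivariate views could be sketched with pandas and matplotlib as follows. The data is hypothetical, and I use plain matplotlib rather than a dedicated plotting library, which is only one of several ways to draw these charts.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, assumed so the sketch runs anywhere
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy data.
df = pd.DataFrame({
    "age": [22, 35, 47, 63, 29, 51],
    "income": [30, 52, 88, 95, 41, 79],
    "spend": [12, 20, 25, 22, 18, 27],
    "city": ["Pune", "Delhi", "Pune", "Chennai", "Delhi", "Pune"],
    "owns_car": ["no", "yes", "yes", "yes", "no", "yes"],
})
numeric_cols = ["age", "income", "spend"]

# Heatmap: correlation matrix of the continuous variables.
corr = df[numeric_cols].corr()
fig, ax = plt.subplots()
image = ax.imshow(corr.to_numpy(), cmap="coolwarm", vmin=-1, vmax=1)
ax.set_xticks(range(len(numeric_cols)))
ax.set_xticklabels(numeric_cols)
ax.set_yticks(range(len(numeric_cols)))
ax.set_yticklabels(numeric_cols)
fig.colorbar(image, ax=ax)

# Pair plot: scatter plots for every continuous-continuous pair.
pd.plotting.scatter_matrix(df[numeric_cols], figsize=(6, 6))

# Box plot: a continuous variable split by a categorical one.
df.boxplot(column="income", by="city")

# Bar plot: counts for each combination of two categorical variables.
counts = pd.crosstab(df["city"], df["owns_car"])
counts.plot(kind="bar")
```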
You also check for imbalance in the target data so that you can model it through the correct means. At every stage, you draw some inferences about the data on hand.
You can see all of the above steps in detail through data and coding in this GitHub repo on Exploratory Data Analysis. Sometimes the above steps may be slightly iterative.
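Checking the target for imbalance can be as simple as looking at the share of each class. The target below is a made-up binary column with a 95/5 split.

```python
import pandas as pd

# Hypothetical binary target: 95 negatives and 5 positives.
target = pd.Series([0] * 95 + [1] * 5, name="churn")

# Share of each class in the target.
class_share = target.value_counts(normalize=True)
print(class_share)

# Ratio of the largest class to the smallest one.
imbalance_ratio = class_share.max() / class_share.min()
```

A large ratio suggests that plain accuracy will be misleading and that the modelling approach and evaluation metric need to account for the imbalance.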
Next, you start preparing the data for the model
If you have multi-level categorical variables, you create dummy variables to make them numerical columns.
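As a sketch, dummy variables for a three-level column can be created with pandas. The column name and its levels are hypothetical, and dropping the first level is an assumption of mine: it is often done for linear models because the dropped level is implied when all the other dummies are 0.

```python
import pandas as pd

# Hypothetical data with one multi-level categorical column.
df = pd.DataFrame({
    "furnishing": ["furnished", "semi-furnished", "unfurnished", "furnished"],
    "price": [120, 95, 70, 130],
})

# One 0/1 column per level, dropping the first level (assumed choice).
dummies = pd.get_dummies(
    df["furnishing"], prefix="furnishing", drop_first=True, dtype=int
)

# Replace the original column with its dummy columns.
prepared = pd.concat([df.drop(columns="furnishing"), dummies], axis=1)
```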
You also need to scale numerical features. Typically, you scale features after splitting your data into train and test data so that there is no leakage of information from the test data into the scaling process. The various scaling techniques have already been discussed.
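The order of operations matters here, so a small sketch may help. The data is randomly generated, and StandardScaler is just one of the possible scaling techniques, chosen by me for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical data: 100 rows, 3 numerical features, binary target.
rng = np.random.default_rng(0)
X = rng.normal(loc=50, scale=10, size=(100, 3))
y = rng.integers(0, 2, size=100)

# Split first ...
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# ... then learn the scaling parameters from the training data only.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# The test data is transformed with the mean and standard deviation of the train data.
X_test_scaled = scaler.transform(X_test)
```

Calling fit_transform on the full dataset before splitting would let the test rows influence the mean and standard deviation, which is exactly the leakage described above.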
If you have a huge number of features, you could also perform feature selection using the techniques discussed earlier in Feature Selection Techniques, including Recursive Feature Elimination, manual elimination and a balanced approach, or even Principal Component Analysis, which I have not discussed so far.
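Of these, Recursive Feature Elimination is the easiest to show in a few lines. This is only a sketch on synthetic data; the linear regression estimator and the choice of keeping 3 features are my assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic data: 10 features, of which only 3 carry signal.
X, y = make_regression(
    n_samples=200, n_features=10, n_informative=3, random_state=0
)

# Repeatedly fit the estimator and drop the weakest feature until 3 remain.
selector = RFE(estimator=LinearRegression(), n_features_to_select=3)
selector.fit(X, y)

# Indices of the features that were kept.
selected = [i for i, keep in enumerate(selector.support_) if keep]
print(selected)
```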
Now that you have selected the features of importance, split the data into train and test sets and further scaled the data, you are finally ready to get into modelling.
Without all of these steps, and maybe more, modelling would be a case of garbage in, garbage out. You may get very unstable models or run into errors thrown by libraries because certain assumptions are violated.
The above process is often iterative: both the data and the insights you draw from it keep improving as you go through it. It is only after this that you can start using the modelling techniques provided by various libraries and derive the benefits from them.
It is often said that a data scientist spends the majority of his/her time on these steps rather than on modelling itself. Without knowing any of the above, it would be futile to just learn modelling using libraries.