Outliers and their treatment
Outliers need careful handling during data analysis and data preparation so that the data fed to a machine learning algorithm is sound. As the saying goes, "garbage in, garbage out". At times, outliers can completely skew results and predictions and cause very poor algorithm performance.
So, let us see what an outlier is, what types of outliers exist, and how to deal with them, with some examples.
What is an Outlier?
The dictionary definition of an outlier is "a person or thing situated away or detached from the main body or system". Very aptly put.
So, when most of the data in a data set falls within some bounded range and just a few points lie far outside it, we treat those points as outliers.
They could be valid values or ones introduced by manual errors.
How do you detect an outlier?
You could detect an outlier with simple statistical analysis of the data or through visualisation of the data.
You get basic statistics if you 'describe' a column of a pandas DataFrame, like this:
(I have taken customer data for a loan application as the example throughout this article.)
count    307499.000000
mean      27108.573909
std       14493.737315
min        1615.500000
25%       16524.000000
50%       24903.000000
75%       34596.000000
max      258025.500000
Here you can see that the 75th percentile is about 34k but the max is 258k. Clearly, the max is far larger than the mean too. This gives a sense that there are outliers in this data.
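A minimal sketch of producing such a summary, using a small made-up sample (the column name `AMT_ANNUITY` and the values are illustrative, not the actual dataset):

```python
import pandas as pd

# Hypothetical loan-application data; real column names and values will differ
df = pd.DataFrame({"AMT_ANNUITY": [16524.0, 24903.0, 34596.0, 258025.5, 1615.5]})

# .describe() reports count, mean, std, min, quartiles and max in one call
print(df["AMT_ANNUITY"].describe())
```

Comparing the 75% row against max in this output is the quickest first check for a heavy upper tail.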
Another way is to plot box plots. If you see a box plot like the one below, you know there is one very large outlier:
Fig 1: Plot showing the number of customers social circles who have defaulted on loans in the last 30 days
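A plot like Fig 1 can be sketched as follows. I use plain matplotlib here (seaborn's `boxplot` works the same way); the column name and values are illustrative, with one deliberately extreme point:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so this also runs headless
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical counts of defaulting social-circle connections in the last 30 days
df = pd.DataFrame({"DEF_30_CNT_SOCIAL_CIRCLE": [0, 0, 1, 0, 2, 1, 0, 0, 350]})

# The single value of 350 will show up as a lone point far beyond the whiskers
plt.boxplot(df["DEF_30_CNT_SOCIAL_CIRCLE"], vert=False)
plt.xlabel("DEF_30_CNT_SOCIAL_CIRCLE")
plt.savefig("boxplot.png")
```

Any point plotted beyond the whiskers is a candidate outlier worth investigating.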
Analysis of Outliers
Once you have detected outliers, you need to analyse whether they are meaningful or erroneous. Let us understand the implications of this analysis.
Firstly, check whether the outlier is purely an error. For example, if the data captures the temperature of cities and you see a city with a temperature of 500 degrees centigrade, you know it cannot be true; it is erroneous data. You treat this differently, as described in the next section.
There could be cases where the data does have a meaning. It is indeed valid but rare. Say you are analysing the income of your customers, and among them you have an Elon Musk or a Warren Buffett. Their income is valid and they are very important customers, but clearly such an income is a rare outlier. This needs a different treatment.
There are other times when you think it is a bad value. For example, the number of days someone has been employed at an office should be positive and translate to a reasonable number of years. However, you have one specific value of 365243. That translates to about 999 years and is obviously an error. But then you realise this value was entered for all those people who are unemployed. That can give a completely new insight if understood correctly, rather than being treated as an outlier.
Therefore, you need to do a thorough analysis of your outlier data before taking any action on it. Having done the analysis...
How do you treat the Outlier?
The analysis plays a key role in how you treat the outlier.
Type 1: Treat as Missing values
If the outlier seems like a completely random error, say an age of 1000 years or -45 years, you know it is an error, so you treat it as a missing value. You cannot make random guesses about how this value came in, so you prefer to mark it as missing. Then you impute it just as you treat other missing values.
For example, if I saw data like that in Fig 1, which shows one customer with 350 of his connections defaulting on loans within a month, it looks like an error and I delete it.
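A minimal sketch of this treatment, assuming an illustrative `AGE` column: impossible values are replaced with NaN and then imputed with the median, one common missing-value strategy.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with two impossible values (1000 and -45)
df = pd.DataFrame({"AGE": [25.0, 34.0, 1000.0, 41.0, -45.0, 29.0]})

# Mark impossible ages as missing...
df.loc[~df["AGE"].between(0, 120), "AGE"] = np.nan

# ...then impute them like any other missing value (median of the valid ages)
df["AGE"] = df["AGE"].fillna(df["AGE"].median())
print(df["AGE"].tolist())  # -> [25.0, 34.0, 31.5, 41.0, 31.5, 29.0]
```

The median is used rather than the mean precisely because the mean itself is sensitive to any outliers that survive the filter.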
Type 2: Cap the Outlier
Consider a case where the outlier is a valid value but very rare. Take a look at this plot of my customers' incomes:
There is just one customer whose income is way too high. I do not want to delete this customer's data, but I do not want his income to skew the rest of my data either. The average income of my customers would be completely skewed if I took this value as is. So, I cap it at the 99.9th percentile value, which is good enough for me. I neither lose his data nor skew the rest of mine.
Once I cap the value, the spread I get is shown below:
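Capping can be sketched like this, on synthetic incomes with one planted extreme value (the distribution and threshold are illustrative):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical incomes around 50k, plus one extreme but valid outlier
income = pd.Series(rng.normal(50_000, 10_000, 1000).round(2))
income.iloc[0] = 5_000_000.0

# Cap (winsorise) at the 99.9th percentile instead of deleting the row
cap = income.quantile(0.999)
capped = income.clip(upper=cap)
print(income.max(), capped.max())
```

`Series.clip` leaves every value below the cap untouched, so only the extreme tail is pulled in and the row itself is preserved.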
Type 3: Bin the tail of Outliers
While I have treated that one outlier, I see there is still a long tail of customers whose income is above the 50th percentile. There is no longer a single outlier, but a large number of customers with continuously increasing incomes. They are my high-value customers, and hence I would bin them into a separate bin.
Once I bin them, I could have 4 bins as shown here. Now they no longer look like outliers but form a bin that can contribute very differently to my analysis and machine learning algorithm. I have created categoricals out of this income data.
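Binning into four categories can be sketched with `pd.cut`; the bin edges and labels here are illustrative choices, not the article's actual ones:

```python
import pandas as pd

# Hypothetical (already capped) incomes to be bucketed
income = pd.Series([20_000, 35_000, 60_000, 90_000, 150_000, 300_000])

# Four hypothetical income bands; the open-ended top bin absorbs the long tail
bins = [0, 50_000, 100_000, 200_000, float("inf")]
labels = ["low", "mid", "high", "very_high"]
income_bin = pd.cut(income, bins=bins, labels=labels)
print(income_bin.tolist())  # -> ['low', 'low', 'mid', 'mid', 'high', 'very_high']
```

`pd.qcut` is an alternative when you want bins with roughly equal numbers of customers rather than fixed edges.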
Type 4: Delete the Outlier
The last type of treatment is for data that looks just like Type 2. However, the context could be different. Say it is the ages of people who have come for vaccinations. There is just one outlier aged 125 years, while the rest of the ages are in the range of 18 to 99 years. I know 125 years is valid but very rare. I do not mind losing this data, and hence I delete that data point, consciously.
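Deleting the rare outlier row can be sketched with a boolean filter (the ages are made up for illustration):

```python
import pandas as pd

# Hypothetical vaccination ages with one rare-but-valid 125-year-old record
df = pd.DataFrame({"AGE": [18, 45, 67, 125, 99, 33]})

# Keep only rows inside the expected range, consciously dropping the outlier
df = df[df["AGE"].between(18, 99)].reset_index(drop=True)
print(len(df))  # -> 5
```

Unlike capping, this removes the whole row, so it is only appropriate when you have decided you can afford to lose that record.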
You can see all of the above plots and analysis in the Python code here:
Very simple commands from the basic numpy and pandas libraries are used to inspect the dataframes. The plots were created with the seaborn and matplotlib libraries. Three types of outlier treatment are shown as examples. Most of it is self-explanatory once you have understood the concepts explained in the previous sections.
Outlier treatment is a very important step in data preparation and cleansing before feeding data to any machine learning algorithm. It requires thorough analysis beforehand, as the right treatment can vary completely from case to case. Broadly speaking, there are four ways of treating outliers, all of which are explained in this article.