How should we handle outliers in our data?

Author: Palak Kumar

Overview

In order to make fact-based decisions, we often need to use data to draw inferences and reach conclusions. Several powerful statistical tools are available for analysing data, but some of them are very sensitive to the presence of outliers. If we ignore the outliers or use the wrong statistical tools for analysis, we may end up drawing the wrong conclusions. Hence, it is important for us to understand how to handle outliers.

The first question is: what are outliers? Outliers are extreme values that deviate significantly from the other observations in our data set. At first glance, an outlier looks so different from the rest of the data that it seems not to belong with it. For example, in the data set 10, 20, 30, 25, 15, 200, the value 200 is probably an outlier and does not belong with the rest of the data points.

In this article, we will look at what causes outliers and how we should deal with the outliers that may be present in the data. Should we ignore the outliers in our data, delete them, or transform them in some way before we use the data?

Causes of Outliers

One should never delete any outliers in the data without proper investigation since outliers may contain a lot of valuable information which will be lost if the outliers are deleted. Hence, it is important that we question and analyse our outliers before determining a course of action. Here are a few common causes of outliers in a data set:

Data entry errors: These are caused by human error during data collection, recording, or entry. For example, a customer's annual orders total one thousand, but the person entering the data accidentally adds an extra zero. The recorded order becomes 10 times higher than the true value and will stand out as an outlier compared with the other customers.
Measurement errors or instrument errors: This is one of the most common reasons for outliers. This type of error occurs when a measuring instrument is faulty. For example, suppose that 1 out of 10 voltmeters is faulty. The data collected on the other 9 voltmeters would be correct, but the readings from the faulty voltmeter would be higher or lower than the rest of the data collected.
Sampling errors: Consider an example where we have to measure the weight of athletes but, by mistake, we also include some wrestlers in the sample. This inclusion is very likely to cause outliers in the data set.
Data processing errors: When data is extracted from multiple sources, for example during data mining, manipulation or extraction errors can introduce outliers into the data set.
Natural novelties in data: Outliers that are not caused by any error are called natural outliers. For example, in a class of 50 students, 45 students perform at an average level in a test, while 3 students perform excellently and 2 students perform poorly. The students who performed excellently and the students who performed poorly are outliers, but they are not caused by an error.

How do outliers affect our results?

Consider that you have an online medical equipment business and you are optimizing revenue metrics such as average order value or revenue per visitor. In the data you have collected, you find some outliers. On analysing them, you find that certain resellers ordered your products in bulk, and the value of those orders was far from that of your typical orders. Those orders were the cause of your outliers. However, those resellers could be very loyal customers, so you have to keep them in your data set.
Outlier Fig 1

When you have a small sample size, outliers can significantly distort your results. Outliers have many unfavourable impacts. For example, a large customer order received in one month may significantly raise your average order intake per month. Using this new average can cause you problems in subsequent months when no such large orders are received. In statistical analysis, outliers can affect the results of normality tests and invalidate basic assumptions, such as constant variance in regression analysis. Let us consider an example of data with and without outliers.

Table

Here we can clearly see that the outliers significantly affect the results in the first scenario. Without the outliers, our mean is 5.45, but with the outliers it increases to 30, and the standard deviation changes completely.
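
The same effect can be checked on the small data set from the overview (10, 20, 30, 25, 15, 200). The sketch below is only an illustration using Python's standard statistics module. The helper name summarize is our own, and the numbers it produces are for the overview data set, not for the table above.

```python
import statistics

# Data set from the overview; 200 is the suspected outlier.
data_with_outlier = [10, 20, 30, 25, 15, 200]
data_without_outlier = [x for x in data_with_outlier if x != 200]


def summarize(values):
    """Return the mean, median and sample standard deviation of values."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation (n - 1)
    }


with_outlier = summarize(data_with_outlier)
without_outlier = summarize(data_without_outlier)

for label, stats in [("with outlier", with_outlier),
                     ("without outlier", without_outlier)]:
    print(f"{label}: mean={stats['mean']:.2f}, "
          f"median={stats['median']:.2f}, stdev={stats['stdev']:.2f}")
```

In this sketch a single extreme value moves the mean from 20 to 50, while the median only moves from 20 to 22.5. That difference is one reason the median is often preferred when outliers are present.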

How to detect outliers?

Outliers can be detected using certain statistical plots. The most common are the box plot and the scatter plot.

In this box plot, you can see extreme outliers in red and mild outliers in green.
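
Box plots usually flag outliers with a rule based on the interquartile range (IQR). A common convention, assumed here because the article does not state which rule its plot used, treats points more than 1.5 times the IQR beyond the quartiles as mild outliers and points more than 3 times the IQR beyond them as extreme outliers. The sketch below applies that rule to the data set from the overview. The function name classify_outliers and the choice of the "inclusive" quartile method are our assumptions; different tools compute quartiles slightly differently and may flag borderline points differently.

```python
import statistics


def classify_outliers(values, mild_k=1.5, extreme_k=3.0):
    """Split values into mild and extreme outliers using IQR fences.

    Assumed convention: a point is mild if it lies more than mild_k * IQR
    beyond the first or third quartile, and extreme if it lies more than
    extreme_k * IQR beyond them.
    """
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    mild, extreme = [], []
    for x in values:
        # Distance outside the quartile box (negative if x is inside it).
        distance = max(q1 - x, x - q3)
        if distance > extreme_k * iqr:
            extreme.append(x)
        elif distance > mild_k * iqr:
            mild.append(x)
    return mild, extreme


data = [10, 20, 30, 25, 15, 200]
mild, extreme = classify_outliers(data)
print("mild:", mild, "extreme:", extreme)  # mild: [] extreme: [200]
```

A numeric rule like this gives the same kind of answer as reading the box plot by eye, and it can be applied when there are too many variables to plot one by one.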

A scatter plot also shows the outliers, but it is quite difficult to tell which points are extreme outliers, which are mild outliers, and whether a given point is an outlier at all.
Histogram Plot

A histogram is also very useful for identifying outliers in a data set. In the above histogram, we can see that the 12th item is an outlier. Depending on the data we are working with, we can use different kinds of plots to detect outliers.

Dealing with Outliers

Deleting the values: You can delete outliers if you know that they are wrong, or if the cause of the outlier is never going to occur in the future. For example, in a data set of people's ages, the usual ages lie between 0 and 90, but there is an entry with an age of 150, which is practically impossible. So, we can safely drop the value 150.
Changing the values: We can also change the values when we know the reason for the outliers. Consider the earlier example of measurement or instrument errors, where we had 10 voltmeters, one of which was faulty. Here we can take another set of readings using a correct voltmeter and use them to replace the readings taken by the faulty voltmeter.
Data transformation: Data transformation is useful when we are dealing with highly skewed data sets. Transforming the variables can reduce the influence of outliers. For example, taking the natural log of the values compresses the variation caused by the extreme values. Note that a log transformation can only be applied to data sets in which all values are positive.
Using different analysis methods: You could also use statistical methods that are less affected by the presence of outliers, for example, comparing data sets using the median instead of the mean, or using equivalent nonparametric tests.
Valuing the outliers: If there is a valid reason for the outlier to exist and it is a part of our natural process, we should investigate the cause of the outlier, as it can provide valuable clues that help us better understand our process performance. Outliers may be hiding information that could be invaluable for improving process performance. You need to take the time to understand the special causes that contributed to these outliers. Fixing these special causes can give you a significant boost in process performance and improve customer satisfaction. For example, normal delivery of orders takes 1-2 days, but a few orders took more than a month to complete. Understanding why they took a month and fixing the process can help future customers, as they would not be affected by such long wait times.
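
To make the data transformation option more concrete, the sketch below takes the natural log of the data set from the overview and compares how far the largest value sits from the median before and after. It is only an illustration: the helper names log_transform and max_to_median_ratio are our own, and the max-to-median ratio is a crude measure chosen for simplicity, not a standard outlier statistic.

```python
import math
import statistics


def log_transform(values):
    """Return the natural log of each value; reject non-positive input."""
    if any(x <= 0 for x in values):
        raise ValueError("log transform needs strictly positive values")
    return [math.log(x) for x in values]


def max_to_median_ratio(values):
    """How many times larger the largest value is than the median."""
    return max(values) / statistics.median(values)


data = [10, 20, 30, 25, 15, 200]
log_data = log_transform(data)

raw_ratio = max_to_median_ratio(data)      # about 8.9
log_ratio = max_to_median_ratio(log_data)  # about 1.7
print(f"raw: {raw_ratio:.2f}, after log: {log_ratio:.2f}")
```

The value 200 is still the largest after the transformation, but it is much closer to the rest of the data, so statistics such as the mean are less dominated by it. Any conclusions drawn on the log scale need to be translated back to the original units before they are reported.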

The above are a few common ways of dealing with outliers. If there are other methods that have worked for you, feel free to share your experience of how you handled outliers.

Conclusion

Outliers are extreme values that deviate from the rest of the data and can distort our analysis results. There are many causes of outliers in a data set, such as sampling errors and measurement errors. Before dealing with outliers, we first need to detect them, and this can be done using plots such as the box plot, scatter plot, and histogram. We should not just drop the outliers from our analysis, since in certain cases outliers can give valuable information about our processes. There are many ways to deal with outliers in data, and there is no quick fix or magic formula for handling them. In most cases, human expertise and experience are needed to decide how best to handle the outliers in our data.
