Variation is a part of life: every process has some. When we collect data from a process, we often want to determine its central value, meaning the single value that best represents the data set. The question is how to calculate this central value.
The most common measure is the mean, or average value. For example, after an exam, a teacher might report the average mark that the class obtained.
Mean Value: This is the sum of all the values divided by the number of values. However, the mean is not appropriate in all situations. For example, if we want to find the average pay within a company, the mean may not represent the central value when there are outliers in the data (e.g. the CEO is paid a very large salary compared to workers at the bottom of the pyramid). In general, if the distribution of the data is not symmetric or it contains outliers (very large or very small values), the mean may not give a true indication of the central value of the data set. Several alternative measures are used to address this problem.
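To make the salary example concrete, here is a minimal sketch of the mean and its sensitivity to a single outlier. The salary figures are made up for illustration and do not come from any real company.

```python
def arithmetic_mean(values):
    """Return the sum of the values divided by the number of values."""
    if not values:
        raise ValueError("need at least one value")
    return sum(values) / len(values)

# Hypothetical annual salaries in thousands; the last entry stands in for the CEO.
salaries = [40, 42, 45, 48, 50, 52, 55, 1000]

mean_all = arithmetic_mean(salaries)
mean_without_ceo = arithmetic_mean(salaries[:-1])

print(f"mean with CEO:    {mean_all:.1f}")
print(f"mean without CEO: {mean_without_ceo:.1f}")
```

With these made-up numbers, one salary moves the mean from about 47 to 166.5, a value larger than seven of the eight salaries.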
Modified Mean: To calculate the modified mean, the largest value and the smallest value are deleted from the data set and the average is computed for the remaining items. This approach is used, for example, to average the scores of several judges in some Olympic competitions.
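A minimal sketch of the modified mean as described above. The function name `modified_mean` and the judges' scores are hypothetical, chosen only for illustration.

```python
def modified_mean(values):
    """Mean after deleting one smallest and one largest value."""
    if len(values) < 3:
        raise ValueError("need at least three values")
    ordered = sorted(values)
    kept = ordered[1:-1]  # drop the single lowest and single highest value
    return sum(kept) / len(kept)

# Hypothetical scores from five judges; one judge scores far lower than the rest.
scores = [9.0, 9.5, 8.5, 9.0, 6.0]

print(f"plain mean:    {sum(scores) / len(scores):.2f}")
print(f"modified mean: {modified_mean(scores):.2f}")
```

Only one value is removed from each end, so if the same extreme value occurs twice, one copy stays in the average.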
Trimmed Mean: To calculate the trimmed mean, we drop the top x% and the bottom x% of the data points and then calculate the average of the remaining items. For example, to calculate the 5% trimmed mean of 80 data points, we sort the data points, drop the bottom 4 points and the top 4 points (5% of 80 is 4), and take the average of the remaining 72 data points. Similarly, we can calculate the 20% trimmed mean, and so on. The 25% trimmed mean is also called the interquartile mean.
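The steps above can be sketched as follows. The function name `trimmed_mean` is a hypothetical choice, and truncating the number of dropped points to a whole number is an assumption for when x% of the data is not an integer.

```python
def trimmed_mean(values, percent):
    """Mean after dropping `percent`% of the points from each end.

    The number of points dropped per end is truncated to a whole number
    (an assumption; other conventions interpolate instead).
    """
    ordered = sorted(values)
    n = len(ordered)
    k = int(n * percent / 100)  # points to drop from each end
    if 2 * k >= n:
        raise ValueError("trimming would remove every data point")
    kept = ordered[k:n - k]
    return sum(kept) / len(kept)

# Hypothetical data: 80 points, matching the count used in the text.
data = list(range(1, 81))

print(trimmed_mean(data, 5))   # drops 4 points per end, averages 72 points
print(trimmed_mean(data, 25))  # interquartile mean: averages the middle 40 points
```

Libraries such as SciPy offer a similar function, but this stdlib-only version keeps the arithmetic visible.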
Winsorized Mean: To calculate the Winsorized mean, we don't drop the top x% and bottom x% of the data points. Instead, we replace each of the top x% with the largest of the remaining values, replace each of the bottom x% with the smallest of the remaining values, and then calculate the mean of all the data points.
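A minimal sketch of the Winsorized mean under the same assumptions as for trimming: the name `winsorized_mean` is hypothetical, and the number of replaced points per end is truncated to a whole number.

```python
def winsorized_mean(values, percent):
    """Mean after replacing `percent`% of the points at each end
    with the nearest value that remains."""
    ordered = sorted(values)
    n = len(ordered)
    k = int(n * percent / 100)  # points to replace at each end
    if 2 * k >= n:
        raise ValueError("winsorizing would replace every data point")
    low = ordered[k]           # smallest of the remaining values
    high = ordered[n - k - 1]  # largest of the remaining values
    adjusted = [low] * k + ordered[k:n - k] + [high] * k
    return sum(adjusted) / n

# Hypothetical data with one extreme value.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]

print(sum(data) / len(data))      # plain mean, pulled upward by 100
print(winsorized_mean(data, 10))  # 1 becomes 2 and 100 becomes 9
```

Unlike the trimmed mean, the divisor stays at the full number of data points, because values are replaced rather than removed.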
Median: The median is the 50th percentile of the data. We sort the data points, set aside the bottom half and the top half, and take the value left in the middle. If the number of data points is even, two values are left in the middle and the median is their average. In a way, the median can be considered the extreme case of the trimmed mean, as the trimming approaches 50%.
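A minimal sketch of the median, written out by hand to show the odd and even cases. Python's standard `statistics.median` gives the same results; the example values are hypothetical.

```python
import statistics

def median(values):
    """Middle value of the sorted data; average of the two middle
    values when the number of data points is even."""
    if not values:
        raise ValueError("need at least one value")
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_case = [7, 1, 3]       # sorted: 1, 3, 7
even_case = [4, 1, 3, 2]   # sorted: 1, 2, 3, 4

print(median(odd_case), statistics.median(odd_case))
print(median(even_case), statistics.median(even_case))
```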
Let's look at an example where the data is evenly spaced, as if drawn from a uniform distribution (which is symmetric). The values are 1, 2, 3, 4, 5, 6, 7, 8, 9, 10. In this case, all the central values we compute are identical: each one equals 5.5.
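This claim can be checked with a short, self-contained sketch. The helper names are hypothetical, and the 10% trimming level is an assumption; with ten data points it drops one value per end, so it coincides with the modified mean.

```python
def mean(values):
    return sum(values) / len(values)

def trimmed_mean(values, k):
    """Mean after dropping the k smallest and k largest values."""
    ordered = sorted(values)
    return mean(ordered[k:len(ordered) - k])

def winsorized_mean(values, k):
    """Mean after replacing the k extreme values at each end
    with the nearest value that remains."""
    ordered = sorted(values)
    n = len(ordered)
    low, high = ordered[k], ordered[n - k - 1]
    return mean([low] * k + ordered[k:n - k] + [high] * k)

def median(values):
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

results = {
    "mean": mean(data),
    "modified mean": trimmed_mean(data, 1),       # drop one value per end
    "10% trimmed mean": trimmed_mean(data, 1),    # 10% of 10 points is 1
    "10% Winsorized mean": winsorized_mean(data, 1),
    "median": median(data),
}

for name, value in results.items():
    print(f"{name}: {value}")
```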
Next, let's look at a case in which the distribution is not symmetric: calculating the central value for total crime in high-income OECD countries. The data for this exercise was obtained from https://www.nationmaster.com/country-info/stats/Crime/Total-crimes and is shown in the table below.
As you can see from this data set, the measures of central location differ significantly. So which method should we use to report the central value?
If there are no outliers and the distribution is symmetric, we should use the mean, since it uses every data point and has convenient statistical properties. In the presence of outliers or a non-symmetric distribution, we could use one of the other methods discussed above.
Many people use the median. It is true that the median is not influenced by outliers, but we lose a lot of information in computing it: all we can report is that half the values are less than the median and half are greater.
The modified mean may work in cases where there can be at most one outlier at either end. The trimmed mean can handle a slightly larger number of outliers; typically we would use a 5-10% trimmed mean. However, this does not guarantee that all outliers have been removed from the data set, and a larger trim, such as 20%, loses more information. The Winsorized mean may work in cases where there are measurement errors, since it replaces the largest and smallest values with values that are more representative of the data set.
In addition to the methods above, we could also develop our own heuristics for computing the mean (such as eliminating all outliers from either end). However, we must be careful not to distort the meaning and interpretation of the central value by doing this. For example, if delivery times have large outliers, we cannot delete the outliers and report to management that the average delivery time is only 2 days when some customers are experiencing a 15-day wait. If we can be sure that the outliers are one-off events that will not repeat, it may make sense to remove them before computing the average. If not, it makes more sense to investigate the outliers and fix their root cause so that they don't happen in the future.
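As one possible heuristic, the sketch below flags outliers using the common 1.5 x IQR rule and reports them alongside the averages instead of deleting them silently. The rule, the function name `flag_outliers`, and the delivery times are all assumptions for illustration; the text above does not prescribe a specific heuristic.

```python
from statistics import mean, quantiles

def flag_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical delivery times in days; one customer waited 15 days.
delivery_days = [1, 2, 2, 2, 2, 3, 3, 3, 15, 2]

outliers = flag_outliers(delivery_days)
typical = [v for v in delivery_days if v not in outliers]

mean_all = mean(delivery_days)
mean_typical = mean(typical)

print(f"mean of all deliveries:     {mean_all:.2f} days")
print(f"mean excluding outliers:    {mean_typical:.2f} days")
print(f"outliers to investigate:    {outliers}")
```

Reporting all three lines together keeps the 15-day wait visible to management even when the lower average is quoted.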