Where is my variation coming from?

How would you quantify and report the amount of variation in your process? The most popular measures are the range and the standard deviation. Here are some questions to ponder: How do we calculate the range and standard deviation for a given data set? Which value should you report for the variation of the data set? Are the mean and standard deviation sufficient to describe your data set? What are some real-world examples of the use of range and standard deviation? This blog looks at these values to better understand how we should report the variation in our data set.

Introduction

What do we mean by variation? Most data you collect has variation: you will not get the same value each time you perform the measurement. This variation could be due to changes in the inputs, the way we run the process, differences between the people who run the process, the equipment we use to produce the output, or measurement error. Variation is a fact of life! So, if we collect data and get different values, how do we report the amount of variation in this data? There are several ways to measure variation, the most popular being the range and the standard deviation.

Theory

One measure of variation is called the range (R). The range is the difference between the maximum and minimum values of the data set. It is always a positive number (or zero). The larger the range, the greater the variation in the data. A range of zero implies no variation in the data set.

A second measure is called the standard deviation. The standard deviation is approximately the average amount by which the data points deviate from the mean value. The standard deviation is also a positive number (or zero). A standard deviation of zero implies that every data point is equal to the mean value. The larger the standard deviation, the larger the amount of variation in your data set.

To calculate the standard deviation, you cannot just compute the average of the deviations of each data point from the mean value. If you do this, you will find that the result is always zero, because the positive deviations (data larger than the mean) exactly cancel the negative deviations (data smaller than the mean). So, the average value of the deviation is of no use! Instead, we square the deviations, so that deviations far from the mean are penalized more than those close to the mean, calculate the average of these squared values (which are all positive or zero due to the squaring), and then take the square root. The formula for the standard deviation depends on whether we are working with population data or sample data.
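The steps above can be traced with a few lines of Python. This is only a minimal sketch: the five values are a small illustrative data set, the variable names are our own, and the averaging divides by the number of data points (the population form of the calculation).

```python
import math

data = [1, 3, 5, 7, 9]  # small illustrative data set, treated as a full population

# Range: difference between the largest and smallest value.
data_range = max(data) - min(data)

# Deviations of each point from the mean; their average is always zero.
mean = sum(data) / len(data)
deviations = [x - mean for x in data]
mean_deviation = sum(deviations) / len(deviations)

# Square the deviations, average them, then take the square root.
squared_deviations = [d ** 2 for d in deviations]
mean_squared_deviation = sum(squared_deviations) / len(squared_deviations)
std_dev = math.sqrt(mean_squared_deviation)

print(f"range = {data_range}")                      # range = 8
print(f"average deviation = {mean_deviation}")      # average deviation = 0.0
print(f"standard deviation = {std_dev:.3f}")        # standard deviation = 2.828
```

The zero in the second line of output is the reason the deviations have to be squared before averaging.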

Population data is all the data in the universe, while sample data is a subset of the population data. The formula for the standard deviation differs slightly between the two. For population data, the standard deviation (sigma) uses the number of data points n in the denominator, while for sample data, the standard deviation (s) uses (n-1). The reason for this difference is that with sample data we do not know the true mean and have to estimate it from the same data. We lose one degree of freedom in calculating the mean (xbar), so we are left with only (n-1) degrees of freedom to calculate the standard deviation. In addition, it can be shown that using (n-1) makes the sample variance (the square of s) an unbiased estimate of the population variance, which means that, averaged over many samples, it equals the true population variance. The sample standard deviation itself still tends to slightly underestimate the population value, but this bias shrinks as you collect more data.
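The effect of dividing by n versus (n-1) can be checked exactly on a tiny example. The sketch below assumes a made-up population of five values and lists every possible sample of size 2 drawn with replacement, so no random numbers are involved. The averaging works out exactly for the variance (the squared standard deviation), while the standard deviation itself still comes out a little low.

```python
import itertools
import math
import statistics

population = [1, 3, 5, 7, 9]  # made-up values, treated as the entire population
true_variance = statistics.pvariance(population)  # divides by n

# Every possible ordered sample of size 2, drawn with replacement (25 samples).
samples = list(itertools.product(population, repeat=2))

def average(values):
    """Plain arithmetic mean of a list of numbers."""
    return sum(values) / len(values)

# Average the estimates over all samples, once per choice of denominator.
avg_variance_n = average([statistics.pvariance(s) for s in samples])         # divide by n
avg_variance_n_minus_1 = average([statistics.variance(s) for s in samples])  # divide by n-1
avg_stdev_n_minus_1 = average([statistics.stdev(s) for s in samples])

print(f"true variance            = {true_variance}")
print(f"average variance (n)     = {avg_variance_n}")
print(f"average variance (n-1)   = {avg_variance_n_minus_1}")
print(f"true standard deviation  = {math.sqrt(true_variance):.3f}")
print(f"average s (n-1)          = {avg_stdev_n_minus_1:.3f}")
```

With n in the denominator the average estimate is only half the true variance for samples this small, whereas with (n-1) the average matches the true variance.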

[Figure: Stdev Formula]
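Written out from the description above, the two formulas take the following form. The symbols are the usual conventions and are assumed here rather than taken from the original figure: x_i for the individual data points, mu for the population mean, xbar for the sample mean, and n for the number of data points.

```latex
\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(x_i - \mu\right)^2}
\qquad\qquad
s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}
```

The only differences are the denominator and the mean that the deviations are measured from.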

Application

Note that the range only looks at the minimum and maximum values and ignores all values in between. Hence, the range is not a very good indicator of the amount of variation in the data. In most cases, we would use the standard deviation to report the amount of variation in the data. The square of the standard deviation is called the variance and is used internally by statistical analysis applications for performing calculations. However, when reporting results we prefer the standard deviation, as it has the same units as the raw data.

Note also that both the range and the standard deviation are sensitive to outliers. If you have outliers in the data, then ideally we should not use the range or the standard deviation; instead, we should use another measure called the Interquartile Range (IQR). The IQR is calculated as the difference between the third quartile (Q3) and the first quartile (Q1). How do we calculate these quartiles? We put the data in increasing order and pick the value such that 25% of the data points are less than this value. This value is called the first quartile (Q1). The third quartile (Q3) is the value such that 75% of the data points are less than this value (quartiles are covered in more detail in a different blog).
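To see how differently the three measures react to an outlier, here is a small sketch in Python. The data values and the helper name spread_summary are made up for illustration. Quartile conventions differ between tools, so the Q1 and Q3 values from Python's statistics.quantiles (default method) may not match Excel or Minitab exactly on other data sets.

```python
import statistics

def spread_summary(values):
    """Return (range, sample standard deviation, IQR) for a list of numbers."""
    q1, _median, q3 = statistics.quantiles(values, n=4)  # three cut points: Q1, Q2, Q3
    return max(values) - min(values), statistics.stdev(values), q3 - q1

clean = [1, 3, 5, 7, 9, 11, 13]          # made-up data without an outlier
with_outlier = [1, 3, 5, 7, 9, 11, 100]  # same data, largest value replaced by an outlier

for name, values in (("clean", clean), ("with outlier", with_outlier)):
    value_range, std_dev, iqr = spread_summary(values)
    print(f"{name}: range = {value_range}, stdev = {std_dev:.2f}, IQR = {iqr}")
```

In this example the range and the standard deviation both jump when the outlier is introduced, while the IQR stays the same because Q1 and Q3 do not depend on the most extreme value.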

For example, both the following data sets have the same range but the amount of variation is different in each case:

Data Set 1: 1, 3, 5, 7, 9
Data Set 2: 1, 9, 9, 9, 9

The standard deviation, on the other hand, uses all the numbers to calculate the variation, and it reports less variation in Data Set 1 (sample Stdev = 3.16) than in Data Set 2 (sample Stdev = 3.58). As an approximation, you can think of the standard deviation as the average deviation of the data points from the mean value.
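These numbers are easy to reproduce. A minimal check in Python, using the sample (n-1) form of the standard deviation from the standard library:

```python
import statistics

data_set_1 = [1, 3, 5, 7, 9]
data_set_2 = [1, 9, 9, 9, 9]

for name, values in (("Data Set 1", data_set_1), ("Data Set 2", data_set_2)):
    value_range = max(values) - min(values)
    sample_stdev = statistics.stdev(values)  # sample standard deviation, divides by n-1
    print(f"{name}: range = {value_range}, stdev = {sample_stdev:.2f}")

# Data Set 1: range = 8, stdev = 3.16
# Data Set 2: range = 8, stdev = 3.58
```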

Both range and standard deviation can be calculated for any numeric data, continuous or discrete. The formulas do not change with, or depend on, the actual distribution of the data. For some distributions, like the normal distribution, the mean and standard deviation together describe the distribution completely and help us predict the probability of occurrence of different values. For other distributions, however, these parameters may not mean much, or may not even be the ones required to describe the distribution of the data. For discrete data, we can calculate the standard deviation but usually don't report it; statistical analysis software still makes these computations internally (for example, to calculate the control limits on a discrete control chart).
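As an illustration of the normal distribution case, the sketch below assumes a hypothetical process that is normally distributed with a mean of 50 and a standard deviation of 2 (both numbers are made up), and uses Python's statistics.NormalDist to estimate how often a value falls within 1, 2, or 3 standard deviations of the mean.

```python
from statistics import NormalDist

# Hypothetical process: normally distributed, mean 50, standard deviation 2.
process = NormalDist(mu=50.0, sigma=2.0)

def prob_between(dist, low, high):
    """Probability that a value from dist falls between low and high."""
    return dist.cdf(high) - dist.cdf(low)

for k in (1, 2, 3):
    low = process.mean - k * process.stdev
    high = process.mean + k * process.stdev
    p = prob_between(process, low, high)
    print(f"within {k} standard deviation(s) ({low} to {high}): {p:.4f}")

# within 1 standard deviation(s) (48.0 to 52.0): 0.6827
# within 2 standard deviation(s) (46.0 to 54.0): 0.9545
# within 3 standard deviation(s) (44.0 to 56.0): 0.9973
```

These percentages only apply if the data really is normally distributed; for other distributions the same mean and standard deviation can correspond to very different probabilities.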

Some examples of the use of range in the real world are the temperature range for the day on a weather report and the min/max levels of water in a reservoir. Examples of the use of standard deviation are the standard deviation of a process parameter and the standard deviation of a share price in the stock market, where the risk of an investment is often measured using the standard deviation.

Software

Sigma Magic software: Calculating the range and standard deviation is relatively straightforward. Add a new Basic Statistics template to Excel by clicking on Stat > Basic Statistics. Copy and paste the data for which you want to calculate the range or standard deviation into the input area and then click on Compute Outputs.
Excel: You can also calculate the range in Excel using the formula =MAX(…) - MIN(…), the sample standard deviation using the formula =STDEV(…), and the population standard deviation using the formula =STDEVP(…). Newer versions of Excel also provide =STDEV.S(…) and =STDEV.P(…) for the same two calculations.
Minitab: If you use the Minitab software, copy and paste the data into Minitab and then click on Stat > Basic Statistics > Display Descriptive Statistics. Select the data column and click on OK. This will print out the range and sample standard deviation for the sample values.

Exercise

Calculate the range, standard deviation, and interquartile range for the data set given in the following Excel file: Basic Stats 1. The analysis results will include the range and the sample standard deviation. The range for this data set is 20 and the standard deviation is 5.008.
