In this module, we will cover some basics of the normal distribution. We will discuss what a normal distribution is and how to check whether data is normally distributed. Before we begin, note that we are working with continuous data. Refer to an earlier article on the difference between continuous and discrete data points. Strictly speaking, the normal distribution applies only to continuous data, though in some cases it can be used to approximate discrete data.
What is a distribution?
A distribution graph shows how frequently each value occurs in the data set. An example distribution (which, by the way, is not normal) is shown below. The distribution shows which data values are more likely to occur. For example, in the figure below, the data values are all positive, and the most likely values are close to 1 (highest frequency). As the data values approach 0 or become very large, the frequency of occurrence drops. The area under the curve over a range of values gives the probability that a value falls in that range. For example, to find the probability of getting data points greater than 2, we would calculate the area under the blue curve to the right of 2.
What is a normal distribution?
A normal distribution is a special type of distribution that arises when we are working with certain types of data. It is also referred to as the Gaussian distribution. A normal distribution is symmetric and centered at the mean value, and its width depends on the standard deviation. For the normal distribution, the most frequently occurring values are close to the mean of the data set. An example normal distribution with a mean of 0 and a standard deviation of 1 is shown in the figure below. In this figure, the value 0 has the highest frequency of occurrence, and the frequency drops as the values move farther away from 0. The total area under the curve is 1, and the area between any two limits gives the probability that a value falls between them. For example, the shaded area in the figure below between -2 and +2 indicates the probability of getting values between -2 and +2. For this example, the area between these two limits is equal to 0.954, that is, there is a 95.4% chance that the data values lie between -2 and +2. These areas have no simple closed-form expression, so they are not practical to calculate by hand. We either use a computer to calculate them or look them up in a table of normal probabilities.
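As a quick check of the 0.954 figure, the area between two limits can be computed as the difference of the cumulative distribution function at those limits. The sketch below uses Python's standard-library statistics.NormalDist; the variable names are illustrative and not part of the original article.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0.0, sigma=1.0)

# The area under the curve between two limits is the difference of the
# cumulative distribution function (CDF) evaluated at those limits.
lower, upper = -2.0, 2.0
prob_within = standard_normal.cdf(upper) - standard_normal.cdf(lower)

print(f"P({lower} < X < {upper}) = {prob_within:.3f}")
```

The printed probability should agree with the 0.954 quoted above to three decimal places.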
The following figure shows a normal distribution with a mean of 0 and a standard deviation of 0.1. Note that the most frequently occurring value is still close to 0, while the distribution is much narrower because of the smaller standard deviation (and hence the smaller variation in the data values). Most of the values are concentrated close to 0.
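To put a number on "much narrower", one option is to compare how much probability each distribution places within the same small interval around the mean. This is a minimal sketch using statistics.NormalDist; the interval half-width of 0.2 is an arbitrary choice made for illustration.

```python
from statistics import NormalDist

wide = NormalDist(mu=0.0, sigma=1.0)     # standard deviation of 1
narrow = NormalDist(mu=0.0, sigma=0.1)   # standard deviation of 0.1

# Probability of a value falling within 0.2 of the mean under each distribution.
half_width = 0.2
prob_wide = wide.cdf(half_width) - wide.cdf(-half_width)
prob_narrow = narrow.cdf(half_width) - narrow.cdf(-half_width)

print(f"sigma = 1.0: P(-0.2 < X < 0.2) = {prob_wide:.3f}")
print(f"sigma = 0.1: P(-0.2 < X < 0.2) = {prob_narrow:.3f}")

# The peak of the curve is also taller when the distribution is narrower,
# because the total area under the curve must still equal 1.
print(f"peak height, sigma = 1.0: {wide.pdf(0.0):.3f}")
print(f"peak height, sigma = 0.1: {narrow.pdf(0.0):.3f}")
```

With a standard deviation of 0.1, the interval from -0.2 to +0.2 spans two standard deviations on each side of the mean, so it holds the same share of values that the interval from -2 to +2 holds when the standard deviation is 1.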
A normal distribution can be completely specified by two values, its mean (mu) and the standard deviation (sigma). Mathematically, the frequency of occurrence of a normal distribution can be represented as follows:
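The standard probability density function of the normal distribution, written in LaTeX notation with the mean mu and standard deviation sigma named above, is:

```latex
f(x) = \frac{1}{\sigma \sqrt{2\pi}} \exp\!\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)
```

Here f(x) is the height of the curve at the value x. The curve peaks at x = mu, and a larger sigma spreads the same total area of 1 over a wider range of values.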
Why is the normal distribution important in statistics?
The normal distribution has some special properties that some statistical tests rely on, for example tests that compare the mean values of two data sets. If the data is not normally distributed, the statistics get a bit more difficult to analyze, and the statistical power of the available tests is also a bit lower. If the data is normally distributed, many more statistical tests are available to analyze the data set. Hence, we should always check whether the data is normally distributed and, if it is, use the more powerful tests to analyze our data set.
How to check if the data is normally distributed?
We can plot a histogram of the data and superimpose the normal curve on it to visually check whether the data follows the normal distribution curve. The disadvantage of this approach is that the histogram may change with the bin widths, and different people may interpret similar graphs differently, especially when there is some departure from normality.
The statistical way to check whether the data is normally distributed is to perform the Anderson-Darling test of normality. In this approach, the data points are used to compute a test statistic (A) that measures the distance between the expected (normal) distribution and the actual distribution of the data. If this statistic is greater than a certain critical value, the normality of the data is rejected. The test statistic, A, can also be converted into a P value. If the P value is less than alpha (default 0.05), we reject normality and the data set is considered not to be normally distributed; otherwise, we have no evidence against normality. Ideally, we need at least 20-30 data points before we can check if the data is normally distributed.
Let’s look at an example of checking whether data is normally distributed. The data points show the time to drive to work in minutes for the last month: 30, 42, 28, 32, 25, 29, 27, 31, 38, 36, 31, 29, 27, 26, and 29. We want to check if the data is normally distributed. A histogram of the data points is shown below, superimposed with a blue normal curve. Do you think the data is normally distributed? The histogram follows the blue curve to some extent, but not closely. Any conclusions we draw here are purely qualitative rather than quantitative.
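The histogram comparison can be roughly reproduced in text form by counting the observations in each bin and comparing them with the counts a fitted normal curve would predict. The sketch below is only an illustration: the helper histogram_vs_normal, the bin start of 25, and the two bin widths are assumptions made here, not settings taken from the original figure.

```python
from statistics import NormalDist, mean, stdev

# Drive times in minutes, copied from the example above.
drive_times = [30, 42, 28, 32, 25, 29, 27, 31, 38, 36, 31, 29, 27, 26, 29]

# Fit a normal curve using the sample mean and sample standard deviation.
fitted = NormalDist(mu=mean(drive_times), sigma=stdev(drive_times))

def histogram_vs_normal(data, dist, start, width, n_bins):
    """Return (bin_low, bin_high, observed, expected) for each bin."""
    rows = []
    for k in range(n_bins):
        low = start + k * width
        high = low + width
        observed = sum(1 for x in data if low <= x < high)
        # Expected count = sample size times the normal area over the bin.
        expected = len(data) * (dist.cdf(high) - dist.cdf(low))
        rows.append((low, high, observed, expected))
    return rows

# Two arbitrary bin widths, to show how the picture changes with binning.
for width, n_bins in ((3, 6), (6, 3)):
    print(f"bin width = {width}")
    for low, high, observed, expected in histogram_vs_normal(
        drive_times, fitted, 25, width, n_bins
    ):
        bar = "#" * observed
        print(f"  [{low}, {high}): observed {observed:2d}  expected {expected:4.1f}  {bar}")
```

Running it with both bin widths shows how the same 15 observations can look more or less bell-shaped depending on the binning, which is why the visual check remains qualitative.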
Let’s perform the Anderson-Darling test. The results are shown below. In the plot, the blue data points lie fairly close to the red normal distribution curve, which on its own might suggest that the data is approximately normal. However, the P value is less than the default value of alpha (0.05), so we reject normality and conclude that the data set is not normally distributed. This analysis can be performed in the Sigma Magic software using the template Normality Analysis that can be found under the “Stats” templates.
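For readers without access to that software, the Anderson-Darling statistic can also be computed directly. The following is a minimal sketch using only the Python standard library. The function name anderson_darling_normal is hypothetical, and the small-sample adjustment and P value approximation follow the commonly cited D'Agostino and Stephens formulas, which is an assumption here: Sigma Magic may use a different method, so the numbers may not match its output exactly.

```python
import math
from statistics import NormalDist, mean, stdev

# Drive times in minutes, copied from the example above.
drive_times = [30, 42, 28, 32, 25, 29, 27, 31, 38, 36, 31, 29, 27, 26, 29]

def anderson_darling_normal(data):
    """Return (A2, adjusted A2, approximate P value) for a normality test."""
    n = len(data)
    # Normal distribution fitted with the sample mean and standard deviation.
    fitted = NormalDist(mu=mean(data), sigma=stdev(data))
    cdf = [fitted.cdf(x) for x in sorted(data)]

    # A2 = -n - (1/n) * sum_{i=1..n} (2i - 1) * [ln F(x_i) + ln(1 - F(x_{n+1-i}))]
    total = sum(
        (2 * i - 1) * (math.log(cdf[i - 1]) + math.log(1.0 - cdf[n - i]))
        for i in range(1, n + 1)
    )
    a2 = -n - total / n

    # Small-sample adjustment for the case where the mean and standard
    # deviation are estimated from the data (assumed, see lead-in).
    a2_adj = a2 * (1.0 + 0.75 / n + 2.25 / n**2)

    # Piecewise approximation of the P value (assumed, see lead-in).
    if a2_adj >= 0.6:
        p = math.exp(1.2937 - 5.709 * a2_adj + 0.0186 * a2_adj**2)
    elif a2_adj >= 0.34:
        p = math.exp(0.9177 - 4.279 * a2_adj - 1.38 * a2_adj**2)
    elif a2_adj >= 0.2:
        p = 1.0 - math.exp(-8.318 + 42.796 * a2_adj - 59.938 * a2_adj**2)
    else:
        p = 1.0 - math.exp(-13.436 + 101.14 * a2_adj - 223.73 * a2_adj**2)
    return a2, a2_adj, p

a2, a2_adj, p_value = anderson_darling_normal(drive_times)
alpha = 0.05
print(f"A2 = {a2:.3f}, adjusted A2 = {a2_adj:.3f}, P value = {p_value:.3f}")
print("reject normality" if p_value < alpha else "cannot reject normality")
```

When the P value lands near alpha, the decision can be sensitive to rounding and to the approximation used, so a borderline result is best read alongside the plot rather than on its own.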