A probability plot is a graphical method for determining whether the data follows a given distribution. The probability density function can take a wide variety of shapes depending on the type of distribution. For a normal distribution, for example, it takes the familiar bell shape. We can also plot the cumulative distribution function, which gives the cumulative probability of all values less than or equal to a given number. The cumulative distribution function is a monotonically increasing function that goes from 0 to 100%.
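To make the two curves concrete, here is a minimal sketch that tabulates the density and the cumulative distribution of a standard normal distribution using Python's standard library. The choice of a standard normal and the grid from -3 to 3 are assumptions made only for illustration.

```python
from statistics import NormalDist

# Standard normal distribution (mean 0, standard deviation 1); an illustrative choice.
dist = NormalDist(mu=0.0, sigma=1.0)

# Evaluation grid from -3 to 3 in steps of 0.5.
xs = [-3 + 0.5 * i for i in range(13)]

pdf_values = [dist.pdf(x) for x in xs]  # bell-shaped density, highest at the mean
cdf_values = [dist.cdf(x) for x in xs]  # P(X <= x), rising from near 0 to near 1

for x, p, c in zip(xs, pdf_values, cdf_values):
    print(f"x={x:+.1f}  pdf={p:.4f}  cdf={c:.4f}")
```

The density column rises and falls around the mean, while the cumulative column only ever increases.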
To determine whether the data follows a given distribution, it would be easier if we could transform the graph so that the data points fall on a straight line whenever the distribution fits. Any deviation from the assumed distribution would then show up as data points falling away from the straight line, which makes it easier to detect whether the data follows a specific distribution. There are three ways in which we can create a straight-line probability plot.
Plot on a Probability Paper
The first method is to plot the distribution on probability paper. For a normal distribution, for example, we have normal probability paper. Here, the spacing between the lines on the Y-axis is not uniform. The spacing is designed so that if you plot the cumulative distribution of data that follows the normal distribution, the points fall on a straight line.
From this plot, we can conclude that the data points roughly follow a normal distribution.
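The non-uniform spacing of normal probability paper can be imitated numerically: the height at which a cumulative probability p is drawn corresponds to the inverse normal CDF of p. The sketch below assumes a hypothetical sample of 200 normal draws and the plotting-position convention (i - 0.5) / n, which is one common choice among several, and then measures how close the transformed points are to a straight line.

```python
import random
from statistics import NormalDist, mean

random.seed(0)
# Hypothetical sample: 200 draws from a normal distribution (mean 50, sd 5).
data = sorted(random.gauss(50, 5) for _ in range(200))
n = len(data)

# Empirical cumulative probability of each sorted value.
# (i - 0.5) / n is an assumed plotting-position convention; others exist.
cum_prob = [(i - 0.5) / n for i in range(1, n + 1)]

# Probability paper stretches the Y-axis by the inverse normal CDF,
# so probability p is drawn at height inv_cdf(p).
paper_y = [NormalDist().inv_cdf(p) for p in cum_prob]

def pearson_r(xs, ys):
    """Pearson correlation, used here as a rough straightness measure."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

r = pearson_r(data, paper_y)
print(f"correlation on the probability-paper scale: {r:.4f}")
```

A correlation close to 1 means the points lie nearly on a straight line, which is what we would expect for normally distributed data.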
Q-Q Probability Plot
One common way of creating a probability plot is the Quantile-Quantile (Q-Q) plot. The idea is that if the data follows a certain distribution, then the quantiles of the data should match the quantiles of that distribution. If the quantiles of the distribution are plotted on the X-axis and the quantiles of the data on the Y-axis, then the Q-Q plot should lie close to the 45-degree line when the data follows the given distribution.
Let’s first clarify what we mean by a quantile. Quantiles are the boundary values that divide the data into equal-sized subsets. For example, if we divide the data into 2 halves, half the values are less than the quantile and half are greater. This quantile is called the median (or the 2-quantile). If we divide the data into 4 equal buckets, 25% of the data is less than Q1, 25% lies between Q1 and Q2, 25% lies between Q2 and Q3, and 25% is greater than Q3. Q1, Q2, and Q3 are the quartiles (or the 4-quantiles). If we divide the data into 100 equal buckets, the quantile values are called percentiles, and so on. In general, the q-quantiles break the data up into q equal-sized sets.
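The definitions above can be checked on a small data set. This sketch uses the `quantiles` function from Python's `statistics` module with its default interpolation method. The data set (the integers 1 to 100) is hypothetical and chosen so that the buckets are easy to count.

```python
from statistics import median, quantiles

# Hypothetical data: the integers 1 through 100.
data = list(range(1, 101))

# n=2 yields one cut point (the median), n=4 yields three quartiles,
# and n=100 yields 99 percentiles.
two_quantile = quantiles(data, n=2)
quartiles = quantiles(data, n=4)
percentiles = quantiles(data, n=100)

q1, q2, q3 = quartiles

# Count how many values fall into each of the four buckets.
buckets = [
    sum(1 for x in data if x < q1),
    sum(1 for x in data if q1 < x < q2),
    sum(1 for x in data if q2 < x < q3),
    sum(1 for x in data if x > q3),
]

print("median:", median(data), "2-quantile:", two_quantile)
print("quartiles:", quartiles)
print("values per bucket:", buckets)
```

Each bucket holds a quarter of the values, and the single 2-quantile coincides with the median.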
A Q-Q plot can be used to get an idea of the location (central value), scale (spread), and skewness (asymmetry) of the distribution. It is a more powerful tool than a histogram alone. It is always a good idea to plot the data and understand the nature of the distribution rather than looking only at the P-values of a goodness-of-fit test. By looking at the plot, we can conclude the following:
If the Q-Q plot is flatter than the 45-degree line (Y = X), then the spread of the distribution plotted on the horizontal axis is greater than the spread of the distribution plotted on the vertical axis.
If the Q-Q plot is arced or S-shaped, then one of the distributions is more skewed than the other (i.e., one of its tails is longer than the other), or one has heavier tails than the other.
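The Q-Q construction and the "flatter than 45 degrees" reading can be sketched as follows. The helper names `qq_points` and `fit_line`, the simulated samples, and the (i - 0.5) / n plotting positions are all assumptions made for illustration. A fitted slope is used as a rough numerical stand-in for judging the plot by eye.

```python
import random
from statistics import NormalDist, mean

def qq_points(sample, dist):
    """Return (theoretical, observed) quantile pairs for a Q-Q plot."""
    ordered = sorted(sample)
    n = len(ordered)
    # (i - 0.5) / n is an assumed plotting-position convention.
    theoretical = [dist.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return theoretical, ordered

def fit_line(xs, ys):
    """Least-squares slope and intercept of ys against xs."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx

random.seed(1)
reference = NormalDist(0, 1)  # distribution on the X-axis

same = [random.gauss(0, 1) for _ in range(500)]      # same spread as the reference
narrow = [random.gauss(0, 0.5) for _ in range(500)]  # half the spread

slope_same, intercept_same = fit_line(*qq_points(same, reference))
slope_narrow, intercept_narrow = fit_line(*qq_points(narrow, reference))

print(f"matching data: slope={slope_same:.2f}, intercept={intercept_same:.2f}")
print(f"narrower data: slope={slope_narrow:.2f}, intercept={intercept_narrow:.2f}")
```

When the data matches the reference, the slope is close to 1 and the intercept close to 0. When the data has a smaller spread than the reference on the horizontal axis, the slope drops below 1, so the plot is flatter than the 45-degree line.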
P-P Probability Plot
A P-P probability plot plots the two cumulative distribution functions against each other. Both axes are probability values and thus range from 0 to 1.
We can compare the data to a 45° comparison line from (0,0) to (1,1). The two distributions are equal if and only if the plot falls on this line; any deviation indicates a difference between the distributions.
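A P-P plot can be sketched in the same spirit by pairing the theoretical cumulative probability of each sorted data point with its empirical cumulative probability. The helper names `pp_points` and `max_gap`, the simulated samples, and the (i - 0.5) / n convention for the empirical probabilities are assumptions for illustration only.

```python
import random
from statistics import NormalDist

def pp_points(sample, dist):
    """Return (theoretical CDF, empirical CDF) pairs for a P-P plot."""
    ordered = sorted(sample)
    n = len(ordered)
    theoretical = [dist.cdf(x) for x in ordered]
    # Assumed convention for the empirical cumulative probability.
    empirical = [(i - 0.5) / n for i in range(1, n + 1)]
    return theoretical, empirical

def max_gap(theoretical, empirical):
    """Largest vertical distance from the 45-degree line."""
    return max(abs(t - e) for t, e in zip(theoretical, empirical))

random.seed(2)
reference = NormalDist(0, 1)

same = [random.gauss(0, 1) for _ in range(500)]     # matches the reference
shifted = [random.gauss(1, 1) for _ in range(500)]  # mean shifted by 1

gap_same = max_gap(*pp_points(same, reference))
gap_shifted = max_gap(*pp_points(shifted, reference))

print(f"matching data: max gap from the 45-degree line = {gap_same:.3f}")
print(f"shifted data:  max gap from the 45-degree line = {gap_shifted:.3f}")
```

For the matching sample the points stay close to the line from (0,0) to (1,1), while the shifted sample bows clearly away from it.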
If the number of data points is small, the P-P plot and the Q-Q plot may look very similar. For large data sets, however, differences can appear: fat tails are more distinctive in the Q-Q plot, whereas bimodality is more distinctive in the P-P plot.