We encounter p-values when we perform hypothesis testing. The reason for performing hypothesis testing is to infer information about the population based on sample statistics. We usually
do not work with population data, as it is too difficult, time consuming or expensive to collect and work with that data. Hence, we use carefully selected sample data from the population and infer
parameters of interest about the population. When we perform any hypothesis test, we generally have two hypotheses: Ho (also called the null hypothesis) and Ha (also called the
alternative hypothesis). These two hypotheses are the opposite of each other and only one of them can be true. For example, the null hypothesis could be that a person is innocent and the alternative
could be that the person is guilty. Another example could be that the null hypothesis states there is no difference between the drug and a placebo, and the alternative hypothesis states that
there is a difference between the two. By analysing the sample data, we need to determine whether we will pick the null hypothesis or the alternative hypothesis. Unfortunately, we cannot be 100%
sure in determining whether to pick Ho or Ha, since we only have limited sample data and our conclusions can be incorrect. This is where probability values come to our assistance.
The p-value, or probability value, helps us determine the significance of the results of a hypothesis test. Hypothesis tests give us a p-value through which we determine the strength of the evidence
that we have collected. Depending on how much error we are willing to tolerate in making our conclusions, we can then use the p-value to determine whether to pick Ho or Ha.
How to calculate p-values
Let us say we want to perform the following hypothesis test:
In order to determine the p-value, we need to do a thought experiment. Let us assume that Ho is true in the real world. This is an assumption; a priori, we do not know whether it is true
in reality. If we take it as a given that Ho is true, then we know that the mean of the population is 10. We collect some sample data from the population, such as 11.1, 13.1, 12.5, 13.3, 11.7.
For the illustration that follows, we work with a sample average of 13 and a standard deviation of 1.5. The question is whether this data supports Ho or Ha.
We can then calculate how probable data like ours would be if it came from this population. The probability distribution plot is shown in the figure below.
From this figure, we can see that most of the data values are close to 10, and as we go further away from the mean value, the probability of finding a value that belongs to this distribution drops.
If someone asks whether it is likely that a value of 13 came from this distribution, the answer would be that it is possible, but the chance of seeing it is small. We may get a value
of 13, since the normal distribution can theoretically generate values all the way from -infinity to +infinity, but it is unlikely. If your sample data point were 100, it would be very unlikely
that it came from this population. The farther your sample average is from the mean value, the more likely it is that it did not come from this population. We can compare the sample data to the
population in two ways:
Sample Statistic
We can get an idea of how far a sample average (xbar) is from the mean of the population (mu) by computing its distance from the mean in units of the standard deviation (sigma):
Z = (xbar - mu) / sigma
If this value of Z is greater than some critical value Zcrit (say 2 for example), then we can conclude that it is highly unlikely that the observed sample data came from the same population.
Area Under Curve
Another way of looking at this is to calculate the area to the right of 13, which represents the probability of getting any value of 13 or greater. For this figure, this probability (shaded area)
is 0.023 (2.3%). This shaded area is called the p-value. If this area is small (say less than 5%), then we could also conclude that it is unlikely that the sample data came from a population
for which the null hypothesis is true.
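Both comparisons can be sketched in a few lines of Python. This is a minimal illustration rather than a definitive implementation. It assumes the distribution under Ho is normal with a mean of 10 and a standard deviation of 1.5, that the observed value is 13, and that Z is computed as (value - mean) / standard deviation. The variable names are ours.

```python
from statistics import NormalDist

mu_0 = 10.0      # mean of the population if Ho is true
sigma = 1.5      # standard deviation assumed for that distribution
observed = 13.0  # value we want to compare against Ho

# Sample statistic: distance from the mean in units of standard deviation
z = (observed - mu_0) / sigma

# Area under the curve to the right of the observed value (one-sided p-value)
p_value = 1.0 - NormalDist(mu=mu_0, sigma=sigma).cdf(observed)

print(f"Z = {z:.2f}, p-value = {p_value:.3f}")  # prints: Z = 2.00, p-value = 0.023
```

The two views agree: a Z of 2 corresponds to a right-tail area of about 2.3%.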
Note that the calculation of the sample statistic (Z) and the p-value (p) depends on the distribution. In our example above, we used the normal distribution since we said that our data
follows the normal distribution. If the population standard deviation is not known (which is usually the case), we would use the t distribution instead of the Z distribution due to
the uncertainty involved in the value of the standard deviation. If you were comparing the ratio of two variances, then that ratio follows an F distribution, and we would
need to use that distribution to determine the critical value and the p-values. Similarly, for discrete or count data, we would use an appropriate distribution such as the binomial, Poisson
or chi-square.
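To illustrate the discrete case, here is a hedged sketch of an exact one-sided p-value from the binomial distribution. The scenario is hypothetical and not taken from the example above: Ho says a coin is fair, and we observe 60 heads in 100 tosses. The function name binomial_upper_tail is ours.

```python
from math import comb

def binomial_upper_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p): a one-sided exact p-value."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

# Hypothetical data: 60 heads in 100 tosses, Ho: probability of heads = 0.5
p_value = binomial_upper_tail(60, 100, 0.5)
print(f"p-value = {p_value:.4f}")
```

The logic is the same as for the normal curve: we add up the probability of every outcome at least as extreme as the one observed, assuming Ho is true.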
Critical values vary with the distribution and with the problem you are trying to solve. The p-value, by contrast, has the
same interpretation across all distributions, which makes our life easier. Hence, in most cases, we use the p-value to draw conclusions from a hypothesis test.
We do not need to worry about calculating the p-values since most commercial software will calculate and report these values for us. However, we need to be able to correctly
interpret these values.
How to interpret p-values?
The p-value tells us how often we would expect to see a test statistic as extreme as, or more extreme than, the one that we have calculated using our statistical test if the null hypothesis were
true. Since the p-value is a probability, it lies between 0 and 1. As the p-value becomes smaller, the data we have observed become less compatible with a population for which
Ho is true. The p-value therefore helps us determine whether to pick Ho or Ha.
If the p-value is large, then the observed sample data is consistent with Ho. We still cannot conclude definitively that Ho is true. It is
still possible that we are seeing this by chance alone. Hence, we conclude only that there is no reason to believe that Ho is false; that is, we fail to reject Ho (having no other contradicting
evidence).
If the p-value is small, then data like ours would rarely be seen if it came from a population for which Ho is true. It is still possible that we get this value by chance,
but it is highly unlikely. Hence, we can reject Ho as being unlikely and select Ha as our hypothesis.
In order to determine whether a p-value is high or low, we usually compare it with alpha (usually 0.05), the Type I error rate that we are willing to accept. If the p-value is lower than 0.05, we
consider it low and sufficient to reject the null hypothesis and accept the alternative hypothesis. We say that the result is statistically significant. That is, we are willing to make
a 5% error in our conclusions. In other words, by making our comparison with 0.05, if Ho is true in reality, there is a 5% chance that we incorrectly conclude that Ho is false.
The most common threshold used to determine significance is 0.05. However, this could be 0.01 or even 0.001 in some cases, depending on the problem you are working on and
the degree of risk you are willing to tolerate in your decision making. For example, when a large financial impact is involved and we want to be doubly sure of our recommendations, we
may choose to use a smaller value of alpha.
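The comparison with alpha can be written as a small helper. This is only a sketch. The function name decide and the strings it returns are ours, and the p-value of 0.023 is the one from the normal-distribution example.

```python
def decide(p_value: float, alpha: float = 0.05) -> str:
    """Compare a p-value with the chosen significance level alpha."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must lie between 0 and 1")
    if p_value < alpha:
        return "reject Ho"
    return "fail to reject Ho"

print(decide(0.023))              # prints: reject Ho
print(decide(0.023, alpha=0.01))  # prints: fail to reject Ho
```

The same p-value leads to different decisions under different thresholds, which is why alpha should be chosen before looking at the data.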
Issues with p-values
Even though p-values are widely used in the literature to report results, there are several issues that we need to be aware of; otherwise, we may use them incorrectly.
A statistically insignificant difference (p-value is large) does not mean that there is no difference between the groups in the population. It is possible that the sample size is
not large enough to detect this difference. A statistically insignificant outcome should not be interpreted as evidence that no difference exists.
Similarly, if the sample size is very large, small differences in the populations can result in statistical significance. You will end up getting a p-value that is less than alpha.
This only points to the fact that there is a difference between groups, but that difference may not be practically important. Hence, in addition to statistical significance, we
also need to consider practical importance.
We cannot look at the relative magnitude of the p-values to judge the statistical precision of an estimate or the quality of a fit. For example, when we are fitting regression models, we could
have one model with a slope of 0.1 and a p-value of 0.01 and another model with a slope of 0.5 and a p-value of 0.04. Both of these are statistically significant, but the second
model could have a better correlation coefficient and R-squared fit than the first model, even though the first one has a smaller p-value.
If we perform a large number of comparisons, then by chance some of them can be significant. With more comparisons, we have a higher chance of false positives. This has an
impact on our conclusions about statistically significant findings. There are methods available to compensate for these effects, but they could end up increasing the required
sample size and thus the time and cost of testing.
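The effect of multiple comparisons can be made concrete with a short calculation. This sketch assumes the tests are independent, which is a simplification, and uses the Bonferroni correction as one example of a compensating method. The function names are ours.

```python
def familywise_error_rate(alpha: float, m: int) -> float:
    """Chance of at least one false positive in m independent tests at level alpha."""
    return 1.0 - (1.0 - alpha) ** m

def bonferroni_alpha(alpha: float, m: int) -> float:
    """Per-test threshold that keeps the overall false positive rate at or below alpha."""
    return alpha / m

for m in (1, 5, 20):
    rate = familywise_error_rate(0.05, m)
    print(f"{m:>2} tests: chance of a false positive = {rate:.3f}, "
          f"Bonferroni threshold = {bonferroni_alpha(0.05, m):.4f}")
```

Under these assumptions, the chance of at least one false positive grows quickly with the number of tests, while the stricter per-test threshold makes each individual difference harder to detect.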
Recommendations
In order to address these limitations, we recommend the following guidelines when you are working with p-values.
Always formulate your hypothesis Ho and Ha first and decide your levels of risk (alpha and beta) before you collect your sample data and interpret your results.
If you are making multiple comparisons, be aware of the false positives and use techniques such as the Bonferroni or Tukey methods.
Ensure you have a sufficient sample size to keep the risk of Type I and Type II errors at the levels you have chosen.
Do not just report the p-value, but also include the alpha value you are using to determine significance.
Make sure to consider not only statistical significance when making decisions but also practical importance (is the effect you have determined practically important?).
Finally, in addition to the p-value, also report the confidence interval of the metric you are working on to get an idea of the range of variation expected.
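As a rough sketch of the last recommendation, a confidence interval for a mean can be computed as follows. It assumes the data is normally distributed and the population standard deviation is known, and it reuses the average of 13 and standard deviation of 1.5 from the earlier example with a sample size of 5. With an unknown standard deviation, the t distribution would be used instead. The function name is ours.

```python
from math import sqrt
from statistics import NormalDist

def z_confidence_interval(xbar: float, sigma: float, n: int, confidence: float = 0.95):
    """Two-sided interval for a mean when the population sigma is known."""
    z_crit = NormalDist().inv_cdf(0.5 + confidence / 2.0)  # about 1.96 for 95%
    margin = z_crit * sigma / sqrt(n)
    return xbar - margin, xbar + margin

low, high = z_confidence_interval(xbar=13.0, sigma=1.5, n=5)
print(f"95% CI: ({low:.2f}, {high:.2f})")
```

Reporting the interval alongside the p-value shows the range of values the data supports, not just whether a threshold was crossed.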