Correlation is a measure of the linear relationship between two quantitative variables. We usually quantify the degree of relationship between the two variables using the correlation coefficient, which varies from -1 to +1. If the correlation coefficient is close to +1, we say that the variables are strongly positively correlated; if the correlation coefficient is close to -1, we say that they are strongly negatively correlated; and if the correlation coefficient is close to 0, the variables are not correlated. For example, from the figure below we can conclude that an increase in advertisement spend is negatively correlated with sales revenue, an increase in training hours is not correlated with sales revenue, and an increase in sales promotions (discounts, giveaways) is positively correlated with sales revenue.
In the last article, we looked at correlation and how to calculate the correlation coefficient. In this article, we will discuss some of the common pitfalls committed when using correlation analysis.
Lack of Adequate Sample Size
If the sample size is not adequate, then it is possible that the value we are observing for the correlation coefficient is due to chance alone; the next time we collect data, we may not see a similar pattern, as shown in the example below. To address this issue, we should look at not only the correlation coefficient but also the p-value, and ensure that the correlation coefficient is statistically significant. Sigma Magic software has a built-in algorithm that checks whether your sample size is adequate for this analysis. Make sure that you have more samples than recommended by the minimum sample size algorithm.
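As a minimal sketch of this check (using hypothetical data and SciPy's `pearsonr`, not the Sigma Magic algorithm itself), the five-point data set below produces a large correlation coefficient, yet the p-value is above 0.05, so the apparent relationship could easily be due to chance:

```python
import numpy as np
from scipy import stats

# Hypothetical data: with only 5 points, even a sizeable r can be chance
x_small = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_small = np.array([2.1, 1.5, 3.8, 3.2, 4.9])

r, p = stats.pearsonr(x_small, y_small)
# r is about 0.85, but p is above 0.05: a high r alone is not
# enough; the correlation is not statistically significant here.
significant = p < 0.05
```

With a larger sample showing the same pattern, the same `r` would come with a much smaller p-value, which is why sample size matters.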
Impact of Outliers
Outliers are data points that are significantly different from the other data points. They lie far away from the “center” of the data and can have a significant influence on the correlation coefficient. In the example below, you can see that just one outlier changes the correlation coefficient from 0.08 to 0.7. Hence, we must be careful about how we interpret correlation analysis results in the presence of outliers.
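This effect is easy to reproduce. In the hypothetical sketch below (made-up numbers, not the data from the figure), a roughly uncorrelated cloud of points gains a very strong apparent correlation the moment a single extreme point is appended:

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient between two arrays."""
    return float(np.corrcoef(x, y)[0, 1])

# Hypothetical, roughly uncorrelated cloud of points
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([5.2, 4.8, 5.5, 4.9, 5.1, 5.4, 4.7, 5.3])

r_clean = pearson_r(x, y)          # close to 0

# A single extreme point drags the coefficient sharply upward
x_out = np.append(x, 30.0)
y_out = np.append(y, 25.0)
r_with_outlier = pearson_r(x_out, y_out)   # above 0.9
```

Plotting the data before computing the coefficient is the simplest way to catch such points.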
Lack of a Linear Relationship
The correlation coefficient measures a linear relationship between the variables. If the variables are not linearly related, the correlation coefficient can show a small value even when, as the figure clearly shows, there is some sort of relationship between the two variables. Hence, before we interpret correlation coefficient values, it always makes sense to plot the variables first so that we can check whether the relationship between them is in fact “linear”.
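An extreme illustration of this (a hypothetical sketch, not the data in the figure): a perfectly deterministic quadratic relationship can yield a Pearson coefficient of exactly zero, because the relationship is symmetric rather than linear:

```python
import numpy as np

# Hypothetical, perfectly deterministic relationship: y = x**2
x = np.array([-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = x ** 2

r = float(np.corrcoef(x, y)[0, 1])
# r is 0 even though y is completely determined by x,
# because the relationship is quadratic, not linear.
# A quick scatter plot would make the relationship obvious, e.g.:
# import matplotlib.pyplot as plt; plt.scatter(x, y); plt.show()
```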
Small Effect Size
If the correlation coefficient is close to 1, it only means that there is a strong positive linear relationship between X and Y; it does not provide any indication of the slope of the line. For example, for the two cases shown below, the R value is close to 1 regardless of the slope of the data points. In the second figure below, the slope is almost 0, which means that an increase in investment has only a marginal impact on sales revenue even though the correlation coefficient is close to 1.0. Hence, when we interpret correlation coefficient values, we also have to look at the scatter plot to get an idea of the effect size – if the effect size is practically unimportant, then a large correlation coefficient value is not of much use.
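The sketch below (hypothetical investment figures) makes this concrete: two perfectly linear data sets both have R equal to 1, but a least-squares fit shows one slope of 5 and one of 0.01, i.e. one large and one practically negligible effect:

```python
import numpy as np

x = np.arange(1.0, 11.0)     # hypothetical investment amounts
y_steep = 5.0 * x + 2.0      # revenue rises 5 units per unit invested
y_flat = 0.01 * x + 2.0      # revenue barely responds to investment

# Both correlations are a perfect 1.0...
r_steep = float(np.corrcoef(x, y_steep)[0, 1])
r_flat = float(np.corrcoef(x, y_flat)[0, 1])

# ...but only the fitted slope reveals the effect size
slope_steep = np.polyfit(x, y_steep, 1)[0]   # 5.0
slope_flat = np.polyfit(x, y_flat, 1)[0]     # 0.01
```

The correlation coefficient answers "how consistent is the relationship?", while the slope answers "how big is the effect?" – both are needed.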
Correlation Does Not Imply Causation
Last but not least, let’s say that we have addressed all the above issues and we get a value for the correlation coefficient (say R = 0.87). Does that mean X causes Y? The answer is NO. This is one of the major problems with the use of correlation analysis, and it trips up a lot of people who perform this analysis. Just because the correlation coefficient between X and Y is high, we can only conclude that they are correlated; we cannot say for a FACT that X causes Y. Correlation and causation are two different things. A positive correlation implies that when X increases, Y is also found to increase. Causation implies that an increase in X in fact causes Y to increase. We should be careful about how we interpret correlation analysis results. Causation can only be established using logic or physical principles: the statistical software has no idea what data is being entered and cannot tell us anything about causation. Hence, if two variables are correlated, we need to look for an external “plausible” explanation for causality. Here are a few examples of things that are correlated but in fact are not causal.
Reverse correlation: There is no inherent order to how data is entered when performing correlation analysis, so there may be causality in one direction but not the other. For example, if we perform a correlation analysis between number of hours of study and grades, we will find a reasonably high correlation between the two. It may be reasonable to conclude that more hours of study cause better grades, but it would not be proper to conclude that better grades cause more hours of study. The causality is only applicable in one direction. Here are a few claims that only work in one direction:
Watching more TV makes children violent
Water is bad because it is the primary ingredient in herbicides and pesticides
Third Variable: Sometimes a third variable may be causing the correlation between the two variables. For example, temperature is directly correlated with ice-cream sales: more ice-cream sales happen in summer than in winter months. Similarly, more outdoor crime happens in summer months than in winter months, since more people are outdoors. So, if we were to perform a correlation analysis between ice-cream sales and outdoor crime, we would find a positive correlation. However, we cannot conclude causality here and say that ice-cream sales cause outdoor crime. We can only say that an increase in temperature causes an increase in ice-cream sales, and that an increase in temperature causes an increase in outdoor crime. Here are a few examples of third-variable correlation:
Increase in population of India causes increase in population of China
Increase in crime rate on a full moon
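One standard way to probe a suspected third variable is the partial correlation, which measures the correlation between X and Y after removing the part explained by Z. The sketch below simulates the ice-cream/crime example (made-up coefficients; temperature is the hidden common cause) and computes the partial correlation from the three pairwise coefficients:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
# Hypothetical model: temperature drives both variables
temperature = rng.normal(25.0, 8.0, n)
ice_cream = 2.0 * temperature + rng.normal(0.0, 4.0, n)
crime = 1.5 * temperature + rng.normal(0.0, 4.0, n)

def r(a, b):
    return float(np.corrcoef(a, b)[0, 1])

r_xy = r(ice_cream, crime)        # strong, but spurious
r_xz = r(ice_cream, temperature)
r_yz = r(crime, temperature)

# Partial correlation of ice cream vs crime, controlling for temperature
partial = (r_xy - r_xz * r_yz) / np.sqrt((1 - r_xz**2) * (1 - r_yz**2))
# r_xy is large, while the partial correlation is close to 0:
# the apparent link vanishes once the third variable is accounted for.
```

Note that even a near-zero partial correlation only rules out this particular third variable; it still does not prove causation.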
Chance Events: Sometimes the data is random but seems to exhibit a pattern or relationship between the two variables. If we were to collect more data, the conclusion would probably change. This may be due to the human mind attributing patterns to events when none exist. Here are a few examples:
Arthritis gets worse during cold weather
If you wet your hair you may catch a cold
Pregnant cravings depend on the sex of the child
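How often can pure chance produce an impressive-looking correlation? The simulation below (a sketch with arbitrary parameters) repeatedly draws pairs of completely independent samples of size 5 and counts how often the correlation coefficient exceeds 0.8 in magnitude. With samples this small, it happens in roughly one trial in ten:

```python
import numpy as np

rng = np.random.default_rng(42)
trials = 2000
n = 5                # tiny sample size, as in the earlier pitfall
high = 0
for _ in range(trials):
    x = rng.normal(size=n)
    y = rng.normal(size=n)   # truly independent of x
    if abs(np.corrcoef(x, y)[0, 1]) > 0.8:
        high += 1

fraction_spurious = high / trials   # roughly 0.1 for n = 5
```

This ties the chance-events pitfall back to the sample-size pitfall: small samples make striking coincidences common.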
In conclusion, correlation is a powerful analytical tool that helps us determine whether two variables are related. Problems usually arise when we try to infer causality by looking at the correlation coefficients. We need to take extra caution and look for factors outside the statistical analysis to draw inferences about causality.