Organizations today are investing heavily in data analysis to leverage the vast amounts of data being collected and improve business performance. Data scientists and analysts are expected to analyse the data correctly and draw the right conclusions. Decisions based on flawed analysis could cause companies to invest large amounts of capital in recommendations that may not provide much value. Hence, it is important that sufficient care and caution be used in performing the right analysis and correctly interpreting the results. In this article, we will review a few of the commonly made mistakes in data analysis.
Mistaking correlation for causation
All data analysts are aware of the difference between correlation and causation. However, this difference still trips up people who are analysing data and leads them to bizarre conclusions. Causation implies that A and B have a cause-and-effect relationship: A causes B to occur. Correlation, in contrast, simply means that A is related to B, but one may not be the reason for the other happening. We can check whether variables are correlated by using a scatter plot. If they are positively correlated, then as one variable increases the other also tends to increase. Similarly, they are negatively correlated if an increase in one variable corresponds to a decrease in the other. There is no correlation if the two variables are independent of each other. Just because two variables are correlated does not mean that there is a causal relationship. There are some reasons why we get the two mixed up:
Sometimes A causes B but B does not cause A. When we plot the two variables, the software has no clue which is which, so it will show a strong correlation between A and B either way. Note that the correlation does not change whether you plot A on the X axis and B on the Y axis or vice versa. We should be aware of the direction of causality to draw the right conclusions. For example, consider a coffee-making machine, which extracts caffeine to make coffee. If we are monitoring two variables, the extraction time and the caffeine percentage, then it might look as though the extraction time depends on the caffeine percentage; however, the true relationship is the reverse.
Another example: successful companies pay large dividends, so we can expect some correlation between these two variables. However, if we incorrectly conclude that large dividends cause a company to be successful, we may end up bankrupting the company by paying out large dividends instead of using the retained profits to grow the company.
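The point that correlation carries no direction can be checked numerically. The following is a minimal sketch using made-up data for the coffee machine example; the variable names, the noise level, and the assumed linear relationship are illustrative assumptions, not measurements.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

rng = random.Random(0)

# Hypothetical data: extraction time (seconds) drives caffeine percentage,
# with some random noise added. The coefficients are invented for illustration.
extraction_time = [rng.uniform(10, 40) for _ in range(200)]
caffeine_pct = [0.05 * t + rng.gauss(0, 0.2) for t in extraction_time]

r_time_vs_caffeine = pearson(extraction_time, caffeine_pct)
r_caffeine_vs_time = pearson(caffeine_pct, extraction_time)

# The two values are identical: swapping the axes does not change the
# coefficient, so it cannot tell us which variable is the cause.
print(r_time_vs_caffeine, r_caffeine_vs_time)
```

Because the coefficient is the same in both directions, the direction of causality has to come from knowledge of the process, not from the number itself.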
Sometimes when we plot two variables A and B against each other, they look strongly correlated, but the reason could be that both A and B are actually correlated with time. If A and B both increase over a period of time, then a plot of A vs. B looks like there is a relationship between the two. This may lead us to conclude that A and B are directly related when in fact they are not. For example, if we plot the market share of Internet Explorer against the number of murders, as shown in the figure below, there seems to be some correlation! However, the two are independent of each other and both simply happen to change over time. Hence, before drawing a conclusion we need to check that it makes sense and can be explained using a rational thought process.
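One common way to check for this effect is to correlate the period-to-period changes instead of the raw values. The sketch below uses two simulated series that are unrelated except for a shared upward trend; the series, slopes, and noise levels are assumptions chosen purely for illustration.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

rng = random.Random(1)
n_periods = 200

# Two hypothetical series that both grow with time but are otherwise independent.
series_a = [0.5 * t + rng.gauss(0, 5) for t in range(n_periods)]
series_b = [0.3 * t + rng.gauss(0, 5) for t in range(n_periods)]

# Correlation of the raw levels is dominated by the shared time trend.
levels_r = pearson(series_a, series_b)

# First differences remove the linear trend, leaving only the noise.
diff_a = [series_a[i + 1] - series_a[i] for i in range(n_periods - 1)]
diff_b = [series_b[i + 1] - series_b[i] for i in range(n_periods - 1)]
diff_r = pearson(diff_a, diff_b)

print("levels:", levels_r, "differences:", diff_r)
```

In this simulation the raw series appear strongly correlated, while the differenced series do not, which is the signature of a correlation driven by time alone.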
Third Variable Problem
This is similar to the earlier case, but the third variable is not time but some other intermediate variable. Here A causes B and A also causes C, but when we plot B and C together it looks like B causes C, which may not be true. The right conclusion would be that A causes both B and C. Let us look at the example of a building on fire. On the basis of the data we collect, we might falsely conclude that the greater the number of firemen, the more damage is caused to the building. This is clearly not the right conclusion. The intermediate variable in this case is the size of the fire. If the fire is large, then it is likely that more firemen were called to help put it out, and the large fire also caused greater damage. Always look for an intermediate explanatory variable before you draw conclusions from your data.
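The firemen example can be sketched with simulated data. Here the fire size is assumed to drive both the number of firemen and the damage, with invented coefficients. The partial correlation (the standard first-order formula) then estimates how strongly firemen and damage are related once fire size is accounted for.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def partial_corr(r_xy, r_xz, r_yz):
    """Correlation of x and y after controlling for a third variable z."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

rng = random.Random(2)

# Hypothetical model: fire size causes BOTH the crew size and the damage.
# Firemen have no direct effect on damage in this simulation.
fire_size = [rng.uniform(1, 10) for _ in range(500)]
firemen = [3 * s + rng.gauss(0, 2) for s in fire_size]
damage = [10 * s + rng.gauss(0, 8) for s in fire_size]

raw_r = pearson(firemen, damage)
controlled_r = partial_corr(
    raw_r,
    pearson(firemen, fire_size),
    pearson(damage, fire_size),
)

print("firemen vs damage:", raw_r)
print("controlling for fire size:", controlled_r)
```

The raw correlation is strong, but it largely disappears once fire size is controlled for, which matches the explanation that the fire, not the firemen, drives the damage.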
When we are working with data analysis, we are more interested in causation than correlation, even though correlation is a lot easier to determine using either graphical or statistical means. Determining causality is never perfect in the real world. However, there are a number of techniques that we can use to find evidence of a causal relationship. We can conduct a controlled randomized experiment to determine if there is causality. For example, we can conduct an experiment for a given level of fire and see if sending different numbers of firemen causes more damage. More often than not, if you are concluding causality there needs to be a plausible physical explanation for it. If your conclusion does not pass this “common sense” test, then you are probably not right in concluding causality, and additional randomized experiments may be required to claim it.
Choosing the wrong visualization tools
Data analysts mostly focus on the technical aspects of data analytics; however, some analytical tools can be a “black box”, and it may be difficult for us to really understand how the models actually work. Hence, data visualization plays a key part in deriving appropriate insights from the data. Even the best models will not give proper insights if the visualization is poorly chosen. Many data scientists choose the visualization type based on aesthetics instead of the characteristics of the dataset. This can be avoided by defining the goal of the visualization as the first step. It is also necessary that data analysts become familiar with the available visualization methods in order to obtain effective results. Analysts should try multiple ways of visualizing the problem and use the one that best depicts what they are trying to bring out from the analysis.
Focussing solely on Data
We must understand that data alone is not enough to build an effective model. Many times, data analysts start implementing models and creating charts without even considering whether the analysis will be advantageous to the organisation. Data analysts must also develop the required business acumen and not give too much decision-making power to data alone. Hence, the first question that we need to answer is: what is the business problem that we are trying to solve? Only then should we look for the appropriate data to help us answer that question. More often than not, an organization already has access to tons of historical data, and trying to find a business problem to solve from that data is not the right approach. Organisations that hire data analysts must also look for a combination of technical knowledge and domain understanding. In short, first formulate from a business point of view what the right problems to solve are, and then determine what data is needed to answer them. Some data may already be available; for the remaining missing data, create a data collection plan so that it can be collected and made available for future analysis.
Ignoring the Probabilities
Sometimes data analysts overlook the probabilities associated with a particular solution, which can lead to wrong decisions. Input A will not always give output B; there could be other factors at play as well. Therefore, informed decisions have to be made considering all the factors, and before concluding anything, the complete scenario of the model and its probabilities must be understood. Scenario planning and probability theory are two essential aspects of data science that must not be overlooked. The question to answer is: are we seeing these results due to chance, or is there a real pattern in the data?
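One simple way to approach the chance-versus-pattern question is a permutation test. The sketch below is one possible approach, not the only one. It shuffles the group labels many times and counts how often a difference at least as large as the observed one appears by chance alone. The function name and the sample values are hypothetical.

```python
import random

def permutation_p_value(group_a, group_b, n_permutations=2000, seed=0):
    """Estimate how often a difference in means at least as large as the
    observed one arises when group labels are assigned at random."""
    rng = random.Random(seed)
    n_a = len(group_a)
    observed = abs(sum(group_a) / n_a - sum(group_b) / len(group_b))
    pooled = list(group_a) + list(group_b)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        shuffled_a, shuffled_b = pooled[:n_a], pooled[n_a:]
        diff = abs(sum(shuffled_a) / n_a - sum(shuffled_b) / len(shuffled_b))
        if diff >= observed:
            at_least_as_extreme += 1
    # Add one to numerator and denominator so the estimate is never exactly zero.
    return (at_least_as_extreme + 1) / (n_permutations + 1)

# Hypothetical measurements of some metric under two conditions.
clearly_different_a = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 10.1, 9.7]
clearly_different_b = [12.0, 11.8, 12.3, 12.1, 11.9, 12.2, 12.0, 11.7]
similar_a = [10.0, 10.4, 9.6, 10.2, 9.9, 10.1, 9.8, 10.3]
similar_b = [10.1, 9.7, 10.3, 10.0, 10.2, 9.8, 10.4, 9.9]

p_different = permutation_p_value(clearly_different_a, clearly_different_b)
p_similar = permutation_p_value(similar_a, similar_b)

print("clearly different groups:", p_different)
print("similar groups:", p_similar)
```

A small value suggests the difference is unlikely to be a chance occurrence, while a large value means the data give no reason to believe there is a real pattern.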
Choosing the wrong analytics tools
The final mistake we will discuss in this article is picking the wrong analytics tools. There are multiple tools that can be used to attack any problem. If an analyst is comfortable working with a certain set of tools, they may end up using those tools for all types of problems, even when they are not best suited to the application. As the saying goes, “If you have a hammer, everything looks like a nail!” We need to consider several factors such as the amount of data available, the type of data available, the amount of data required for training, the robustness and stability of the algorithms, applicability to the given scenario, scalability, etc. It is important to be aware of the advantages and limitations of each of the analytics tools and use the most appropriate tool for the given situation.
In summary, there are a lot of powerful analytics tools out there that can help organizations transform their business by leveraging the power of data. However, there are also several pitfalls that we need to be aware of and watch out for. By avoiding these problems, we can help our organizations get the most benefit out of these tools.