Data is the lifeblood of any business that wants to make fact-based decisions and use data to transform itself. However, one of the biggest roadblocks for companies in making good decisions is the availability and quality of the right data. Data is usually stored either on-premises, for example in a data warehouse, or in the cloud. Typically, it is spread across multiple systems: SAP may hold a lot of transactional data, Salesforce may hold customer relationship management data, and other company data may live in Oracle, AWS, Google, or Azure environments. Each business process, such as hiring, logistics, accounting, marketing, operations, and customer service, generates a ton of data that is stored in these various systems. In addition, products and services now also generate real-time data from customer usage, customer interaction points, service requests, etc., which is also being stored. Ideally, all of this data could be mined to provide a competitive advantage for the company, but unfortunately, the data in these systems is rarely "perfect" and needs to be cleaned before it is used. Thomas Redman estimated in a Harvard Business Review article that bad data costs US companies around $3 trillion per year. Some of the ways in which poor data can cause losses are:
In marketing, poor data can lead to incorrect assessments of marketing programs, resulting in irrelevant offers and lost business.
Incorrect sales data may result in lost business because the company cannot properly leverage the data it has to convert a sale.
Erroneous or missing data may expose companies to penalties for failing to comply with regulations.
Low-quality operational data can reduce process efficiency and result in problems not being addressed promptly.
In healthcare, poor data can cause wrong diagnoses, incorrect medications, and poor patient outcomes.
In manufacturing, bad data can cause delivery problems and delay the decisions needed to resolve issues, resulting in poor product/service quality, longer lead times, and higher costs.
Analysis tools such as Artificial Intelligence predictive models are significantly hampered and perform sub-optimally when bad quality data is provided as input to the model.
Types of Data Problems
Good quality data is valid (meets the constraints defined for that data), accurate (close to the true value), complete (no missing values), consistent (between different data sets), and uniform (uses the same units of measure). The following types of data issues are typically encountered:
Missing data – for example, you are conducting a survey and not all the required fields are filled out. This is usually a problem when the systems in use do not enforce the constraint that a form cannot be submitted unless all required fields are completed. It can also arise when information was captured in legacy systems: fields that are mandatory in the newer systems were not required earlier.
Incorrect data – for example, you are collecting customer email or physical addresses, but wrong information has been entered into the systems. This is harder to catch since some text exists in these fields, but the information is not correct. Basic checks can ensure the field is formatted correctly, but that only solves part of the problem; a user may still find a way to circumvent the check. Incorrect data can have a number of causes:
Data entered incorrectly by the customer/supplier/data entry operator.
Data obtained from an instrument that has measurement error and is not accurate.
Data that has been incorrectly translated from another application/source.
Data stored in various systems are not uniform and use different standards.
Inconsistent data – this is similar to the earlier case of incorrect data, but the problem arises when similar data is stored in different tables/databases. If the values do not agree with each other, it is hard to determine which one to use. One workaround is to time-stamp records and use the most recently updated value. Ideally, each data set should be stored once and linked to other databases/tables as required, to avoid duplication and maintain a single master for the data set.
Insufficient data – the data collection was not designed to capture all the data required for the analysis. Perhaps the data was historically collected for other purposes, and it is not sufficient to support new areas or applications. If the required data is not being collected, it is difficult to make good decisions from partial information.
An example table with missing data, duplicate data and incorrect data is shown in the figure below.
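The three problems in such a table can be flagged programmatically. The following sketch uses a small hypothetical customer table (the records, field names, and email rule are illustrative, not taken from the figure) to detect a missing value, an invalid value, and a duplicate row:

```python
import re

# A small customer table with typical data problems: a missing email,
# an incorrect email, and a duplicate row. The data is hypothetical.
rows = [
    {"id": 1, "name": "Asha",  "email": "asha@example.com"},
    {"id": 2, "name": "Bruno", "email": ""},                  # missing
    {"id": 3, "name": "Chen",  "email": "chen-at-example"},   # incorrect
    {"id": 1, "name": "Asha",  "email": "asha@example.com"},  # duplicate
]

# A deliberately simple email pattern for illustration.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

missing = [r["id"] for r in rows if not r["email"]]
invalid = [r["id"] for r in rows if r["email"] and not EMAIL_RE.match(r["email"])]

seen, duplicates = set(), []
for r in rows:
    key = (r["id"], r["name"], r["email"])
    if key in seen:
        duplicates.append(r["id"])
    seen.add(key)

print(missing, invalid, duplicates)  # -> [2] [3] [1]
```

Each list of offending `id`s can then drive a cleanup step: impute or chase the missing value, correct the invalid one, and drop the duplicate.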
Best Practices for Data Cleaning
Use the following best practices to clean the data prior to use:
Create a data collection plan for the company – what type of data needs to be stored, why it is required, who is responsible for collecting it, where it will be stored, what the operational definition of the data is, how much data will be retained, etc. This is an important document and can be used to identify gaps between what data is currently being compiled and what is required. Two types of issues can surface: the data required for analysis is not available, or data that is currently being collected is not actually used. Both problems can be avoided by having a proper data collection plan.
Ensure that the data is stored in the right data type/format. For example, continuous data can be converted to discrete data later, but discrete data cannot be converted back to continuous data. So, make sure that the right data types are used and store the original, most granular data wherever possible.
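The one-way nature of this conversion can be shown in a few lines. The thresholds below are illustrative, not a standard:

```python
# Continuous readings can always be binned into discrete categories
# later, but the reverse loses information: from "elevated" alone you
# cannot recover 38.6. Data and cut-offs are hypothetical.
temperatures_c = [36.4, 37.1, 38.6, 39.2]  # continuous readings

def to_category(t):
    """Bin a continuous temperature into a discrete category."""
    if t < 37.5:
        return "normal"
    elif t < 39.0:
        return "elevated"
    return "high"

categories = [to_category(t) for t in temperatures_c]
print(categories)  # -> ['normal', 'normal', 'elevated', 'high']
```

Storing only `categories` would discard the underlying readings, which is why the original continuous values should be retained.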
One of the biggest sources of data error is the point of data collection. The people performing data entry, and the products automatically providing data at the source, need to be as accurate as possible. If the data is bad at the point of entry, no amount of massaging later can help. Make sure that the right data is entered into the systems. Are the standards clear? Are measurement systems analysis checks being performed to ensure that the data is of high quality?
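One way to enforce quality at the entry point is to validate each record before it is accepted into storage. A minimal sketch, assuming hypothetical field names and a simple illustrative email rule:

```python
import re

# A deliberately simple email pattern for illustration.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def validate_record(record):
    """Return a list of problems; an empty list means the record is clean."""
    problems = []
    # Required fields must be present and non-empty.
    for field in ("name", "email"):
        if not record.get(field):
            problems.append(f"missing required field: {field}")
    # If an email was supplied, it must look like an address.
    email = record.get("email", "")
    if email and not EMAIL_RE.match(email):
        problems.append("email is not a valid address")
    return problems

print(validate_record({"name": "Asha", "email": "asha@example.com"}))  # -> []
print(validate_record({"name": "", "email": "not-an-email"}))
```

A record is only written to the database when `validate_record` returns an empty list; everything else is sent back to the source for correction.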
Remove all duplicate data storage points. Review the data storage architecture within your company and establish a unique storage location for each data set. Other areas that need the data should link to the master through relational databases. Having a single master/source of truth makes it possible to fix the data in one location so that all users of the data get the right information.
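The single-master idea maps directly onto a relational schema: dependent tables reference the master by key instead of copying its fields. A minimal sketch using Python's built-in SQLite (table and column names are illustrative):

```python
import sqlite3

# A single "master" customers table; the orders table links to it by
# foreign key instead of duplicating customer details.
con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")
con.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT UNIQUE)")
con.execute("""CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customers(id))""")

con.execute("INSERT INTO customers VALUES (1, 'asha@example.com')")
con.execute("INSERT INTO orders VALUES (10, 1)")

# Fixing the email once in the master fixes it for every consumer.
con.execute("UPDATE customers SET email = 'asha@newdomain.com' WHERE id = 1")
email = con.execute("""SELECT c.email FROM orders o
    JOIN customers c ON c.id = o.customer_id
    WHERE o.id = 10""").fetchone()[0]
print(email)  # -> asha@newdomain.com
```

Because the order row stores only `customer_id`, the corrected email is picked up automatically by every query that joins to the master.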
The next biggest source of data error is the processing and reporting of raw data. Check all the algorithms used for data processing. If there are errors in processing, such as incorrect units or translation errors due to incorrect mapping of fields, then wrong data will be fed into the analysis algorithms. Ensure there are no translation errors wherever the data is used.
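Unit mix-ups are the classic processing error: two source systems report the same quantity in different units, and the pipeline merges them as if they were comparable. One defensive pattern is to convert everything to a canonical unit at ingestion. A sketch (the records are hypothetical; the inch-to-centimetre factor is a real constant):

```python
# Convert every incoming length to centimetres, the canonical unit,
# before any downstream processing touches it.
TO_CM = {"cm": 1.0, "in": 2.54, "mm": 0.1}

def to_canonical(value, unit):
    """Convert a length in the given unit to centimetres."""
    return value * TO_CM[unit]

# Raw records from source systems that use different units.
raw = [(10.0, "in"), (25.4, "cm"), (254.0, "mm")]
lengths_cm = [round(to_canonical(v, u), 6) for v, u in raw]
print(lengths_cm)  # -> [25.4, 25.4, 25.4]
```

Storing the unit alongside the value, and converting at a single choke point, means a mapping error shows up once (in `TO_CM`) rather than being scattered through every report.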
The above steps will ensure that the data currently being collected is of high quality; however, companies will also have a ton of legacy data collected in the past. Spend the resources (time, money, effort) to fix this existing data so that all data within the company is of high quality. This may be especially important if you are using legacy systems.
Develop appropriate metrics to report data quality issues throughout the organization, with clear roles, responsibilities, and accountability for fixing problems when they occur. This ensures that data quality metrics are brought to leadership's attention, and if they are dropping over time, appropriate actions can be taken to address the problem areas before it is too late.
Keep track of all the data quality problems you have encountered and perform a root cause analysis for every error. You can use tools like decision trees, 5-why analysis, and Is/Is-Not analysis to understand the causes of the data errors. Once the root causes are understood, you can put fixes in place so that future data quality improves.
Tools for Data Cleaning
If the data size is small, you can use the standard functionality available within Excel to review your data and fix any issues. Sigma Magic software also has standard functionality for handling missing or non-standardized data. For missing data, you can either delete those data points or replace the missing values with the best estimates from the remaining data. Non-standardized data can likewise be standardized.
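The replace-with-an-estimate approach can be sketched in a few lines. Here the missing values are filled with the mean of the observed values, one simple imputation strategy among several (the data is illustrative):

```python
from statistics import mean

# Replace missing numeric values (None) with the mean of the
# observed values -- a simple imputation strategy.
values = [12.0, None, 15.0, None, 18.0]

observed = [v for v in values if v is not None]
fill = mean(observed)  # mean of 12, 15, 18 is 15.0
imputed = [v if v is not None else fill for v in values]
print(imputed)  # -> [12.0, 15.0, 15.0, 15.0, 18.0]
```

Mean imputation keeps the data set complete but shrinks its variance, so deleting the rows instead may be the better choice when only a few values are missing.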
There are a number of tools that can cross-check your data between different tables. Some can also verify the data stored internally within the company against publicly available information. Checks that can be performed include: missing fields, incorrect data types, invalid characters, validity of the data (for example, end dates should be greater than start dates), and validation against external sources (for example, a match between zip code and state).
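Two of the checks just mentioned can be sketched directly: end dates must follow start dates, and a zip code must agree with its state. The zip-prefix table below is a tiny illustrative sample, not a complete mapping:

```python
from datetime import date

# Hypothetical sample: first three zip digits -> expected state.
ZIP_PREFIX_TO_STATE = {"100": "NY", "900": "CA"}

def check_record(rec):
    """Return a list of cross-field validity problems for one record."""
    problems = []
    if rec["end"] < rec["start"]:
        problems.append("end date precedes start date")
    state = ZIP_PREFIX_TO_STATE.get(rec["zip"][:3])
    if state is not None and state != rec["state"]:
        problems.append("zip code does not match state")
    return problems

good = {"start": date(2023, 1, 1), "end": date(2023, 2, 1),
        "zip": "10001", "state": "NY"}
bad = {"start": date(2023, 2, 1), "end": date(2023, 1, 1),
       "zip": "90001", "state": "NY"}
print(check_record(good))  # -> []
print(check_record(bad))
```

Run over a whole table, the non-empty problem lists become exactly the kind of data quality report these tools publish to stakeholders.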
These tools can be run at regular intervals and produce data quality reports that can be published to the relevant stakeholders. As an example, the following screenshot shows a view of the TIBCO Clarity tool, where you can define data quality rules for each field, which the software will then validate against your database.
A few of the tools that can be used for data cleaning are listed below in alphabetical order. Note that some of these are freeware tools while others are paid commercial software.