In the age of digital transformation, all successful companies collect data, but one of the most expensive and difficult problems to solve is the quality of that information. Data analysis is useless without reliable information, because the answers we derive from it could deviate greatly from reality. Consequently, we could make bad decisions.
Most organizations believe the data they work with is reasonably good, but they recognize that poor-quality data poses a substantial risk to their bottom line. (The State of Enterprise Quality Data 2016 – 451 Research)
Meanwhile, the idiosyncrasies of Big Data are only making the data quality problem more acute. Information is being generated at increasingly faster rates, while larger data volumes are innately harder to manage.
Data quality challenges
There are four main drivers of dirty data:
- Lack of knowledge. You may not know what certain data mean. For example, does the entry “2017” refer to a year, a price ($2,017.00), the number of widgets sold (2,017), or an arbitrary employee ID number? This could happen because the structure is too complex, especially in large transactional database systems, or the data source is unclear (particularly if that source is external).
- Variety of data. This is a problem when you’re trying to integrate incompatible types of information. The incompatibility can be as simple as one data source reporting weights in pounds and another in kilograms, or as complex as different database formats.
- Data transfers. Employee typing errors can be reduced through proofreading and better training. But a business model that relies on external customers or partners to enter their own data has greater risks of “dirty” data because it can’t control the quality of their inputs.
- System errors caused by server outages, malfunctions, duplicates, and so forth.
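The "variety of data" driver above can be illustrated with a small sketch. This is not from any particular system; the source names and record layout are illustrative assumptions. It shows how weights arriving in pounds from one source and kilograms from another can be normalized to a single unit before integration:

```python
# Minimal sketch: normalizing weights reported in mixed units before
# integrating two sources. Source names ("warehouse_a", "warehouse_b")
# and the record layout are illustrative assumptions.

LB_TO_KG = 0.45359237  # exact international pound-to-kilogram factor

def normalize_weight(value: float, unit: str) -> float:
    """Convert a weight to kilograms; reject units we don't recognize."""
    unit = unit.strip().lower()
    if unit in ("kg", "kilogram", "kilograms"):
        return value
    if unit in ("lb", "lbs", "pound", "pounds"):
        return value * LB_TO_KG
    # Refusing unknown units surfaces dirty data instead of hiding it.
    raise ValueError(f"Unknown weight unit: {unit!r}")

records = [
    {"source": "warehouse_a", "weight": 10.0, "unit": "lb"},
    {"source": "warehouse_b", "weight": 4.5, "unit": "kg"},
]

normalized = [normalize_weight(r["weight"], r["unit"]) for r in records]
```

The key design choice is failing loudly on an unrecognized unit rather than passing the value through, which would silently corrupt every downstream aggregate.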
Dealing with dirty data
Correcting a data quality problem is not easy. For one thing, it is complicated and expensive; benefits aren’t apparent in the short term, so it can be hard to justify to management. And as I mentioned above, the data gathering and interpretation process has many vulnerable places where error can creep in. Furthermore, both the business processes from which you’re gathering data and the technology you’re using are liable to change at short notice, so quality correction processes need to be flexible.
Therefore, an organization that wants reliable data quality needs to build in multiple quality checkpoints: during collection, delivery, storage, integration, recovery, and during analysis or data mining.
The trick is having a plan
Monitoring so many potential checkpoints, each requiring a different approach, calls for a thorough quality assurance plan.
A classic starting point is analyzing data quality when it first enters the system – often via manual input, or where the organization may not have standardized data input systems. The risk here is that data entry can be erroneous, duplicated, or overly abbreviated (e.g., “NY” instead of “New York City”). In these cases, data quality experts’ guidance falls into two categories.
First, you can act preventively on the process architecture, e.g. building integrity checkpoints, enforcing existing checkpoints better, limiting the range of data to be entered (for example, replacing free-form entries with drop-down menus), rewarding successful data entry, and eliminating hardware or software limitations (for example, if a CRM system can’t pull data straight from a sales revenue database).
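One of those preventive checkpoints, replacing free-form entries with a restricted set of values, can be sketched as follows. The allowed list and field name are hypothetical; the point is that validation happens at entry time, before bad data reaches storage:

```python
# Hedged sketch of a preventive integrity checkpoint: entries are
# validated against a closed list, mimicking a drop-down menu.
# The city list is an illustrative assumption, not a real reference set.
ALLOWED_CITIES = {"New York City", "Los Angeles", "Chicago"}

def validate_city(entry: str) -> str:
    """Accept only known city names; reject typos and abbreviations."""
    cleaned = entry.strip()
    if cleaned not in ALLOWED_CITIES:
        raise ValueError(f"Rejected at entry time: {cleaned!r} is not a known city")
    return cleaned
```

Rejecting “NY” here costs one round-trip with the user, which is far cheaper than deduplicating “NY”, “N.Y.”, and “New York City” in the warehouse later.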
The other option is to strike retrospectively, focused on data cleaning or diagnostic tasks (error detection).
Experts recommend these steps:
- Analyzing the accuracy of the data, through either making a full inventory of the current situation (trustworthy but potentially expensive) or examining work and audit samples (less expensive, but not 100% reliable).
- Measuring the consistency and correspondence between data elements; problems here can affect the overall truth of your business information.
- Quantifying systems errors in analysis that could damage data quality.
- Measuring the success of completed processes, from data collection through transformation to consumption. One example might be how many “invalid” or “incomplete” alerts remain at the end of a pass through the data.
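The last step above, counting the “invalid” and “incomplete” alerts left after a cleaning pass, might look like this minimal sketch. The required fields and sample records are assumptions for illustration:

```python
# Sketch of a post-cleaning audit: count records that still trigger
# "incomplete" (missing required fields) or "invalid" (bad values) alerts.
# Field names and the validity rule are illustrative assumptions.
REQUIRED_FIELDS = ("customer_id", "order_date", "amount")

def audit(records):
    """Return alert counts remaining after a pass through the data."""
    alerts = {"incomplete": 0, "invalid": 0}
    for r in records:
        if any(r.get(f) in (None, "") for f in REQUIRED_FIELDS):
            alerts["incomplete"] += 1
        elif not isinstance(r["amount"], (int, float)) or r["amount"] < 0:
            alerts["invalid"] += 1
    return alerts

sample = [
    {"customer_id": 1, "order_date": "2024-01-05", "amount": 19.99},
    {"customer_id": 2, "order_date": "", "amount": 5.00},          # incomplete
    {"customer_id": 3, "order_date": "2024-01-07", "amount": -3},  # invalid
]
```

Tracking these counts over successive passes gives a simple, comparable metric for whether the quality process is actually converging.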
Your secret weapon: “Data provocateurs”
None of this will help if you don’t have the whole organization involved in improving data quality. Thomas C. Redman, an authority in the field, presents a model for this in a Harvard Business Review article, “Data Quality Should Be Everyone’s Job.”
Redman says it’s necessary to involve what he calls “data provocateurs,” people in different areas of the business (from top executives to new employees), who will challenge data quality and think outside the box for ways to improve it.
Some companies even offer rewards to employees who detect process flaws where poor data quality can sneak in. This not only cuts down on errors, it has the added benefit of promoting the idea throughout the whole company that clean, accurate data is important.
Organizations are rightly concerned about data quality and its impact on their bottom line. The ones that take measures to improve their data quality are seeing higher profits and more efficient operations because their decisions are based on reliable data. They also see lower costs from fixing errors and spend less time gathering and processing their data.
The journey towards better data quality requires involving all levels of the company. It also requires assuming costs whose benefits may not be visible in the short term, but which eventually will end up boosting these companies’ profits and competitiveness.