Data Quality Explained – The 4 core dimensions of data quality
One of my favorite quotations of all time is the following:
On two occasions, I have been asked, “Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?” … I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question.
— Charles Babbage, Passages from the Life of a Philosopher
In computer lingo, this is known as GIGO:
Garbage in, garbage out (GIGO) in the field of computer science or information and communications technology – it refers to the fact that computers, since they operate by logical processes, will unquestioningly process unintended, even nonsensical, input data (“garbage in”) and produce undesired, often nonsensical, output (“garbage out”).
The idea is clear, but when you look at data, what is garbage and what is good data?
I have chosen what I feel are the four most important metrics of data quality from the six offered by the DAMA UK Working Group.
For each metric I will give a scenario (I will try to avoid computer speak) to show why it is relevant.
You have a data set from your Payroll system which describes your workforce. Before using the data set, did you consider whether all your employees are included in it, from executives through to temporary staff? Are you perhaps running more than one payroll, or do your different operations each run their own?
OK, so you have the data, but did you know that your HR staff do not bother to fill in the address details of employees? Information that is not crucial to successfully processing the month-end payroll is often not validated; for example, the home address of an employee is left blank or, even worse, guessed at.
Completeness, in essence, is whether all the expected records are present and all the available data fields are populated.
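For readers who do want a little computer speak, here is a minimal Python sketch of how the two completeness questions above could be checked. It is only an illustration under my own assumptions: the field names (`employee_id`, `home_address`) and the idea of an independent list of expected employee ids are hypothetical, not something taken from a real Payroll system.

```python
def is_populated(value):
    """A field counts as populated if it is not None and not just whitespace."""
    return value is not None and str(value).strip() != ""


def fill_rates(records, fields):
    """Return, per field, the share of records that have a populated value."""
    rows = list(records)
    rates = {}
    for field in fields:
        populated = sum(1 for row in rows if is_populated(row.get(field)))
        rates[field] = populated / len(rows) if rows else 0.0
    return rates


def missing_employees(expected_ids, records, id_field="employee_id"):
    """Return the expected ids that do not appear in the data set at all."""
    present = {row.get(id_field) for row in records}
    return set(expected_ids) - present


# Hypothetical payroll extract: three of the four home addresses are unusable.
payroll = [
    {"employee_id": "E1", "home_address": "12 Main Rd"},
    {"employee_id": "E2", "home_address": ""},
    {"employee_id": "E3", "home_address": None},
    {"employee_id": "E4", "home_address": "   "},
]

rates = fill_rates(payroll, ["employee_id", "home_address"])
absent = missing_employees(["E1", "E2", "E3", "E4", "E5"], payroll)
```

Note that a fill rate cannot catch a guessed address; it only tells you where values are absent.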
You have gathered your set of data describing your workforce, and after making sure it's complete you realize that some data fields are missing; you can get those from the HR system but not from the Payroll system. So you think a simple merge will do the job. Both data sets contain the id number of the individual but, would you believe it, they differ! What now?
Which version of the truth is the real truth, or is it a mix?
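Before attempting the merge, it helps to list exactly which people have conflicting id numbers and which appear more than once. The sketch below is one way to do that in Python; it assumes, purely for illustration, that both systems also share a hypothetical `employee_no` field that can be used to line the records up.

```python
from collections import Counter


def duplicate_keys(records, key):
    """Return the key values that occur more than once in one data set."""
    counts = Counter(row[key] for row in records)
    return sorted(k for k, n in counts.items() if n > 1)


def conflicting_ids(payroll, hr, key="employee_no", id_field="id_number"):
    """Return (key, payroll id, HR id) for people whose id numbers differ."""
    hr_by_key = {row[key]: row for row in hr}
    conflicts = []
    for row in payroll:
        other = hr_by_key.get(row[key])
        if other is not None and other[id_field] != row[id_field]:
            conflicts.append((row[key], row[id_field], other[id_field]))
    return conflicts


# Hypothetical extracts from the two systems.
payroll = [
    {"employee_no": "1001", "id_number": "7501015009087"},
    {"employee_no": "1002", "id_number": "8002025009081"},
    {"employee_no": "1002", "id_number": "8002025009081"},
]
hr = [
    {"employee_no": "1001", "id_number": "7501015009087"},
    {"employee_no": "1002", "id_number": "8002025009018"},
]

duplicates = duplicate_keys(payroll, "employee_no")
conflicts = conflicting_ids(payroll, hr)
```

The code can only show you where the two systems disagree. Deciding which value is the real one still needs a person, or a rule about which system is the master for that field.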
You would like to do some stats on where your workforce lives. Great, the Payroll data did not have addresses, but the HR data does. You do the stats and all looks good until you find out that the address in the HR system is only captured on engagement and never updated again.
Data that is not updated on a regular basis is worse than having no data, because your reports turn out to be a lie. Each data element has its own time sensitivity: addresses might be good enough if they are updated once a year, but how much tax the employee paid needs to be updated each month.
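The idea that each data element has its own shelf life can be written down as a small table of maximum ages. The Python sketch below is an illustration only: the field names, the 365 and 31 day limits, and the assumption that the system records a last-updated date per field are all mine.

```python
from datetime import date, timedelta

# Hypothetical limits: addresses yearly, tax paid monthly.
MAX_AGE = {
    "home_address": timedelta(days=365),
    "tax_paid": timedelta(days=31),
}


def stale_fields(last_updated, today, max_age=MAX_AGE):
    """Return the fields that were never updated or are older than allowed."""
    return sorted(
        field
        for field, limit in max_age.items()
        if field not in last_updated or today - last_updated[field] > limit
    )


# An employee engaged in 2014 whose address was never touched again.
employee_updates = {
    "home_address": date(2014, 3, 1),
    "tax_paid": date(2016, 11, 25),
}

stale = stale_fields(employee_updates, today=date(2016, 12, 7))
```

A check like this only works if the source system keeps an update date for each field. If it does not, that is the first thing to ask for.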
Now you need to do some stats on the educational level of your workforce per MQA definitions. Great, educational level is in your data set. You group it and, oops, you have alphanumeric codes and numeric codes in there. What happened? The long and the short of it is that the HR system does not validate data entries. After a lot of data scrubbing you realize that the data speaks to two different sets of MQA levels, because the definitions changed over time.
Data definitions and their implementation need to be maintained over time, and validity should be enforced at the point of entry.
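As a rough sketch of what validation at the point of entry could look like, the Python below checks a captured code against the set of codes that was in force on the capture date. The code sets and dates are placeholders I made up for the example; they are not real MQA levels.

```python
from datetime import date

# Placeholder code sets, ordered by effective date. NOT real MQA levels.
CODE_SETS = [
    (date(2000, 1, 1), frozenset({"1", "2", "3", "4"})),
    (date(2010, 1, 1), frozenset({"L1", "L2", "L3", "L4", "L5"})),
]


def allowed_codes(on_date):
    """Return the code set that was in force on the given date."""
    current = None
    for effective_from, codes in CODE_SETS:
        if effective_from <= on_date:
            current = codes
    if current is None:
        raise ValueError(f"no code set defined for {on_date}")
    return current


def validate_education_level(code, captured_on):
    """Return the cleaned code, or raise ValueError if it is not allowed."""
    cleaned = code.strip().upper()
    if cleaned not in allowed_codes(captured_on):
        raise ValueError(f"{code!r} is not a valid level on {captured_on}")
    return cleaned


new_entry = validate_education_level(" l3 ", date(2016, 12, 7))
old_entry = validate_education_level("3", date(2005, 6, 1))
```

Keeping the effective dates next to the code sets is what lets old records stay valid under the old definitions while new entries are held to the current ones.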
I used simple scenarios to explain the quality dimensions, and often the data issues described are simple to solve.
But, and this is a big but, the ideal way to solve data issues is through systems and the automation of processes. This, however, is going to upset someone in your organization, because people don’t like change and doing it right is going to cost money.
Usually organizations settle for fixing small system things, like adding validation to input screens, and tweaking processes to try and help people capture and record the data better. This will work if retraining people at their jobs can easily be rolled out in your organization, or if your employees are open to improvements and advice. You also need to make sure that process knowledge and enforcement don’t go out the window with the inevitable employee turnover.
In a future article, I will introduce our business intelligence system, Insite, and how you could use it to identify issues with your data. Insite was designed to highlight and address data-related challenges associated with completeness, uniqueness, timeliness and validity.
December 7, 2016 | By Phil Marneweck (CTO at MTS)