Toward Unified Standards for Measuring Data Quality (Part 1)

July 26, 2013. As data stewards take on broader authority and greater responsibility, and as information management and data quality methodologies need to be applied at every structural level of the organization, it is time to talk about adopting unified standards that make it possible to build a set of metrics for measuring the effectiveness of information management processes and of the business as a whole. (This material is published in English.)
As information management and data quality roles and responsibilities become more mainstream in large organizations, there has been a call to agree upon standard categories of data quality. Malcolm Chisholm’s recent Information Management article suggests that there is no consensus regarding the dimensions of data quality. This complaint is not new. Yair Wand and Richard Y. Wang further argue that the expected value of dimensions of quality hasn’t been seen and that the dimensions even create a distraction.

Whether you call these categories “dimensions” [of data quality] or something else is a discussion for another time. I think Malcolm Chisholm’s proposal to call these “properties” makes a lot of sense, and I appreciate his ability to cut to the chase.

Having said that, I believe we’d be throwing out the baby with the bathwater if we dismissed the writing of multiple authors on the dimensions of data quality just because there isn’t a current consensus. Now is the right time for the data quality industry to finalize a set of standards, much like the accounting field has done with the Generally Accepted Accounting Principles. Every organization needs to have a defined set of measures of quality, which should be composed of industry standard dimensions. Each organization should then identify its unique needs for measurement. In this series of articles, I will document the level of consistency between six authors’ definitions of each of the dimensions of quality. The first of these, “accuracy,” is covered in this article.

In the capstone article, I will propose a conformed set of dimensions that incorporates the six authors’ definitions and my own experience.

It is my expectation that this article will speed up the industry’s rationalization of the dimensions of quality. This series of articles will answer two of the three challenges identified by Malcolm Chisholm:
  1. Define the concepts that compose the dimensions of quality and propose an alignment of the major contributors’ works, which is the first step to defining the dimensions themselves.
  2. Compile a thorough list of the underlying concepts of the dimensions of data quality, with the expectation that this work will cover the majority of all concepts.
When discussing the level of agreement on the dimensions of quality, consensus of definition should be measured within its intended scope. Dimensions of quality are most often implemented as a part of a broader data quality/governance effort and, as such, are determined and maintained within a given unit of authority, like the data governance board of an organization. There is authority given to them by the leadership of that organization and consensus is only required within that group (or within the data management roles across the company). This limits the scope of consensus building, making it feasible, compared to requiring consensus among all employees, companies, industries, etc. In this context, the dimensions may be considered principles to organize and direct change, rather than fixed laws, which would require stronger controls and global consensus.

The first challenge is to collect each author’s definition, and I have done so for six mainstream authors. I realize that not every contributor or author can be reviewed for this article, but, as Danette McGilvray pointed out, some authors (including herself) established their dimensions of data quality not to identify the root concept and associated dimension, but rather by type of remediation method or technique. In the process, it is also helpful to reference these dimensions in the context of the groupings in which each author explained them. Here are a few schemes for grouping the dimensions, by author. Unfortunately, I don’t have room within this article to compare them.
  • Tom Redman: Dimensions can be grouped by those having to do with a data model, data value or data presentation.
  • Larry English: Dimensions can be grouped by information content or information presentation.
  • David Loshin: Dimensions can be grouped as intrinsic, a measurement associated with data values themselves; contextual, in terms of relationship between records; qualitative, a synthesis of measures associated with intrinsic and contextual; or classifying.
  • Yang W. Lee, Leo L. Pipino, James D. Funk and Richard Y. Wang: Intrinsic IQ - accuracy, objectivity, believability and reputation; Accessibility IQ - accessibility and security; Contextual IQ - relevancy, value added, timeliness, completeness and amount of information; Representational IQ - interpretability, ease of understanding, concise representation and consistent representation.
Please note that there are other data quality subject matter experts that could be added to this list, including but not limited to: Arkady Maydanchik, Danette McGilvray, Jack Olson, Carlo Batini and Monica Scannapieco.

Rather than pick at semantic differences between each of the definitions listed in Table 1, let’s look at the conceptual similarities, which have been underlined. In Table 2, the three primary concepts that encompass accuracy have been identified with key quotes extracted from each author’s definition.

Concept similarity:
  • Five out of the six authors explicitly cite “agreement with the real world” as a component of accuracy.
  • Four of the six say that data should “Match To Agreed Source.”
  • Two authors include precision (the exactness of data, like the number of digits a number must include or if rounding is allowed).
If our goal is to identify consensus and disaggregate the concepts that overload this dimension, we could separate out “precision of the data” as its own dimension (as Table 3 shows three authors have done).

The goal of disaggregation is to make communication more precise and remove assumptions. So if a dimension isn't broadly known to include a particular concept, I suggest that, in theory, it is easier to remove this concept without changing the known meaning. As you can see above, if we break Precision out of Accuracy, then five out of six sources would agree with regard to a Precision dimension with the primary concept of “Precision of Data Value” (number of decimal places and rounding).
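As a rough illustration of what a standalone Precision measure might look like once it is separated from Accuracy, the sketch below counts decimal places and reports the share of values that carry the required number. The function names and the idea of storing values as strings are my own assumptions for the example, not something taken from any of the six authors.

```python
from decimal import Decimal


def decimal_places(value: str) -> int:
    """Number of digits after the decimal point in a finite numeric string."""
    # Assumption: value is a finite number such as "12.50" (not NaN or Infinity).
    exponent = Decimal(value).as_tuple().exponent
    # The exponent is negative when fractional digits are present: "12.50" -> -2.
    return -exponent if exponent < 0 else 0


def meets_precision(value: str, required_places: int) -> bool:
    """True when the value carries exactly the required number of decimal places."""
    return decimal_places(value) == required_places


def precision_rate(values, required_places):
    """Share of values that meet the required precision, or None for no values."""
    if not values:
        return None
    conforming = sum(meets_precision(v, required_places) for v in values)
    return conforming / len(values)
```

Values are kept as strings here because a float would silently drop trailing zeros, which is exactly the information a precision rule needs.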

According to Tom Redman, the two apparent concepts unique to Accuracy are “Agree with Real World” and “Match to Agreed Source” because the former only works when there are physical objects/phenomena to observe, but in the case of events, an agreed upon source of record is usually needed. English puts it this way: “To measure Information Process Quality, you compare the sampled data to the Characteristics of the Real-World Object or Event that the data represents.”

Though a number of authorities cite correct sourcing as a component of data quality, not all of them place it within Accuracy; some treat it instead as the primary concept within the Consistency dimension. One large insurance company effectively identified “sourcing” as a standalone dimension of data quality, which may work for your organization as well.

Although both Redman and English cite the correct source concept inside of accuracy, TDWI and DMBOK cite it within Consistency. I propose that we move this concept from the dimension of Accuracy and place it within Consistency, as the primary concept. This would give a majority (four out of six) agreement within this dimension as shown in Table 5.

It may be helpful at this point to note that by saying “correct source,” I mean the correct data system or file/table. Based on my review, I didn’t come to the conclusion that existence in reality is a source, but rather a separate concept. This implies that either I compare my data to real-world observation or to a data source — they are not the same thing, even though the data source may agree with the real-world observation.
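To make the “match to agreed source” idea concrete, here is a minimal sketch that compares one field of the data under test against the same field in a system of record. The function name, the dictionary layout and the rule that a record missing from the source counts as a mismatch are all assumptions chosen for illustration. A comparison against real-world observation cannot be automated this way; it would need a sample of independently verified facts in place of the source.

```python
def source_match_rate(records, agreed_source, field):
    """Share of records whose `field` equals the value held in the agreed source.

    records:       dict mapping a record key to a dict of fields (data under test)
    agreed_source: dict of the same shape, treated as the system of record
    A record whose key or field is absent from either side counts as a mismatch.
    Returns None when there are no records to measure.
    """
    if not records:
        return None
    matches = 0
    for key, row in records.items():
        reference = agreed_source.get(key)
        if (
            reference is not None
            and field in row
            and field in reference
            and row[field] == reference[field]
        ):
            matches += 1
    return matches / len(records)
```

Whether a record missing from the source should count against the score is a policy decision for the data governance board; some organizations may prefer to report it separately.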

English has another dimension titled “Source Quality & Security Warranties or Certifications,” composed of the following metrics.
  1. Guarantees Quality: Guarantees the quality of information it provides with remedies for non-compliance.
  2. Documents Certification: Documents its certification in its Information Quality Management capabilities to capture, maintain and deliver Quality Information.
  3. Provides Measures: Provides objective and verifiable measures of the Quality of Information it provides in agreed-upon Quality Characteristics.
  4. Guarantees Unauthorized Access: Guarantees that the Information has been protected from unauthorized access or modification.
Because this proposed dimension isn’t another concept, but rather detail for the consistency/sourcing concept, I’d move the first three of these measures into Consistency. The last one is a concept covered in the next article in this series, on the Accessibility dimension.

As is the case with a couple of the concepts within the dimensions of quality, data security and access controls fall on the line of, or well within, a discipline other than data quality. Information Security is a well-established domain with many more written works and established organizations, certifications, conferences, laws, standards and training curriculums than data quality. For this reason, many people acknowledge this data security concept but, in terms of areas of responsibility, elect to have separate dedicated IT security departments handle these aspects.

As seen in Table 5, although the last two authors (Loshin and Lee, et al.) don’t include the sourcing concept, they do bring additional value through their insights into the concept of Consistency in Representation:
  • Referential: Refers to the consistency of redundant data in one table or in multiple tables.
  • Logical/Structural: Consistency between two related data elements (e.g., city name and postal code).
  • Format: Consistency of format for the same data element used in different tables.
  • Semantic: Consistency of definitions among attributes within a data model.
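Two of the consistency types listed above lend themselves to simple automated checks. The sketch below shows one possible check for logical/structural consistency (city against postal code) and one for format consistency (the same element in different tables). The function names, the lookup table and the regular-expression approach are assumptions made for this example.

```python
import re


def logical_inconsistencies(rows, postal_to_city):
    """Return rows whose city disagrees with the city expected for the postal code.

    postal_to_city is a hypothetical reference lookup; rows with an unknown
    postal code are skipped rather than flagged.
    """
    flagged = []
    for row in rows:
        expected = postal_to_city.get(row["postal_code"])
        if expected is not None and expected != row["city"]:
            flagged.append(row)
    return flagged


def format_inconsistencies(values_by_table, pattern):
    """Map each table name to the values that do not fully match the shared format."""
    compiled = re.compile(pattern)
    return {
        table: [v for v in values if not compiled.fullmatch(v)]
        for table, values in values_by_table.items()
    }
```

For example, passing the pattern `r"\d{4}-\d{2}-\d{2}"` would flag any date stored as `26/07/2013` in a table where the agreed format is `2013-07-26`.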

The word “accuracy” is often used too broadly with regard to data quality. For instance, if we receive a bill for services and it is understated by $5 for parts that were purchased in addition to the services, one might say that the bill is not “accurate.” From a data quality perspective, though, this concept is referred to as completeness: not all the data needed for the intended use is available. By using a word other than “accuracy” for this dimension, we avoid ambiguity and more effectively diagnose the problem.

So far, we have clarified the term “accuracy” to mean “agreement with the real world,” but the word is so universally used that it is almost synonymous with “quality.” After discussing this with Danette McGilvray, we agree that the industry should use something more distinctive that doesn’t mean so many things to everyone. Personally, I find that the word “Factualness” represents the concept well.

  • Redman, Tom. "Data Quality: The Field Guide," Digital Press, 2001.
  • English, Larry. "Information Quality Applied," Wiley Publishing, 2009.
  • TDWI. "Data Quality Fundamentals," The Data Warehousing Institute, 2011.
  • DAMA International. "The DAMA Guide to The Data Management Body of Knowledge" (DAMA-DMBOK Guide), Technics Publications, LLC, 2009.
  • Loshin, David. "The Practitioner's Guide to Data Quality Improvement," Elsevier, 2011.
  • Lee, Yang W., Leo L. Pipino, James D. Funk and Richard Y. Wang. "Journey to Data Quality," MIT Press, 2006.
  • McGilvray, Danette. "Executing Data Quality Projects: Ten Steps to Quality Data and Trusted Information," Morgan Kaufmann, 2008.