
Toward unified standards for measuring data quality: Validity and Integrity (Part 4)

August 16, 2013. Have you ever come across the expression "the statistic is valid (well-founded) and consistent"? Or perhaps you have had to argue with someone from the IT department over the fact that even though a value is not in the list of allowed values, it is still correct (factual)? In this article we propose to extend the list of criteria described in detail in the previous articles of this series and to examine the next data quality criteria: Validity and Integrity. (The article itself is published in English.)


As discussed in the first article, there is relative agreement on the Accuracy dimension, but there is some confusion around the Validity dimension, which is distinctly different. Although people often use the words valid or invalid when they are expressing whether data is factual or not, the words hold different implications when considered in a data management/quality context.

The question at the beginning of this article referred to situations where a value can be valid (within a set of predefined accepted values), like “CA” within the list of U.S. state abbreviations, but inaccurate (not factual). One example may be a piece of mail that is intended for a destination in Alaska, but is mistakenly addressed with “AL” (Alabama).

Conversely, many advanced systems now check that a value is within a set of specified valid values and report errors (or even automatically correct the mistake based on some default logic). In this scenario, a factual value may be rejected if the system doesn’t have that value within its list of expected values. An example of this may be an insurance policy processing system that rejects a homeowner’s address in a state in which the insurer didn’t (until very recently) conduct business. Once the system’s list of valid states has been updated with the new state value, the entry would be factual and recognized as valid.
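
To make the distinction concrete, here is a minimal sketch in Python of a domain-validity check, assuming a hypothetical, truncated US_STATE_CODES set: it reports only whether a value falls within the agreed set of accepted values, not whether it is accurate (factual) for the record at hand.

    # Minimal sketch of a domain-validity check: a value is "valid" if it falls
    # within the accepted domain, regardless of whether it is accurate (factual).
    US_STATE_CODES = {"AK", "AL", "CA", "NY", "TX"}  # hypothetical, truncated domain

    def check_state_code(value: str) -> str:
        """Classify a state abbreviation against the domain of valid values."""
        return "valid" if value in US_STATE_CODES else "invalid"

    # A piece of mail destined for Alaska but keyed as "AL" (Alabama):
    # the value passes the domain check even though it is not accurate.
    print(check_state_code("AL"))  # valid, yet inaccurate for an Alaska address
    print(check_state_code("A1"))  # invalid: not in the accepted domain at all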

As shown in Table 1, there is some consensus on the concepts in this dimension, with some focus on the “Values in Specified Range of Valid Values” concept. Loshin places this within Accuracy, and Lee et al. place it within Integrity (see Table 1), but Loshin’s April 1, 2011 blog post implies his agreement that data validity and data correctness are different concepts.



The process of researching and writing this article has been rewarding for me because, as I suspected, knowledge and agreement improve as authors discuss the concepts and consider the best way to express them. In a discussion with Tom Redman prior to publishing this article, he observed that the concepts of Validity were named Consistency in his book, but that he now prefers the term Validity.

So now that we’ve discussed how we might normalize Validity, let’s turn to Integrity (Table 3). Coming from a data modeling background, I find this dimension the most straightforward and common-sense oriented. I have found that IT departments are better equipped to measure and remedy these Integrity concepts, unlike the valid-value/reference-data management that is often required of business subject matter experts during validity activities.



As seen in Table 4, moving the “Values in Specified Range of Valid Values” (Domain) concept into the Validity dimension allows us to focus on relational concepts specific to Integrity. Four of the six authors reference “Referential Integrity,” with some going further into related components (essentially the tenets of E.F. Codd’s database normalization).

I am unsure why there isn’t greater agreement among authors about the concepts included within Integrity. I suspect that there is a general assumption that these are handled through the data modeling process and, therefore, aren’t explicitly called out here. Most data profiling tools offer functionality to check these concepts. If you are in the market to purchase a profiler, I recommend you validate that the vendor’s solution sufficiently provides this capability.
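
As an illustration, the sketch below shows the kind of referential-integrity (orphan foreign key) check that profiling tools typically automate; the table and column names are hypothetical and not taken from any particular vendor’s product.

    # Minimal sketch of a referential-integrity check: find "orphan" child rows
    # whose foreign key value has no matching row in the parent table.
    policies = [
        {"policy_id": 1, "customer_id": 100},
        {"policy_id": 2, "customer_id": 999},  # orphan: no such customer
    ]
    customers = [{"customer_id": 100, "name": "A. Smith"}]

    def find_orphans(children, fk, parents, pk):
        """Return child rows whose foreign key is absent from the parent key set."""
        parent_keys = {row[pk] for row in parents}
        return [row for row in children if row[fk] not in parent_keys]

    print(find_orphans(policies, "customer_id", customers, "customer_id"))
    # [{'policy_id': 2, 'customer_id': 999}]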



At this point, it should be noted that many authors call out Unwanted Duplication as a separate dimension. The equivalent concept covered by these six authors is named “Unique Identifier of Entity” in Table 4. I believe that because all of Codd’s tenets of normalization can be identified within one dimension named Integrity, we don’t need a distinct dimension for Duplication. Furthermore, duplication, as a concept, isn’t always a data quality problem, because data solutions sometimes intentionally allow a level of intended duplication, while still maintaining unique identifiers (surrogate keys), to improve query performance.
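
As a rough illustration, the following sketch (with hypothetical field names) shows how intended duplication under unique surrogate keys can be distinguished from unwanted duplication on the natural/business key.

    # Minimal sketch: surrogate keys are unique, so uniqueness checks on them pass,
    # while profiling the natural/business key still exposes unwanted duplication.
    from collections import Counter

    rows = [
        {"surrogate_key": 1, "ssn": "123-45-6789", "name": "Jane Doe"},
        {"surrogate_key": 2, "ssn": "123-45-6789", "name": "Jane Doe"},  # same person twice
        {"surrogate_key": 3, "ssn": "987-65-4321", "name": "John Roe"},
    ]

    def duplicate_natural_keys(records, natural_key):
        """Report natural-key values occurring more than once despite unique surrogate keys."""
        counts = Counter(r[natural_key] for r in records)
        return {key: n for key, n in counts.items() if n > 1}

    print(duplicate_natural_keys(rows, "ssn"))  # {'123-45-6789': 2}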



References:

  • Redman, Tom. "Data Quality: The Field Guide," Digital Press, 2001.
  • English, Larry. "Information Quality Applied," Wiley Publishing, 2009.
  • TDWI. "Data Quality Fundamentals," The Data Warehousing Institute, 2011.
  • DAMA International. "The DAMA Guide to the Data Management Body of Knowledge" (DAMA-DMBOK Guide), Technics Publications, LLC, 2009.
  • Loshin, David. "The Practitioner's Guide to Data Quality Improvement," Elsevier, 2011.
  • Lee, Yang W., Leo L. Pipino, James D. Funk, and Richard Y. Wang. "Journey to Data Quality," MIT Press, 2006.
  • McGilvray, Danette. "Executing Data Quality Projects: Ten Steps to Quality Data and Trusted Information," Morgan Kaufmann, 2008.

Source: information-management.com