Toward Creating Unified Standards for Measuring Data Quality: Completeness of Information (Part 3)

August 9, 2013. In the previous two articles we examined the six main criteria for assessing data quality: correctness, accuracy, consistency, reliability, timeliness, and accessibility. Today let us consider a fundamental property of data whose violation can mislead the user: completeness of information. Read on! (The material below is published in English.)
At a high level, Completeness is intuitive. The key to measuring Completeness (or anything in this world, for that matter) is to identify the data's characteristics and then compare those known attributes at a later time to test whether they have changed; in this case, whether they have changed from NULL to NOT NULL or vice versa.
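To make that comparison concrete, here is a minimal Python sketch (an illustration added here, not something from the original article) that flags which columns of a record flipped between NULL and NOT NULL across two snapshots; the item_id key and the sample data are hypothetical, chosen to echo the illustration below.

    def null_flags(rows, key="item_id"):
        """Map each record's key to the set of columns that are NULL/empty."""
        return {
            row[key]: {col for col, val in row.items() if val in (None, "")}
            for row in rows
        }

    def null_transitions(before, after):
        """Per record, the columns that flipped NULL -> NOT NULL or vice versa."""
        flags_before, flags_after = null_flags(before), null_flags(after)
        return {
            k: flags_before[k] ^ flags_after[k]  # symmetric difference of NULL sets
            for k in flags_before.keys() & flags_after.keys()
            if flags_before[k] != flags_after[k]
        }

    # Two snapshots of the same record, taken at different times.
    snapshot_1 = [{"item_id": 1, "item_name": "Widget", "item_price": None}]
    snapshot_2 = [{"item_id": 1, "item_name": "Widget", "item_price": "15.25"}]
    print(null_transitions(snapshot_1, snapshot_2))  # -> {1: {'item_price'}}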

The following illustration of a delimited file transmitted from one system to another shows the two primary types of completeness: row-level and column-level.
[Illustration: a delimited file with one header row and three data rows, showing row-level and column-level completeness]
As you can see in the illustration, the file has four physical rows: a header plus three data rows. Although you could count the file as literally four rows, we usually exclude the header from any counts or sums to avoid confusion. If the header were included in each column's count or sum, how would the literal string 'ITEM_PRICE' be added into the sum of prices ($15.25, $33.12, $24.95)?

When data is moved from one location to another, Completeness is a concern. To detect any loss of data, we need two measures of completeness for a two-dimensional data set (e.g., a table of data); a sketch of both checks follows the list.
  • Row-level Completeness: First and most importantly, we count the physical rows in the file and then count them again after we move the data, to make sure we still have all the observations.
  • Column-level Completeness: Next, we count the values in each column, or aggregate them where logically possible, as seen in the ITEM_PRICE column in the previous illustration. This ensures that, on the whole, the data is the same.
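Here is a minimal Python sketch of both measures for a comma-delimited file like the one in the illustration; the file names, the ITEM_PRICE amount column, and the dollar-sign handling are assumptions made for this example rather than anything prescribed in the article.

    import csv
    from decimal import Decimal

    def completeness_profile(path, amount_column="ITEM_PRICE"):
        """Row count (header excluded), non-null count per column,
        and a control total for the amount column."""
        with open(path, newline="") as f:
            reader = csv.DictReader(f)  # consumes the header row, so it is never counted
            profile = {"rows": 0,
                       "non_null": {name: 0 for name in reader.fieldnames},
                       "amount_total": Decimal("0")}
            for row in reader:
                profile["rows"] += 1
                for name, value in row.items():
                    if value is not None and value.strip():
                        profile["non_null"][name] += 1
                price = (row.get(amount_column) or "").strip()
                if price:
                    profile["amount_total"] += Decimal(price.lstrip("$"))
        return profile

    # Profile the file at the source, move it, profile it at the target, compare.
    source = completeness_profile("items_source.csv")  # hypothetical file names
    target = completeness_profile("items_target.csv")
    assert source == target, f"completeness mismatch: {source} vs {target}"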
Unlike the other dimensions covered so far in this series of articles, there is complete agreement on the primary concept of this dimension: column-level population.

All six authors include column-level population within their definitions. It is one of the key measures produced by data profiling tools (e.g., NULL counts and percentages). Other common mechanisms for controlling completeness are database constraints that enforce nullability and form validation implemented within the application, typically in JavaScript.
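As one concrete example of such a constraint, the following small sketch uses Python's built-in sqlite3 module; the items table and its columns are hypothetical, echoing the earlier illustration.

    import sqlite3

    # A NOT NULL constraint rejects incomplete rows at write time, so
    # column-level completeness is enforced rather than merely measured.
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE items (
            item_id    INTEGER PRIMARY KEY,
            item_name  TEXT NOT NULL,
            item_price TEXT NOT NULL
        )
    """)
    conn.execute("INSERT INTO items VALUES (1, 'Widget', '$15.25')")  # complete row: accepted
    try:
        conn.execute("INSERT INTO items (item_id, item_name) VALUES (2, 'Gadget')")
    except sqlite3.IntegrityError as exc:
        print("Rejected incomplete row:", exc)  # NOT NULL constraint failed: items.item_price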

In the financial services industry, row completeness is also very important (e.g., if a transaction was not recorded or moved, all associated revenues and premiums are likely understated). A consolidated list of concepts within the Completeness dimension must include the column and row levels; in my experience, the schema and table levels are not frequently measured.

References:
  • Redman, Tom. "Data Quality: The Field Guide." Digital Press, 2001.
  • English, Larry. "Information Quality Applied." Wiley Publishing, 2009.
  • TDWI. "Data Quality Fundamentals." The Data Warehousing Institute, 2011.
  • DAMA International. "The DAMA Guide to the Data Management Body of Knowledge (DAMA-DMBOK Guide)." Technics Publications, LLC, 2009.
  • Loshin, David. "The Practitioner's Guide to Data Quality Improvement." Elsevier, 2011.
  • Lee, Yang W., Leo L. Pipino, James D. Funk, and Richard Y. Wang. "Journey to Data Quality." MIT Press, 2006.
  • McGilvray, Danette. "Executing Data Quality Projects: Ten Steps to Quality Data and Trusted Information." Morgan Kaufmann, 2008.

Source:  information-management.com