Toward Creating Unified Standards for Measuring Data Quality: Definition and Representation of Data (Part 5)

August 23, 2013. In the four previous articles, the author described in detail the main criteria for assessing data quality: accuracy, timeliness, accessibility, completeness, validity, and consistency. The last two criteria frequently used when assessing information quality are the Definition and Representation of data.
Often data quality issues are not about the transformation of the data, but rather about awkward or misleading definitions. More often than not, there are no descriptions or report captions at all.

As one of my co-workers pointed out, the challenge isn’t so much about data quality as it is about educating people on what the data means and how to use it, concepts tightly related to Definition and Representation. Sometimes a new training program is all it takes to remove the impression that the data is of poor quality or fitness.

These dimensions present a challenge because of the similarity between the two, so let’s first review the concepts presented by the six data quality authors discussed in this series for each area and then normalize what we find. Three authors identify the “Definition” dimension, but because we already covered the “Values Consistent with Definition” concept proposed by Loshin and English in the Consistency dimension, we only need to deal with two concepts provided by Redman:
  • Clear, easy to understand definition.
  • Includes measurement units.
Redman’s two concepts fit naturally within the Representation dimension, since a data element’s definition is logically part of how that data is represented to its consumers. Table 1 outlines the concepts within the “Representation” dimension cited by three authors.

[Table 1. Concepts within the Representation dimension cited by three authors]

Adding Redman’s Definition concepts to the Representation dimension, and removing the Accessibility dimension cited by TDWI (which we already addressed in part 2 of this series), gives us a comprehensive new Representation dimension, as seen in Table 2. Redman also goes further, introducing a dimension titled Relevance: “Data are relevant to a particular task or decision if they contribute to the completion of that task or making of the decision” (Redman, 226). One might propose including this concept within Representation, but Relevance falls outside the scope of DQ defined as “fitness for use”: if data isn’t intended for a use, its relevance never comes into question.

Another authority in the data quality space, Danette McGilvray, also adds a “Data Specifications” dimension (defined as the measure of the existence, completeness, quality, and documentation of data standards, data models, business rules, metadata, and reference data) that “…provides the standard against which to compare data quality assessment results. They also provide instruction for manually entering data, designing data load programs, updating information, and developing applications” (McGilvray, 31). The definition (Representation), format and derivation (Validity), and data load and data model (Integrity) aspects have been covered in previous articles in this series. That said, integrating data quality standards into IT requirements is absolutely critical to success, but it has to be done per stakeholder group, because fitness for use may differ by consumer or even by business process.

[Table 2. The consolidated Representation dimension, incorporating Redman’s Definition concepts]

I suspect that the DMBOK authors opted not to call “Representation” out explicitly as a dimension of quality because a whole chapter of the DMBOK is devoted to metadata management. Also, Lee and Wang cite a category of information quality, “Representational IQ,” in their 1997 paper titled “10 Potholes in the Road to Information Quality,” where they list “Interpretability, Ease of Understanding, Concise Representation, Consistent Representation,” so there is more agreement than may first appear.

Loshin added another dimension that we haven’t covered, called Lineage. He defines it as the “Originating data source,” with the additional clarification: “All data elements will include an attribute identifying its original source and date. All updated data elements will include an identifier for the source of the update and a date. Audit trails of all provenance data will be kept and archived” (Loshin, 136).
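
To make Loshin's requirement concrete, here is a minimal sketch in Python, with hypothetical class and field names, of a data element that carries its originating source and date and appends an audit-trail entry on every update. It illustrates one possible shape for this provenance metadata, not a prescribed implementation.

    from dataclasses import dataclass, field
    from datetime import date
    from typing import List

    @dataclass
    class ProvenanceEntry:
        # One audit-trail record: who supplied the update and when.
        source: str
        changed_on: date

    @dataclass
    class DataElement:
        # A data element stamped with Loshin-style lineage attributes
        # (hypothetical structure, for illustration only).
        name: str
        value: object
        original_source: str   # originating data source
        original_date: date    # date the value first arrived
        audit_trail: List[ProvenanceEntry] = field(default_factory=list)

        def update(self, new_value: object, source: str, changed_on: date) -> None:
            # Apply an update and record its provenance, per Loshin's rule.
            self.value = new_value
            self.audit_trail.append(ProvenanceEntry(source, changed_on))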

In a broader sense, lineage can imply the collection of all metadata about where data came from and how it was transformed along the way. Until now, I have normalized the dimensions recommended by each author primarily by taking a majority-consensus approach. For lineage, which only Loshin calls out, I believe that even though a single author has identified it, we should include it in an industry set of DQ dimensions. When I interviewed Tom Redman, he also agreed that this is a unique contribution Loshin brought to the field.

Lineage is a valid dimension of data quality because:
  1. Lineage identifies risk not captured by other dimensions of data quality. (For instance, a large number of segments/transformations increases the risk that the data was incorrectly changed along the way, which helps practitioners measure and prioritize DQ issues.)
  2. Lineage identifies cost not captured by other dimensions of data quality. (As an example, various stakeholders may consume the same sales data, deriving it in tens of unique ways, reducing consistency and increasing complexity, both of which are IT cost drivers.)
Lineage, like other dimensions of quality, can be used in conjunction with other dimensions to add value. For example, by providing a lineage of the data from end to end with embedded completeness measures for each segment, one can evaluate the total completeness, inclusive of all movement and transformation.
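
As a rough illustration of this idea, the sketch below (Python; the function name and scores are hypothetical) rolls per-segment completeness measures into a single end-to-end figure, under the simplifying assumption that each segment independently loses its share of records, so the scores compound multiplicatively.

    from math import prod

    def end_to_end_completeness(segment_scores):
        # Each score is the fraction of expected records that survived one
        # segment; assuming losses are independent per segment, the
        # end-to-end completeness is the product of the per-segment scores.
        return prod(segment_scores)

    # Hypothetical three-segment flow: extract, staging load, mart load.
    print(end_to_end_completeness([0.99, 0.97, 0.95]))  # about 0.912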

To flesh out this dimension, I have outlined the concepts I found, along with a few attributes, as noted below. Figure 1 will look very familiar to anyone who has worked with the ETL processes used in data warehousing; it illustrates the data flow from beginning to end.

Concepts:

1) Segment: Movement or transformation of data having a beginning point, called a source, and an end-point, called a target.

Attributes:
a) Derivation code (e.g., SQL, RegEx … etc.)
b) Derivation description (typically pseudo-code or a simple plain-English fragment)
c) Derivation type (e.g., pass through versus derived)

2) Source: Beginning point of data movement or transformation.

Attributes:
a) Source level [e.g., primary source (1st), secondary source (2nd) … nth]
b) Source system name
c) Source type or technology

3) Target: End point of data movement or transformation.

Attributes:
a) Target level (see source level)
b) Target system name
c) Target type or technology

4) End-to-end: The multisegment definition of data movement or transformation, inclusive of all intermediate segments required to deliver the data.

Attributes:
a) Total number of segments
b) Average {dimension of DQ} (e.g., Average Completeness for three segments)
c) Certification (measure of how thoroughly systems integration testing has been conducted)
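
Taken together, these concepts suggest a simple lineage metadata model. The following sketch (Python, with hypothetical class and field names) mirrors the attributes above, including end-to-end roll-ups such as total segment count and the average of a chosen DQ dimension; treat it as illustrative rather than a standard schema.

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class Endpoint:
        # A Source or Target: level (1 = primary, 2 = secondary, ...),
        # system name, and type or technology.
        level: int
        system_name: str
        technology: str

    @dataclass
    class Segment:
        # One movement or transformation from a source to a target.
        source: Endpoint
        target: Endpoint
        derivation_code: str         # e.g., a SQL fragment or a regex
        derivation_description: str  # pseudo-code or plain-English summary
        derivation_type: str         # "pass-through" or "derived"
        completeness: float          # measured DQ score for this segment

    @dataclass
    class EndToEnd:
        # The multisegment lineage, with roll-up attributes.
        segments: List[Segment]
        certified: bool              # systems-integration testing complete?

        @property
        def total_segments(self) -> int:
            return len(self.segments)

        @property
        def average_completeness(self) -> float:
            # "Average {dimension of DQ}" across all segments, shown here
            # for completeness; assumes at least one segment exists.
            return sum(s.completeness for s in self.segments) / len(self.segments)

With per-segment scores captured this way, roll-ups like the average above (or the multiplicative total shown earlier) fall out directly.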

In conclusion, we can normalize Definition into the Representation dimension, listing its five concepts (1. Easy to Read and Interpret; 2. Presentation Language; 3. Media Appropriate; 4. Includes Measurement Units; and 5. Complete and Available Metadata). In addition, I have given reasons why Loshin's Lineage dimension should be included in the industry list of DQ dimensions, providing concepts and example attributes. The next article, the last in this series, will compile all of my recommendations into a single industry-standard list with basic definitions and concepts.



All references to authors’ works come from the following sources:
  • Redman, Tom. "Data Quality: The Field Guide." Digital Press, 2001.
  • English, Larry. "Information Quality Applied." Wiley Publishing, 2009.
  • TDWI. "Data Quality Fundamentals." The Data Warehousing Institute, 2011.
  • DAMA International. "The DAMA Guide to the Data Management Body of Knowledge" (DAMA-DMBOK Guide). Technics Publications, LLC, 2009.
  • Loshin, David. "The Practitioner's Guide to Data Quality Improvement." Elsevier, 2011.
  • Lee, Yang W., Leo L. Pipino, James D. Funk, and Richard Y. Wang. "Journey to Data Quality." MIT Press, 2006.
  • McGilvray, Danette. "Executing Data Quality Projects: Ten Steps to Quality Data and Trusted Information." Morgan Kaufmann, 2008.

Source: information-management.com