Quality is a complex, yet critical theme in data production and use. Individuals using data for organizational decisionmaking, program evaluation, or research must understand the quality of the information they rely upon. A host of related concepts, including a wide range of quality metrics, are often used as metadata for assessing and tracking the quality of a data element or data set. One measure that directly assesses a data set's quality is identity, which is used to determine whether every "item" (e.g., a person, place, concept, or event) is uniquely identifiable and distinguishable from all other entities in a data set. Identity analysis frequently addresses the following types of issues:
Accuracy and reliability are also directly related to data quality. Accuracy metrics determine the extent to which data measure what they purport to measure without bias. In other words, how well do the data correspond to the process or product being assessed? For example, an accuracy metric could help determine whether an exam assesses academic performance without introducing bias. Reliability, on the other hand, refers to the consistency, reproducibility, and dependability of the data. If the same item were measured multiple times, would the same results be generated? Reliability may reflect uncertainty in a measurement tool or the amount of random error naturally present in the data.
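The test-retest notion of reliability above can be sketched with a simple spread statistic. This is an illustrative assumption, not a method from the source: the repeated scores, and the use of the coefficient of variation as the consistency measure, are both hypothetical choices.

```python
from statistics import mean, pstdev

# Hypothetical repeated measurements of the same item (e.g., three
# scorings of one exam). Real reliability studies use larger samples.
repeated_scores = [78.0, 79.0, 77.5]

def relative_spread(values):
    """Coefficient of variation: the spread of repeated measurements
    relative to their mean. Smaller values suggest more consistent,
    and thus more reliable, measurement."""
    return pstdev(values) / mean(values)

cv = relative_spread(repeated_scores)
```

A spread near zero indicates that repeated measurements of the same item generate nearly the same result, which is the reliability question the paragraph poses.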
Completeness measures the degree to which required records and values exist in a given data set. For example, if individual student records are being transferred, the record set is considered "complete" when a unique record exists for each student in the group; if there are 200 students, the record set is complete if there are 200 unique records. Similarly, if there are 50 mandatory items or fields in each individual student record, a record is complete when each of the 50 fields has an entry. Because completeness is determined by having an entry in each field, all data items must be completed unless a skip pattern (or similar tool) is used for items that need not be completed.
The inverse of completeness is sparsity, a measure of the lack of data, as when only four of nine required fields contain values. When data are too sparse, assessing what they mean becomes difficult.
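The completeness and sparsity measures described above can be sketched as a single ratio over mandatory fields. The field names and records below are hypothetical, not from a real schema:

```python
# Hypothetical mandatory fields; a real collection would define these
# in metadata or business rules.
MANDATORY_FIELDS = ["student_id", "name", "grade"]

records = [
    {"student_id": "S001", "name": "Avery", "grade": 4},
    {"student_id": "S002", "name": "", "grade": None},  # two missing values
]

def field_completeness(records, fields):
    """Fraction of mandatory fields, across all records, that hold a value."""
    total = len(records) * len(fields)
    filled = sum(
        1 for r in records for f in fields
        if r.get(f) not in (None, "")
    )
    return filled / total

completeness = field_completeness(records, MANDATORY_FIELDS)
sparsity = 1 - completeness  # the inverse measure described above
```

Record-level completeness (does a unique record exist for each of the 200 students?) would be a separate count against the expected roster, but the field-level ratio above captures the "50 mandatory fields" case.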
Value set testing examines the content of data fields to ensure that each data value falls within the expressed domain of allowable values. Allowable values (e.g., the age of all students in an elementary grade level must be between the values of five and twelve) are often based on business rules and other guidelines and standards expressed in metadata. The frequency or rate of domain violations and percentage of defective values are the most common measures of value set integrity. Coherence complements value set testing by providing a measure of value conflicts across related data sets. In other words, not only do data fall within a range of allowable values (value set testing), but data that should be identical in different data sets are indeed the same. For example, are student counts on an annual enrollment collection consistent with student counts in an annual dropout report?
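A minimal sketch of a value set test and a coherence cross-check, using the allowable age range from the example above. The ages and the two counts are illustrative values, not real data:

```python
# Value set test: ages in an elementary grade must fall within the
# allowable range of 5 to 12 (the business rule from the example above).
ages = [6, 7, 11, 14, 8]  # hypothetical entered values

violations = [a for a in ages if not 5 <= a <= 12]
violation_rate = len(violations) / len(ages)  # rate of domain violations

# Coherence test: values that should be identical across related data
# sets are compared. The counts here are illustrative.
enrollment_count = 200      # from the annual enrollment collection
dropout_report_count = 198  # from the annual dropout report
coherent = enrollment_count == dropout_report_count
```

The violation rate corresponds to the "frequency or rate of domain violations" measure; the equality check is the simplest form of the cross-data-set comparison the paragraph describes.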
Another facet of data quality is continuity analysis, which typically is performed to confirm a consecutive, non-overlapping, and unbroken history of the events represented by the data. For example, continuity analysis might assess whether daily membership data are available for each school day (and, in fact, only once for each day) in an academic year prior to generating an average daily membership for the entire year. If average daily membership were to be generated for each grading period, these data would need to be available consecutively from the first through the last school day of the grading period. Common continuity measures include the ratio of entities with a defective history to those with a defect-free history. More complex measures examine the size of the gap or overlap when defects occur.
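The gap and overlap checks described above might be sketched as follows. The calendar dates are illustrative; a real check would compare against the district's official school-day calendar:

```python
from datetime import date, timedelta

# Hypothetical expected school days (one week) and the daily membership
# dates actually reported.
school_days = [date(2024, 9, 2) + timedelta(days=i) for i in range(5)]
reported_days = [date(2024, 9, 2), date(2024, 9, 3),
                 date(2024, 9, 5), date(2024, 9, 6)]  # Sep 4 is missing

def continuity_defects(expected, reported):
    """Return expected dates with no report (gaps) and dates reported
    more than once (overlaps)."""
    gaps = [d for d in expected if d not in reported]
    overlaps = sorted(d for d in set(reported) if reported.count(d) > 1)
    return gaps, overlaps

gaps, overlaps = continuity_defects(school_days, reported_days)
```

An entity with any gap or overlap would count as "defective" in the ratio measure the paragraph mentions; the size of each gap is simply the number of consecutive missing dates.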
Contiguity testing further assesses the logical sequencing of data in a data set. For example, contiguity measures might be used to assess whether the date a student passes the state's exit exam always occurs prior to the date of graduation. Contiguity evaluation generally is based on business rules, as well as other guidelines and standards expressed in metadata, to define the logic against which data are assessed. Typical contiguity measures include the ratio of entities with a defective history to entities with a defect-free history. More complex measures examine the frequency with which particular steps in a required sequence are skipped or recorded out of order.
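A sketch of the exit-exam/graduation sequencing rule from the example. The field names and dates are assumptions for illustration, not from a real schema:

```python
from datetime import date

# Business rule: the date a student passes the exit exam must occur
# before the graduation date.
students = [
    {"id": "S001", "exam_passed": date(2024, 3, 1), "graduated": date(2024, 6, 1)},
    {"id": "S002", "exam_passed": date(2024, 7, 1), "graduated": date(2024, 6, 1)},
]

def out_of_sequence(students):
    """IDs of students whose exam date does not precede graduation."""
    return [s["id"] for s in students if s["exam_passed"] >= s["graduated"]]

defective = out_of_sequence(students)
# Ratio of entities with a defective history to those with a defect-free
# history (assumes at least one defect-free entity).
defect_ratio = len(defective) / (len(students) - len(defective))
```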
Currency refers to the age or "freshness" of the data—that is, how "current" they are. Currency usually represents the time difference between the present date and the date when data were entered in the database. It often is measured in terms of the gap (the number of hours, days, months, or years) between the current date and the date of the most recent data available. This type of information is most important when great changes in data values can occur over short periods of time, or when data are used routinely but not collected very frequently. The effect on the end user can be significant—for example, a user should know if the "latest" enrollment data were collected eight months previously.
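The currency gap can be computed directly as a date difference. The dates below are hypothetical stand-ins for the current date and the most recent entry date:

```python
from datetime import date

# Hypothetical dates: the "latest" enrollment data were entered
# roughly eight months before the present date.
today = date(2024, 9, 1)       # stand-in for the current date
last_entry = date(2024, 1, 1)  # date of the most recent data available

currency_gap_days = (today - last_entry).days
```

In practice `today` would come from the system clock and `last_entry` from a load timestamp recorded as metadata.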
An extension of currency is punctuality, which is a measure of how quickly access is provided to recent data. For example, if student addresses are updated in May, when are they available to the transportation office for planning the following school year's bus routes? Punctuality is sometimes referred to as timeliness (are the data available for use when needed?) and may also be used to establish schedules that describe when new data can be expected. The punctuality measure may vary for the same set of data depending on audience type; for example, a data set may be available for internal planning purposes more quickly than for external reporting.
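Punctuality by audience type can be measured the same way as currency, as a lag between the update date and the access date. All dates below are illustrative assumptions:

```python
from datetime import date

# Hypothetical schedule: addresses are updated in May but become
# available to different audiences at different times.
updated = date(2024, 5, 15)
available_internal = date(2024, 5, 20)  # internal planning access
available_external = date(2024, 7, 1)   # external reporting access

punctuality_internal = (available_internal - updated).days
punctuality_external = (available_external - updated).days
```

The two lags illustrate the paragraph's point that the same data set can have different punctuality measures for internal and external audiences.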
Data verification is the practice of confirming that data are accurate, and data validation refers to the practice of confirming that data agree with expectations of reasonable values and accepted norms. These are integral concepts in the production of quality data (see exhibit 3.4). Metadata can document the results of various statistical and procedural techniques used to verify and validate data. These include response and documentation audits, such as an examination of records that substantiate data submitted by a respondent; cross-checks, which refer to the practice of "checking" data from different collections for consistency; and value edits, which, for example, can compare entered data to maximum or minimum expected values. Exhibit 3.3 provides a real-world example of how these metadata concepts might be applied to a data element in a metadata system.
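A value edit of the minimum/maximum kind mentioned above could be sketched as a lookup of expected bounds per field. The field names and bounds are hypothetical:

```python
# Hypothetical value edits: each field maps to its expected
# (minimum, maximum) range.
EDITS = {"age": (5, 12), "days_enrolled": (0, 180)}

def value_edit_errors(record, edits):
    """Return the fields whose entered values fall outside the
    expected minimum/maximum bounds."""
    errors = []
    for field, (lo, hi) in edits.items():
        if field in record and not lo <= record[field] <= hi:
            errors.append(field)
    return errors

errors = value_edit_errors({"age": 14, "days_enrolled": 175}, EDITS)
```

A flagged field would typically be routed back to the respondent for verification rather than corrected automatically, since the edit identifies an unreasonable value but not the correct one.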