SUBJECT: DATA EDITING AND IMPUTATION OF ITEM NONRESPONSE
NCES STANDARD: 4-1
PURPOSE: To establish guidelines to reduce potential bias, ensure consistent
estimates, and simplify analysis by substituting values for missing data
(i.e., imputation) or for inconsistent data (i.e., edits) in a data set.
KEY TERMS: cross-sectional, cross-sectional imputations, cross-wave
imputations, edit, freshened sample, imputation, item nonresponse, key
variables, longitudinal, nonresponse bias, response rate, stage of data
collection, and universe.
STANDARD 4-1-1: Prior to imputation the data must be edited. Data editing
is an iterative and interactive process that includes procedures for
detecting and correcting errors in the data. Data editing must be repeated
after the data are imputed, and again after the data are altered during
disclosure risk analysis. At each stage, the data must be checked for:
- Credibility based on range checks to determine if all responses fall within
a prespecified reasonable range.
- Consistency based on checks across variables within individual records for
noncontradictory responses and for correct flow through prescribed skip patterns.
- Completeness based on the amount of nonresponse, including efforts to fill
in missing data directly from other portions of an individual's record.
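The three checks above can be sketched in code. The following is a minimal illustration only, not a prescribed NCES procedure: the item names (age, enrolled, grade), the ranges, and the skip rule are all hypothetical and chosen just to show one check of each type.

```python
# Hypothetical items and ranges; none of these names come from the standard.
RANGE_RULES = {"age": (3, 21), "grade": (0, 12)}


def edit_checks(record):
    """Return a list of edit failures for one record (a dict of item -> value)."""
    failures = []

    # Credibility: every reported value must fall within its prespecified range.
    for item, (low, high) in RANGE_RULES.items():
        value = record.get(item)
        if value is not None and not (low <= value <= high):
            failures.append(f"range:{item}")

    # Consistency: an assumed skip pattern. If the respondent is not enrolled,
    # the follow-up item "grade" should have been skipped (left as None).
    if record.get("enrolled") is False and record.get("grade") is not None:
        failures.append("skip:grade")

    # Completeness: list items that should have been answered but are missing.
    for item in ("age", "enrolled"):
        if record.get(item) is None:
            failures.append(f"missing:{item}")
    if record.get("enrolled") is True and record.get("grade") is None:
        failures.append("missing:grade")

    return failures


records = [
    {"age": 10, "enrolled": True, "grade": 4},
    {"age": 45, "enrolled": False, "grade": 6},
    {"age": None, "enrolled": True, "grade": None},
]
results = [edit_checks(r) for r in records]
```

Because editing is iterative, a function of this kind would be run again after imputation and again after any disclosure-related alteration of the data.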
STANDARD 4-1-2: Key variables in data sets used for cross-sectional
estimates must be imputed (beyond overall mean imputation). This applies
to cross-sectional data sets and to data from longitudinal data sets that
are used to produce cross-sectional estimates (i.e., base year and
subsequent freshened samples). (See Appendix B for a discussion of
alternative imputation procedures, including the pros and cons of specific
approaches.)
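As one illustration of a procedure that goes beyond overall mean imputation, the sketch below fills a missing value with a value drawn at random from a responding record in the same imputation cell (a simple random hot deck). It is an assumption-laden example, not a method endorsed by this Standard: the function name, the flag codes, the school records, and the choice of "level" as the cell variable are all hypothetical.

```python
import random


def hot_deck_impute(records, item, cell_keys, seed=0):
    """Fill missing `item` values from a random donor in the same cell.

    Returns new records with an added flag field (hypothetical codes):
    0 = reported, 1 = hot-deck imputed, 9 = still missing (no donor in cell).
    """
    rng = random.Random(seed)

    # Collect reported values (potential donors) by imputation cell.
    donors = {}
    for rec in records:
        if rec[item] is not None:
            cell = tuple(rec[k] for k in cell_keys)
            donors.setdefault(cell, []).append(rec[item])

    out = []
    for rec in records:
        new = dict(rec)  # leave the input records unchanged
        cell = tuple(rec[k] for k in cell_keys)
        if rec[item] is not None:
            new[item + "_flag"] = 0
        elif donors.get(cell):
            new[item] = rng.choice(donors[cell])
            new[item + "_flag"] = 1
        else:
            new[item + "_flag"] = 9
        out.append(new)
    return out


schools = [
    {"id": 1, "level": "elementary", "enrollment": 300},
    {"id": 2, "level": "elementary", "enrollment": 340},
    {"id": 3, "level": "elementary", "enrollment": None},
    {"id": 4, "level": "secondary", "enrollment": 1200},
    {"id": 5, "level": "secondary", "enrollment": None},
    {"id": 6, "level": "combined", "enrollment": None},
]
imputed = hot_deck_impute(schools, "enrollment", ["level"])
```

Drawing donors within cells keeps imputed values closer to those of similar units than a single overall mean would. A cell with no donors is left missing here rather than filled from an unrelated cell.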
GUIDELINE 4-1-2A: In census (universe) data collections, it may not be
appropriate to impute data in certain situations (e.g., peer analysis
situations or when data for a particular establishment, such as a school,
university, or library, are being examined individually).
GUIDELINE 4-1-2B: When using non-NCES data sets, it is desirable to
impute for missing data in those items being used in NCES publications.
This is only appropriate when adequate auxiliary information is available.
GUIDELINE 4-1-2C: Imputation procedures should be internally consistent,
be based on theoretical and empirical considerations, be appropriate for
the analysis, and make use of the most relevant data available. If
multivariate analysis is anticipated, care must be taken to use imputations
that minimize the attenuation of underlying relationships. The Chief
Statistician should review imputation plans prior to implementation.
STANDARD 4-1-3: In the case of longitudinal data sets, two imputation
approaches are acceptable for completing missing data for longitudinal
analysis: cross-wave imputations or cross-sectional imputations.
(Guideline 4-1-2C of this Standard applies here as well.)
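To make the distinction concrete, the sketch below shows one very simple form of cross-wave imputation: carrying forward a unit's own value from the prior wave. This is only an illustration under stated assumptions. Carry-forward is one of several possible cross-wave techniques and is plausible only for items that are stable over time. The function name, flag labels, and data layout are hypothetical. Units with no prior-wave value are marked for cross-sectional imputation instead.

```python
def cross_wave_impute(prior_wave, current_wave, item):
    """Carry forward a unit's prior-wave value when the current value is missing.

    Both waves are dicts mapping unit id -> dict of item values.
    """
    result = {}
    for unit_id, rec in current_wave.items():
        new = dict(rec)
        if rec.get(item) is not None:
            new[item + "_flag"] = "reported"
        else:
            prior = prior_wave.get(unit_id, {}).get(item)
            if prior is not None:
                new[item] = prior
                new[item + "_flag"] = "cross-wave"
            else:
                # No usable history (e.g., a freshened-sample unit):
                # fall back to a cross-sectional method.
                new[item + "_flag"] = "needs cross-sectional"
        result[unit_id] = new
    return result


wave1 = {"A": {"home_language": "Spanish"}, "B": {"home_language": None}}
wave2 = {
    "A": {"home_language": None},
    "B": {"home_language": None},
    "C": {"home_language": "English"},
}
completed = cross_wave_impute(wave1, wave2, "home_language")
```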
STANDARD 4-1-4: In those cases where a nonresponse bias analysis shows
that the data are not missing at random, the amount of potential bias must
inform the decision to retain or delete individual items (see Standard 4-4).
STANDARD 4-1-5: In cases where imputation is not used (e.g., items that
are not key variables in either cross-sectional or longitudinal analysis),
data tables must include a reference to a methodology table or glossary
that shows the actual weighted response rates for each unimputed variable
included in the report (see Standard 1-3 for the item response rate
formula). For individual variables with item response rates less than
85 percent, the variable must be footnoted in the row or column header.
The footnote must alert readers to the fact that the response rate is
below 85 percent and that missing data have not been explicitly accounted
for in the data.
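The 85 percent rule can be sketched as follows. The item response rate formula itself is defined in Standard 1-3 and is not reproduced here, so the code assumes a common form: the weighted count of respondents who answered the item divided by the weighted count of respondents eligible to answer it. The field names, the LEGIT_SKIP code, and the sample records are hypothetical.

```python
LEGIT_SKIP = "legit_skip"  # hypothetical code for a valid skip (not eligible)


def weighted_item_response_rate(records, item):
    """Assumed form: weighted answered / weighted eligible for the item."""
    eligible = [r for r in records if r[item] != LEGIT_SKIP]
    total = sum(r["weight"] for r in eligible)
    answered = sum(r["weight"] for r in eligible if r[item] is not None)
    return answered / total if total else float("nan")


def items_needing_footnote(records, items, threshold=0.85):
    """Unimputed items whose weighted response rate falls below the threshold."""
    return [
        item for item in items
        if weighted_item_response_rate(records, item) < threshold
    ]


respondents = [
    {"weight": 100, "age": 10, "income": 40000},
    {"weight": 50, "age": 12, "income": None},
    {"weight": 50, "age": 11, "income": 52000},
    {"weight": 200, "age": 9, "income": LEGIT_SKIP},
]
income_rate = weighted_item_response_rate(respondents, "income")  # 150 / 200
footnoted = items_needing_footnote(respondents, ["age", "income"])
```

In this example the income item has a weighted response rate of 0.75, so it would carry a footnote, while the age item would not.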
STANDARD 4-1-6: When imputations are used, documentation indicating
the weighted proportion of imputed data must be presented for all published
estimates based on NCES data. Information about the amount of imputed
data in the analysis can be included in the technical notes and does
not have to accompany each table. The range of the amount of imputation
used for the set of items included in an analysis must be reported.
Also, the amount of imputation must be reported for items with response
rates less than 70 percent. Items with response rates lower than
70 percent must be footnoted in the tables.
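A minimal sketch of the reporting quantities named above, assuming each item has a flag field in which 0 means reported and any other code means imputed, and assuming every missing value was imputed so that the item response rate equals one minus the imputed share. The field names and records are hypothetical.

```python
def weighted_share_imputed(records, flag):
    """Weighted proportion of records whose flag marks the value as imputed."""
    total = sum(r["weight"] for r in records)
    imputed = sum(r["weight"] for r in records if r[flag] != 0)
    return imputed / total


def imputation_summary(records, flags, threshold=0.70):
    """Range of imputed shares across items, plus items to footnote."""
    shares = {f: weighted_share_imputed(records, f) for f in flags}
    return {
        "shares": shares,
        "range": (min(shares.values()), max(shares.values())),
        # Assumption: response rate = 1 - imputed share (all missing imputed).
        "footnote": [f for f, s in shares.items() if 1 - s < threshold],
    }


data = [
    {"weight": 10, "age_flag": 1, "income_flag": 0},
    {"weight": 10, "age_flag": 0, "income_flag": 1},
    {"weight": 20, "age_flag": 0, "income_flag": 0},
    {"weight": 60, "age_flag": 0, "income_flag": 1},
]
summary = imputation_summary(data, ["age_flag", "income_flag"])
```

Here the imputed share ranges from 10 percent (age) to 70 percent (income). That range would appear in the technical notes, and income would also be footnoted in the tables.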
STANDARD 4-1-7: All imputed values on a data file must be clearly
identified as such.
GUIDELINE 4-1-7A: Imputed data should be flagged in associated "flag"
fields. The imputation method should be identified in the flag. Blanks are
not legitimate values for flags.
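One way to follow this guideline is to give each flag a small set of method codes and to reject blanks. The sketch below is illustrative only: the code values, the method names, and the "_flag" naming convention are assumptions, not codes defined by NCES.

```python
from enum import IntEnum


class ImputeFlag(IntEnum):
    """Hypothetical flag codes; each nonzero code names an imputation method."""
    NOT_IMPUTED = 0
    LOGICAL = 1      # filled from other portions of the same record
    HOT_DECK = 2
    CROSS_WAVE = 3


def invalid_flags(records, item):
    """Return positions of records whose flag is blank or not a known code."""
    flag = item + "_flag"
    valid = {code.value for code in ImputeFlag}
    return [i for i, rec in enumerate(records) if rec.get(flag) not in valid]


rows = [
    {"income": 40000, "income_flag": ImputeFlag.NOT_IMPUTED},
    {"income": 38000, "income_flag": ImputeFlag.HOT_DECK},
    {"income": 41000, "income_flag": ""},    # blank flag: not legitimate
    {"income": 39000},                       # flag field missing entirely
]
bad = invalid_flags(rows, "income")
```

A check like this can run as part of the post-imputation edit pass, so that every value on the file has an explicit flag.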
STANDARD 4-1-8: If nonimputed items are used in the estimation of totals
or ratios (as in Standard 4-1-5 above), the risks of not using imputed data
must be described.
- Estimated totals using nonimputed data implicitly impute a zero value for
all missing data. These implicit zero imputations mean that the estimated
totals will underestimate the true population totals. Thus, when reporting
totals based on a nonimputed item, the response rate for that item must be
footnoted in the data table.
- Ratios (averages) using nonimputed data will implicitly impute the cell ratio
for all missing data within the cell. This can cause inconsistencies in the
estimates between tables.
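A small worked example, using hypothetical weights and values, shows both effects. The total computed from respondents alone is identical to the total obtained by filling every missing value with zero, and the respondent mean is identical to the mean obtained by filling every missing value with the cell mean. One possible inconsistency between tables then appears when the published total is divided by the full weighted count.

```python
records = [
    {"weight": 100, "staff": 20},
    {"weight": 100, "staff": 30},
    {"weight": 100, "staff": None},
    {"weight": 100, "staff": None},
]
reported = [r for r in records if r["staff"] is not None]
all_weight = sum(r["weight"] for r in records)          # 400
reported_weight = sum(r["weight"] for r in reported)    # 200

# Total from nonimputed data: missing records contribute nothing.
total_nonimputed = sum(r["weight"] * r["staff"] for r in reported)  # 5000

# The same total results from explicitly imputing zero for missing values.
total_zero_filled = sum(
    r["weight"] * (r["staff"] if r["staff"] is not None else 0)
    for r in records
)  # 5000

# Mean (ratio) from nonimputed data.
mean_nonimputed = total_nonimputed / reported_weight  # 25.0

# The same mean results from imputing the cell mean for missing values.
mean_cell_filled = sum(
    r["weight"] * (r["staff"] if r["staff"] is not None else mean_nonimputed)
    for r in records
) / all_weight  # 25.0

# Dividing the nonimputed total by the full weighted count gives a
# different figure than the nonimputed mean.
mean_implied_by_total = total_nonimputed / all_weight  # 12.5
```

If the two missing units actually resembled the respondents, the true total would be near 10,000 rather than 5,000, which is the underestimate described in the first bullet.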