Statistical Standards
Statistical Standards Program
 
PROCESSING AND EDITING OF DATA

SUBJECT: DATA EDITING AND IMPUTATION OF ITEM NONRESPONSE

NCES STANDARD: 4-1

PURPOSE: To establish guidelines that reduce potential bias, ensure consistent estimates, and simplify analysis by substituting values for missing data (i.e., imputation) or for inconsistent data (i.e., edits) in a data set.

KEY TERMS: cross-sectional, cross-sectional imputations, cross-wave imputations, edit, freshened sample, imputation, item nonresponse, key variables, longitudinal, nonresponse bias, response rate, stage of data collection, and universe.


STANDARD 4-1-1:
Prior to imputation, the data must be edited. Data editing is an iterative and interactive process that includes procedures for detecting and correcting errors in the data. Data editing must be repeated after the data are imputed, and again after the data are altered during disclosure risk analysis. At each stage, the data must be checked for:

  1. Credibility based on range checks to determine if all responses fall within a prespecified reasonable range.
     
  2. Consistency based on checks across variables within individual records for noncontradictory responses and for correct flow through prescribed skip patterns.
     
  3. Completeness based on the amount of nonresponse, including efforts to fill in missing data directly from other portions of an individual's record.
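
The three checks above can be sketched as a single record-level edit function. This is a minimal illustration, not an NCES procedure: the item names (`age`, `enrolled`, `grade`), the ranges, and the skip rule are all hypothetical, and a production edit system would also handle the correction step, which is omitted here.

```python
from typing import Any

# Hypothetical edit specifications; a real survey defines its own ranges.
RANGES = {"age": (5, 21), "grade": (1, 12)}


def edit_record(record: dict[str, Any]) -> list[str]:
    """Return the edit failures for one record (an empty list means it passes)."""
    failures = []

    # 1. Credibility: each reported value must fall within its prespecified range.
    for item, (low, high) in RANGES.items():
        value = record.get(item)
        if value is not None and not (low <= value <= high):
            failures.append(f"range:{item}")

    # 2. Consistency: hypothetical skip pattern in which only enrolled
    #    respondents are routed to the 'grade' item.
    if record.get("enrolled") == "no" and record.get("grade") is not None:
        failures.append("skip:grade")

    # 3. Completeness: items the respondent was eligible for but left unanswered.
    for item in ("age", "enrolled"):
        if record.get(item) is None:
            failures.append(f"missing:{item}")
    if record.get("enrolled") == "yes" and record.get("grade") is None:
        failures.append("missing:grade")

    return failures


if __name__ == "__main__":
    print(edit_record({"age": 30, "enrolled": "no", "grade": 5}))
```

Because the Standard requires editing to be repeated after imputation and after disclosure-related changes, a function of this kind would be run at each of those stages rather than once.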
     

STANDARD 4-1-2: Key variables in data sets used for cross-sectional estimates must be imputed, using a method that goes beyond overall mean imputation. This applies to cross-sectional data sets and to data from longitudinal data sets that are used to produce cross-sectional estimates (i.e., base year and subsequent freshened samples). (See Appendix B for a discussion of alternative imputation procedures, including the pros and cons of specific approaches.)

    GUIDELINE 4-1-2A: In census (universe) data collections, it may not be appropriate to impute data in certain situations (e.g., peer analysis situations or when data for a particular establishment, such as a school, university, or library, are being examined individually).

    GUIDELINE 4-1-2B: When using non-NCES data sets, it is desirable to impute for missing data in those items being used in NCES publications. This is only appropriate when adequate auxiliary information is available.

    GUIDELINE 4-1-2C: Imputation procedures should be internally consistent, be based on theoretical and empirical considerations, be appropriate for the analysis, and make use of the most relevant data available. If multivariate analysis is anticipated, care must be taken to use imputations that minimize the attenuation of underlying relationships. The Chief Statistician should review imputation plans prior to implementation.
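
As one illustration of an approach that goes beyond overall mean imputation, the sketch below applies a sequential hot deck within imputation classes: a missing value is filled with the most recently reported value from a record in the same class. This is only one of many possible procedures and is not presented as the method NCES prescribes (Appendix B discusses the alternatives). The function name, the `_flag` suffix, and the flag codes 0, 1, and 9 are assumptions made for the example.

```python
def sequential_hot_deck(records, item, class_vars):
    """Fill missing `item` values from the last reported value in the same class.

    Returns new records; the input list is not modified. Each output record
    gains a hypothetical flag field: 0 = reported, 1 = hot-deck imputed,
    9 = still missing because no donor had been seen in that class.
    """
    last_donor = {}  # imputation class -> most recent reported value
    result = []
    for rec in records:
        cell = tuple(rec[v] for v in class_vars)
        new = dict(rec)
        if rec[item] is not None:
            last_donor[cell] = rec[item]
            new[item + "_flag"] = 0
        elif cell in last_donor:
            new[item] = last_donor[cell]
            new[item + "_flag"] = 1
        else:
            new[item + "_flag"] = 9
        result.append(new)
    return result


if __name__ == "__main__":
    schools = [
        {"sector": "public", "enroll": 500},
        {"sector": "private", "enroll": 120},
        {"sector": "public", "enroll": None},
        {"sector": "private", "enroll": None},
    ]
    for row in sequential_hot_deck(schools, "enroll", ["sector"]):
        print(row)
```

In this toy file, overall mean imputation would assign 310 to both missing schools, whereas the class-based donor values (500 and 120) preserve the difference between sectors. That is the kind of attenuation of underlying relationships the guideline warns about.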


STANDARD 4-1-3: In the case of longitudinal data sets, two imputation approaches are acceptable: cross-wave imputations or cross-sectional imputations. Either approach may be used to complete missing data for longitudinal analysis. (Guideline 4-1-2C of this Standard applies here as well.)


STANDARD 4-1-4: In those cases where a nonresponse bias analysis shows that the data are not missing at random, the amount of potential bias must inform the decision to retain or delete individual items (see Standard 4-4).


STANDARD 4-1-5: In cases where imputation is not used (e.g., items that are not key variables in either cross-sectional or longitudinal analysis), data tables must include a reference to a methodology table or glossary that shows the actual weighted response rates for each unimputed variable included in the report (see Standard 1-3 for the item response rate formula). For individual variables with item response rates less than 85 percent, the variable must be footnoted in the row or column header. The footnote must alert readers to the fact that the response rate is below 85 percent and that missing data have not been explicitly accounted for in the data.
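
The 85 percent check can be sketched as follows. The authoritative item response rate formula is the one in Standard 1-3; the code below assumes a simplified form, namely the weighted count of unit respondents who answered the item divided by the weighted count of unit respondents eligible to answer it. The function names and the sample inputs are hypothetical.

```python
def weighted_item_response_rate(weights, answered, eligible):
    """Weighted share of eligible unit respondents who answered the item.

    Assumed simplification of the Standard 1-3 formula: respondents who
    validly skipped the item (eligible == False) are excluded from both
    the numerator and the denominator.
    """
    numerator = sum(w for w, a, e in zip(weights, answered, eligible) if e and a)
    denominator = sum(w for w, e in zip(weights, eligible) if e)
    return numerator / denominator


def needs_footnote(rate, threshold=0.85):
    """True when an unimputed item's response rate falls below the threshold."""
    return rate < threshold


if __name__ == "__main__":
    weights = [10, 20, 30, 40]
    answered = [True, True, False, True]
    eligible = [True, True, True, False]  # the last respondent validly skipped
    rate = weighted_item_response_rate(weights, answered, eligible)
    print(rate, needs_footnote(rate))
```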


STANDARD 4-1-6: When imputations are used, documentation indicating the weighted proportion of imputed data must be presented for all published estimates based on NCES data. Information about the amount of imputed data in the analysis can be included in the technical notes and does not have to accompany each table. The range of the amount of imputation used for the set of items included in an analysis must be reported. Also, the amount of imputation must be reported for items with response rates less than 70 percent. Items with response rates lower than 70 percent must be footnoted in the tables.
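
A minimal sketch of the documentation step: given each item's imputation indicators and the analysis weights, compute the weighted proportion of imputed data per item and the range across the items in the analysis. The function names and inputs are hypothetical, and the sketch assumes every record is eligible for every item.

```python
def weighted_imputed_share(weights, imputed):
    """Weighted proportion of records whose value for one item was imputed."""
    return sum(w for w, flag in zip(weights, imputed) if flag) / sum(weights)


def imputation_summary(weights, imputed_by_item):
    """Return per-item imputed shares and the (min, max) range across items."""
    shares = {
        item: weighted_imputed_share(weights, flags)
        for item, flags in imputed_by_item.items()
    }
    return shares, (min(shares.values()), max(shares.values()))


if __name__ == "__main__":
    weights = [1, 1, 2]
    imputed_by_item = {
        "income": [True, False, False],
        "tenure": [False, False, True],
    }
    shares, share_range = imputation_summary(weights, imputed_by_item)
    print(shares, share_range)
```

The range returned here corresponds to the figure that could be reported once in the technical notes, rather than alongside each table.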


STANDARD 4-1-7: All imputed values on a data file must be clearly identified as such.

    GUIDELINE 4-1-7A: Imputed data should be flagged in associated "flag" fields. The imputation method should be identified in the flag. Blanks are not legitimate values for flags.
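
One way to satisfy this guideline is a flag field whose codes identify the imputation method, together with a check that rejects blank or undefined flags. The code scheme below is purely hypothetical; the Standard does not prescribe particular flag values.

```python
from enum import IntEnum


class ImputationFlag(IntEnum):
    """Hypothetical flag codes; each imputed value records the method used."""
    REPORTED = 0
    LOGICAL = 1
    HOT_DECK = 2
    REGRESSION = 3


def invalid_flag_positions(flags):
    """Return the positions of flags that are blank or not a defined code."""
    valid = {int(code) for code in ImputationFlag}
    bad = []
    for position, flag in enumerate(flags):
        is_integer = isinstance(flag, int) and not isinstance(flag, bool)
        if not is_integer or flag not in valid:
            bad.append(position)
    return bad


if __name__ == "__main__":
    print(invalid_flag_positions([0, 2, "", None, 7]))
```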


STANDARD 4-1-8: If nonimputed items are used in the estimation of totals or ratios (as in Standard 4-1-5 above), the risks of not using imputed data must be described.

  1. Estimated totals using nonimputed data implicitly impute a zero value for all missing data. Because of these implicit zero imputations, the estimated totals will underestimate the true population totals. Thus, when reporting totals based on a nonimputed item, the response rate for that item must be footnoted in the data table.
     
  2. Ratios (averages) using nonimputed data will implicitly impute the cell ratio for all missing data within the cell. This can cause inconsistencies in the estimates between tables.
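
Both effects can be checked with a small worked example. The weights and values below are made up for illustration: four schools in one cell, each with a weight of 10, one of which did not report the item.

```python
def total_nonimputed(weights, values):
    """Weighted total over reported values only (missing values are skipped)."""
    return sum(w * v for w, v in zip(weights, values) if v is not None)


def total_with_fill(weights, values, fill):
    """Weighted total after replacing every missing value with `fill`."""
    return sum(w * (fill if v is None else v) for w, v in zip(weights, values))


def mean_nonimputed(weights, values):
    """Weighted mean over reported values only."""
    reported_weight = sum(w for w, v in zip(weights, values) if v is not None)
    return total_nonimputed(weights, values) / reported_weight


if __name__ == "__main__":
    weights = [10, 10, 10, 10]
    values = [100, 200, None, 300]  # hypothetical cell; third school is missing

    # 1. Skipping the missing value gives the same total as imputing zero.
    print(total_nonimputed(weights, values), total_with_fill(weights, values, 0))

    # 2. The respondent-only mean equals the mean obtained by imputing
    #    that same cell mean for the missing school.
    cell_mean = mean_nonimputed(weights, values)
    print(cell_mean, total_with_fill(weights, values, cell_mean) / sum(weights))
```

Here the nonimputed total is 6,000 while the nonimputed mean is 200. A reader who divides the published total by the full weighted count of 40 obtains 150 instead, which is one way inconsistencies between tables can arise.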