
Statistical Standards


An accommodation is a change in how a test is presented, in how a test is administered, or in how the test taker is allowed to respond. This term generally refers to changes that do not substantially alter what the test measures. The proper use of accommodations does not substantially change academic level or performance criteria. Appropriate accommodations are made to provide equal opportunity to demonstrate knowledge.
An African American or Black person has origins in any of the black racial groups of Africa. Terms such as "Haitian" or "Negro" can be used in addition to "Black or African American."
An American Indian or Alaska Native person has origins in any of the original peoples of North and South America (including Central America) and maintains tribal affiliation or community attachment.
An Asian person has origins in any of the original peoples of the Far East, Southeast Asia, or the Indian subcontinent, including, for example, Cambodia, China, India, Japan, Korea, Malaysia, Pakistan, the Philippine Islands, Thailand, and Vietnam.
An assessment is any systematic procedure for obtaining information from tests and other sources that can be used to draw inferences about characteristics of people, objects, or programs.
An award incentive plan links all or some of the contract deliverables to performance incentive payments beyond the fixed fee of the contract. There are minimum performance-based requirements that must be specified in order for a contract to be considered as an Award Incentive performance-based contract.

The base weight is the inverse of the probability of selection.
A bridge study continues an existing methodology concurrent with a new methodology for the purpose of defining the relationship between the new and old estimates.
A Black or African American person has origins in any of the black racial groups of Africa. Terms such as "Haitian" or "Negro" can be used in addition to "Black or African American."

The capture/recapture technique uses two independent frames to estimate the number of units missed on both frames. The first step is to match the frames to provide counts of units on one frame but not the other, as well as a count of units on both frames. With this information and several basic assumptions, it is possible to estimate the number of units missed on both frames. In practice, the two frames may not be completely independent, in which case a number of additional assumptions will be necessary to proceed with this type of estimation.
Classical test theory postulates that a test score can be decomposed into two parts, a true score and an error component; that the error component is random with a mean of zero and is uncorrelated with true scores; and that observed scores are linearly related to true scores and error components.
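As a rough illustration of the estimate described above, the sketch below assumes the two frames are fully independent and applies the usual dual-system (Lincoln-Petersen) form. The function names and the counts in the usage note are hypothetical and are not taken from the standards.

```python
def estimate_missed_on_both(only_a: int, only_b: int, both: int) -> float:
    """Estimate the units missed by both frames, assuming independent frames."""
    if both <= 0:
        raise ValueError("at least one unit must be matched on both frames")
    # Under independence, the ratio of units missed by frame B to units
    # caught by frame B is the same on and off frame A:
    #   missed / only_b == only_a / both
    return only_a * only_b / both


def estimate_population(only_a: int, only_b: int, both: int) -> float:
    """Observed units plus the estimated number missed on both frames."""
    return only_a + only_b + both + estimate_missed_on_both(only_a, only_b, both)
```

For example, with 100 units found only on the first frame, 50 only on the second, and 200 on both, the sketch estimates 25 units missed by both frames and a population of 375.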
Clustered samples are those in which a naturally occurring group is first selected, such as a school or a residential block, and then units are sampled within the selected groups.
Coarsening disclosure limitation techniques preserve the individual respondent's data by reducing the level of detail used to report some variables. Examples of this technique include: recoding continuous variables into intervals; recoding categorical data into broader intervals; and top or bottom coding the ends of continuous distributions.
Confidentiality involves the protection of individually identifiable data from unauthorized disclosures.
Confidentiality edits are defined as edits that are applied to microdata for the purpose of protecting data that will be released in tabular form. Confidentiality edits are implemented using perturbation techniques. These techniques are used to alter the responses in the microdata file before tabulations are produced. Thus, all tables are protected in a consistent way. Because the perturbation techniques that are used are designed to preserve the level of detail in the microdata file, confidentiality edits maximize the information that can be provided in tables, without requiring cell suppression or controlled rounding.
A consistent data series maintains comparability over time by keeping an item fixed, or by incorporating appropriate adjustment methods in the event an item is changed.
To be recognized as a Consolidated Metropolitan Statistical Area (CMSA) an area must meet the requirements for recognition as an MSA, have a total population of one million or more, and have: (1) separate component areas that can be identified within the entire area by meeting specified statistical criteria, and (2) local opinion that indicates support for the component areas.
Coverage refers to the extent to which all elements on a frame list are members of the population, and to which every element in a population appears on the frame list once and only once.
Coverage error refers to the discrepancy between statistics calculated on the frame population and the same statistics calculated on the target population. Undercoverage errors occur when target population units are missed during frame construction, and overcoverage errors occur when units are duplicated or enumerated in error.
A crosswalk study delineates how categories from one classification system are related to categories in a second classification system.
A cross-sectional sample survey is based on a representative sample of respondents drawn from a population at one point in time.
Cross-sectional imputations are based on data from a single time period.
Cross-wave imputations are imputations based on data from multiple time periods. For example, a cross-sectional imputation for a time 2 salary could simply be a donor's time 2 salary. Alternatively, a cross-wave imputation could be the change in a donor's salary from time 1 to time 2 multiplied by the time 1 nonrespondent's salary.
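The two salary examples above can be sketched as follows. This is a minimal illustration that assumes the "change" in the donor's salary is expressed as a ratio (time 2 divided by time 1); the function names are hypothetical.

```python
def cross_sectional_imputation(donor_t2: float) -> float:
    """Use the donor's time 2 salary directly as the imputed time 2 salary."""
    return donor_t2


def cross_wave_imputation(donor_t1: float, donor_t2: float,
                          nonrespondent_t1: float) -> float:
    """Apply the donor's relative change to the nonrespondent's time 1 salary."""
    if donor_t1 == 0:
        raise ValueError("donor time 1 salary must be nonzero")
    # Assumption: change is measured as the ratio donor_t2 / donor_t1.
    change_ratio = donor_t2 / donor_t1
    return nonrespondent_t1 * change_ratio
```

A donor whose salary rose from 40,000 to 44,000 would give a nonrespondent with a time 1 salary of 50,000 an imputed time 2 salary of about 55,000.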
A cut score is a specified point on a score scale such that scores at or above that point are interpreted or acted upon differently from scores below that point.

A Data Analysis System (DAS) is an analysis software system that generates tabular estimates and correlation coefficients in a framework that allows external users to analyze individually identifiable data without allowing the user direct access to individual data records. Users are denied access to individual data records because the data are not in a directly readable format. Additional safeguards come through the use of population subsampling and differential weighting from the sample design, as well as confidentiality edits. The degree of editing required is a direct function of the capabilities of the DAS. As an example, a DAS that provides weighted totals (i.e., a direct measure of population size) within cells would require more confidentiality editing than one that does not provide weighted cell totals, because there is a greater risk of disclosure in groups with small population size.
Data swapping is a perturbation disclosure limitation technique that results in a confidentiality edit. A simplistic example of data swapping would be to assume a data file has two potential individual identifying variables, for example, sex and age. If a sample case needs disclosure protection, it is paired with another sampled case so that each element of the pair has the same age, but different sexes. The data on these two records are then swapped. After the swapping, anyone thinking they have identified either one of the paired cases gets the data of the other case, so they have not made an accurate match and the data have been protected.
DEFT is the square root of a design effect.
A derived score is a raw score converted by numerical transformation into a new score providing a more meaningful and/or different measure (e.g., conversion of raw scores to percentile ranks, standard scores, or grade equivalence).
The design effect (DEFF) is the ratio of the true variance of a statistic (taking the complex sample design into account) to the variance of the statistic for a simple random sample with the same number of cases. Design effects differ for different subgroups and different statistics; no single design effect is universally applicable to any given survey or analysis.
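The DEFF and DEFT definitions reduce to a ratio and a square root. The sketch below assumes both variance estimates are already available for the same statistic; the function names are hypothetical.

```python
import math


def design_effect(var_complex: float, var_srs: float) -> float:
    """DEFF: variance under the complex design over the SRS variance."""
    if var_srs <= 0:
        raise ValueError("SRS variance must be positive")
    return var_complex / var_srs


def deft(deff: float) -> float:
    """DEFT: the square root of the design effect."""
    return math.sqrt(deff)
```

Because design effects differ by subgroup and statistic, a calculation like this would be repeated for each estimate of interest rather than done once per survey.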
Differential Item Functioning (DIF) exists when examinees of equal ability differ on an item solely because of their membership in a particular group.
Disability is a physical or mental impairment that substantially limits one or more of the major life activities (42 U.S.C. 12102).
Disclosure risk analysis is used to determine which records require masking to produce a public-use data file from a restricted-use data file.
Domain refers to a defined universe of knowledge, skills, abilities, attitudes, interests, or other human characteristics.
Dual-frame estimation uses a dual-frame design to combine two frames in the same survey, offering coverage rates that may exceed those of any single frame. Sometimes the best available list is known to have poor coverage, and no supplemental list frame is known that would provide sufficient coverage. In such a case, an area frame could be used as the second frame.

Editing is a procedure that uses available information and some assumptions to derive substitute values for inconsistent values in a data file.
Effect size refers to the standardized magnitude of the effect or the departure from the null hypothesis. For example, the effect size may be the amount of change over time, or the difference between two population means, divided by the appropriate population standard deviation. Multiple measures of effect size can be used (e.g., standardized differences between means, correlations, and proportions).
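One of the measures mentioned above, the standardized difference between two means, can be sketched as follows. The sketch assumes the appropriate population standard deviation is supplied by the analyst; the function name is hypothetical.

```python
def standardized_mean_difference(mean_1: float, mean_2: float,
                                 population_sd: float) -> float:
    """Difference between two means divided by the population standard deviation."""
    if population_sd <= 0:
        raise ValueError("standard deviation must be positive")
    return (mean_1 - mean_2) / population_sd
```

For instance, means of 105 and 100 with a standard deviation of 20 give an effect size of 0.25.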
The effective sample size, as used in the design phase, is the sample size under a simple random sample design that is equivalent to the actual sample under the complex sample design. In the case of complex sample designs, the actual sample size is determined by multiplying the effective sample size by the anticipated design effect.
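The design-phase calculation described above is a single multiplication. This sketch rounds up to a whole number of cases, which is an assumption made here rather than a rule stated in the standards; the function name is hypothetical.

```python
import math


def actual_sample_size(effective_n: float, anticipated_deff: float) -> int:
    """Cases needed under the complex design for a given effective sample size."""
    if effective_n <= 0 or anticipated_deff <= 0:
        raise ValueError("inputs must be positive")
    # Assumption: round up so the effective sample size is not undershot.
    return math.ceil(effective_n * anticipated_deff)
```

An effective sample size of 400 with an anticipated design effect of 1.5 would call for 600 sampled cases.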
Equating of two tests is established when examinees of every ability level and from every population group can be indifferent about which of two tests they take. Not only should they have the same expected mean score on each test, but they should also have the same errors of measurement.
Estimation is the process of using sample data to provide a single best value for a parameter (such as a mean, proportion, correlation, or effect size), or to provide a range of values in the form of a confidence interval.

Fairness of a test is attained when construct-irrelevant personal characteristics such as race, ethnicity, sex, or disability have no appreciable effect on test results or their interpretation.
In a field test all or some of the survey procedures are tested on a small scale that mirrors the planned full-scale implementation.
A frame is a mapping of the universe elements (i.e., sampling units) onto a finite list (e.g., the population of schools on the day of the survey).
The frame population is the set of elements that can be enumerated prior to the selection of a survey sample.
A freshened sample includes new cases added to a longitudinal sample plus the retained cases from the longitudinal sample used to produce cross-sectional estimates of the population at the time of a subsequent wave of a longitudinal data collection.

The half-open interval technique is used to increase coverage. In this technique, new in-scope units between a unit A on the previous frame up to, but not including, unit B (the next unit on the previous frame) are associated with unit A. These new units have the same selection probability as unit A's. This process is repeated for every unit on the frame. The new units associated with the actual sample cases are now included in the sample with their respective selection probabilities. For example, in the case of freshening the sample, this technique may be applied to a new list that includes cases that were covered in a previous frame, as well as new in-scope units not included in the previous frame.
A Hispanic or Latino person is of Cuban, Mexican, Puerto Rican, South or Central American, or other Spanish culture or origin, regardless of race. The term "Spanish origin" can be used in addition to "Hispanic or Latino."
Hypothesis testing draws a conclusion about the tenability of a stated value for a parameter. For example, sample data may be used to test whether an estimated value of a parameter (such as the difference between two population means) is sufficiently different from zero that the null hypothesis, designated H0 (no difference in the population means), can be rejected in favor of the alternative hypothesis, H1 (a difference between the two population means).

Imputation is a procedure that uses available information and some assumptions to derive substitute values for missing values in a data file.
An Individualized Education Plan (IEP) refers to a written statement for each individual with a disability that is developed, reviewed, and revised in accordance with Title 20 U.S.C. Section 1414(d).
Individually identifiable data refers specifically to data from any list, record, response form, completed survey, or aggregation about an individual(s) from which information about particular individuals or their schools/education institutions may be revealed by either direct or indirect means.
Instrument refers to an evaluative device that includes tests, scales, and inventories to measure a domain using standardized procedures.
Item nonresponse occurs when a respondent fails to respond to one or more relevant item(s) on a survey.
Item Response Theory (IRT) postulates that the probability of correct responses to a set of test questions is a function of true proficiency and of one or more parameters specific to each test question.

Key variables include survey-specific items for which aggregate estimates are commonly published by NCES. They include, but are not restricted to, variables most commonly used in table row stubs. Key variables also include important analytic composites and other policy-relevant variables that are essential elements of the data collection. They are first defined in the initial planning stage of a survey, but may be added to as the survey and resulting analyses develop. For example, the National Assessment of Educational Progress (NAEP) consistently uses gender, race-ethnicity, urbanicity, region, and school type (public/private) as key reporting variables.

A Latino or Hispanic person is of Cuban, Mexican, Puerto Rican, South or Central American, or other Spanish culture or origin, regardless of race. The term "Spanish origin" can be used in addition to "Hispanic or Latino."
Linkage results from placing two or more tests on the same scale, so that scores can be used interchangeably.
A longitudinal sample survey follows the experiences and outcomes over time of a representative sample of respondents (i.e., a cohort) who are defined based on a shared experience (e.g., shared birth year or grade in school).

Metadata contain information about the microdata.
Metropolitan Statistical Areas (MSAs) are those areas that: (1) include a city of at least 50,000 population, or (2) include a Census Bureau-defined urbanized area (of at least 50,000 population) with a total metropolitan population of at least 100,000 (75,000 in New England). In addition to the county(ies) containing the main city or urbanized area, an MSA may include additional counties that have strong economic and social ties to the central county(ies) and meet specified requirements of metropolitan character. The ties are determined chiefly by census data on commuting to work. A metropolitan statistical area may contain more than one city with a population of 50,000 and may cross state lines.
The minimum substantively significant effect (MSSE) is the smallest effect, that is, the smallest departure from the null hypothesis, considered to be important for the analysis of key variables. The minimum substantively significant effect is determined during the design phase. For example, the planning document should provide the minimum change in key variables or perhaps, the minimum correlation, r, between two variables that the survey should be able to detect for a specified population domain, or subdomain of analytic interest. The MSSE should be based on a broad knowledge of the field, related theories, and supporting literature.
Multiplicity estimation is a technique used to adjust selection probabilities when the unit of interest has multiple chances of being selected. For example, in a random digit dialing household survey, households with multiple phone numbers have a probability of being selected more than once. In this case by identifying the number of distinct telephone numbers in a household, the sampling weights can be adjusted to generate an unbiased household weight.
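The telephone example above can be sketched as a simple weight adjustment. The sketch assumes each telephone number has the same chance of selection, so a household's chance grows in proportion to its number of distinct lines; the function name is hypothetical.

```python
def multiplicity_adjusted_weight(base_weight: float, phone_numbers: int) -> float:
    """Divide the weight by the number of ways the household could be selected."""
    if phone_numbers < 1:
        raise ValueError("a sampled household has at least one telephone number")
    # Assumption: selection probability is proportional to the number of lines.
    return base_weight / phone_numbers
```

A household reached through one of its two telephone numbers, with a base weight of 500, would receive an adjusted weight of 250.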

A Native Hawaiian or Other Pacific Islander person has origins in any of the original peoples of Hawaii, Guam, Samoa, or other Pacific Islands.
New England County Metropolitan Areas (NECMAs) are county-based alternatives to the city- and town-based metropolitan areas used in New England. The NECMA for an MSA or CMSA includes: (1) the county containing the city named first in that MSA/CMSA title (this county may include the cities named first for other MSAs/CMSAs), and (2) each additional county having at least half its population in the MSA/CMSA(s) whose first-named cities are in the county identified in step 1. NECMAs are not defined for individual PMSAs.
Noncoverage involves eligible units of the target population that are missing from the frame population; this includes the problems of incomplete frames and missing units.
Nonresponse bias occurs when the observed value deviates from the population parameter due to differences between respondents and nonrespondents. Nonresponse bias is likely to occur as a result of not obtaining 100 percent response from the selected cases.
Nonsampling error includes measurement errors due to nonresponse, coverage, interviewers, respondents, instruments, processing, and mode.

An Other Pacific Islander or Native Hawaiian person has origins in any of the original peoples of Hawaii, Guam, Samoa, or other Pacific Islands.
Overall unit nonresponse reflects a combination of unit nonresponse across two or more levels of data collection, where participation at the second stage of data collection is conditional upon participation in the first stage of data collection.
Overcoverage errors occur when units are duplicated or enumerated in error.

Perturbation disclosure limitation techniques directly alter the individual respondent's data for some variables, but preserve the level of detail in all variables included in the microdata file. Blanking and imputing for randomly selected records; blurring (e.g., combining multiple records through some averaging process into a single record); adding random noise; and data swapping or switching (e.g., switching the sex variable from a predetermined pair of individuals) are all examples of perturbation techniques.
In a pilot test a laboratory or a very small-scale test of a questionnaire or procedure is conducted.
A planning document includes a justification for a study, a description of the survey design and methodology, an analysis plan, a survey evaluation plan, and a cost estimate.
The potential magnitude of nonresponse bias can be estimated by taking the product of the nonresponse rate and the difference in values of a characteristic between respondents and nonrespondents.
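The product described above can be written out directly. The sketch assumes the nonrespondent value is known or can be approximated from auxiliary data, which is often the hard part in practice; the function name is hypothetical.

```python
def potential_nonresponse_bias(nonresponse_rate: float,
                               respondent_value: float,
                               nonrespondent_value: float) -> float:
    """Nonresponse rate times the respondent/nonrespondent difference."""
    if not 0.0 <= nonresponse_rate <= 1.0:
        raise ValueError("nonresponse rate must be between 0 and 1")
    return nonresponse_rate * (respondent_value - nonrespondent_value)
```

With a 25 percent nonresponse rate, a respondent mean of 60, and a nonrespondent mean of 52, the potential bias in the respondent mean is 2 points.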
The power (1 - β) of a test is defined as the probability of rejecting the null hypothesis when a specific alternative hypothesis is assumed. For example, with β = 0.20 for a particular alternative hypothesis, the power is 0.80, which means that 80 percent of the time the test statistic will fall in the rejection region if the parameter has the value specified by the alternative hypothesis.
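To make the definition of power concrete, the sketch below computes it for a two-sided z test against one specific alternative. It assumes a normally distributed test statistic with a known standard error; the function and parameter names are hypothetical.

```python
from statistics import NormalDist


def z_test_power(effect: float, std_error: float, alpha: float = 0.05) -> float:
    """Power of a two-sided z test when the true difference equals `effect`."""
    std_normal = NormalDist()
    critical = std_normal.inv_cdf(1 - alpha / 2)  # edge of the rejection region
    shift = effect / std_error                    # standardized true difference
    # Probability that the test statistic lands in either rejection tail.
    upper_tail = 1 - std_normal.cdf(critical - shift)
    lower_tail = std_normal.cdf(-critical - shift)
    return upper_tail + lower_tail
```

When the true difference is zero, the result equals the Type I error rate. A true difference of roughly 2.8 standard errors yields power near 0.80 at the 0.05 level.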
Precision of survey results refers to how closely the results from a sample can reproduce the results that would be obtained from a complete count (i.e., census) conducted using the same techniques. The difference between a sample result and the result from a complete census taken under the same conditions is known as the precision of the sample result.
A survey pretest involves experimenting with different components of the questionnaire or survey design or operationalization prior to full-scale implementation. This may involve pilot testing, that is, a laboratory or very small-scale test of a questionnaire or procedure, or a field test in which all or some of the survey procedures are tested on a small scale that mirrors the planned full-scale implementation.
A point estimate involves using the value of a particular sample statistic to estimate the value for a parameter of interest.
Primary Metropolitan Statistical Areas (PMSAs) are the component areas of a CMSA. If no PMSAs are recognized, the entire area is designated an MSA.
The probability of selection is the probability that an element will be drawn in a sample. In a simple random selection, this probability is the number drawn in the sample divided by the number of elements on the sampling frame.
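For simple random selection, the probability of selection and the corresponding base weight (the inverse of that probability) can be sketched as follows. The function names are hypothetical.

```python
def selection_probability(sample_size: int, frame_size: int) -> float:
    """Number drawn divided by the number of elements on the sampling frame."""
    if not 0 < sample_size <= frame_size:
        raise ValueError("sample size must be between 1 and the frame size")
    return sample_size / frame_size


def base_weight(probability: float) -> float:
    """Base weight: the inverse of the probability of selection."""
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return 1.0 / probability
```

Drawing 50 units from a frame of 1,000 gives each unit a selection probability of 0.05 and a base weight of 20, so each sampled unit represents 20 frame units.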
A public-use data file includes a subset of data that have been coded, aggregated, or otherwise altered to mask individually identifiable information, and thus, is available to all external users. Unique identifiers, geographic detail, and other variables that cannot be suitably altered are not included in public-use data files.
Public-use edits are based on an assumption that external users have access to both individual respondent records and secondary data sources that include data which could be used to identify respondents. For this reason, the editing process is relatively extensive. When determining an appropriate masking process, the public-use edit takes into account and guards against matches on common variables from all known files that could be matched to the public-use file.

Raking is a method of adjusting sample estimates to known marginal totals from an independent source. For a two-dimensional case, the procedure uses the sample weights to proportionally adjust the weights to one set of marginals. Next, these adjusted weights are proportionally adjusted to the second set of marginals. This two-step adjustment process is repeated a number of times until the adjusted sample weights converge simultaneously to both sets of marginals.
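The two-step adjustment described above can be sketched for a two-dimensional table of weighted cell totals. This is a minimal illustration that assumes every cell is positive and that the row and column targets sum to the same grand total. The function name, tolerance, and iteration cap are hypothetical choices.

```python
def rake(table, row_targets, col_targets, tol=1e-9, max_iter=100):
    """Iteratively scale weighted cell totals to match both sets of marginals."""
    cells = [list(row) for row in table]
    for _ in range(max_iter):
        # Step 1: proportionally adjust each row to its marginal total.
        for i, target in enumerate(row_targets):
            row_total = sum(cells[i])
            cells[i] = [value * target / row_total for value in cells[i]]
        # Step 2: proportionally adjust each column to its marginal total.
        for j, target in enumerate(col_targets):
            col_total = sum(row[j] for row in cells)
            for row in cells:
                row[j] *= target / col_total
        # Columns now match exactly; stop once the rows also match.
        if all(abs(sum(cells[i]) - target) <= tol
               for i, target in enumerate(row_targets)):
            break
    return cells
```

In a real application, each sampled unit's weight would be multiplied by the ratio of the raked cell total to the original cell total for the cell that unit falls in.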
A random-digit dial sample survey randomly selects respondents based on a sample of phone numbers and information obtained using a screener questionnaire.
The reference year is the year about which the data were collected.
The rejection region is defined by the alternative hypothesis H1 and the α level. If the test statistic is in this region, the null hypothesis is rejected.
Reliability is the degree to which test scores for a group of test takers are consistent over repeated applications of a measurement procedure and hence are inferred to be dependable and repeatable for an individual test taker.
Replication methods are approximate variance methods that estimate the variance based on the variability of estimates formed from subsamples of the full sample. The subsamples are generated to properly reflect the variability due to the sample design.
Required response items include the minimum set of items required for a case to be considered a respondent.
Response rates calculated using base weights measure the proportion of the sample frame that is represented by the responding units in each study.
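A base-weighted response rate can be sketched as the weighted share of responding units. The sketch assumes the input covers only eligible sampled units and ignores cases of unknown eligibility; the function name and data layout are hypothetical.

```python
def weighted_response_rate(cases):
    """cases: (base_weight, responded) pairs for eligible sampled units."""
    cases = list(cases)
    total_weight = sum(weight for weight, _ in cases)
    if total_weight <= 0:
        raise ValueError("total base weight must be positive")
    responding_weight = sum(weight for weight, responded in cases if responded)
    return responding_weight / total_weight
```

Three sampled units with base weights 10, 10, and 30, where only the first and third respond, give a weighted response rate of 0.8, compared with an unweighted rate of about 0.67.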
A restricted-use data file includes individually identifiable information that is confidential and protected by law. Restricted-use data files are not required to include variables that have undergone coarsening disclosure risk edits.

Sampling error is the error associated with nonobservation, that is, the error that occurs because all members of the frame population are not measured. It is the error associated with the variation in samples drawn from the same frame population. The variance equals the square of the sampling error.
Scaling refers to the process of assigning a scale score based on the pattern of responses.
Scoring/rating is the process of evaluating the quality of the examinee's responses to individual cognitive questions.
Section 504 of the Rehabilitation Act of 1973, as amended (Title 29 U.S.C. 794 Section 504), prohibits discrimination on the basis of handicap in federally assisted programs and activities.
Simple comparison is a test (such as a t test or a z test) of the difference between two means or proportions.
Simple Random Sampling (SRS) uses equal probability sampling with no strata or clusters. Most statistical analysis software assumes SRS and independently distributed errors.
Stage of data collection includes any stage or step in the sample identification and data collection process in which data are collected from the identified sample unit. This includes information obtained that is required to proceed to the next stage of sample selection or data collection (e.g., school district permission for schools to participate or schools providing lists of teachers for sample selection of teachers).
Statistical disclosure limitation techniques are used to prepare microdata files for release; they include perturbation techniques and coarsening techniques.
A statistical inference is a decision about one or more unknown or unobserved population parameter(s) based on estimation and/or hypothesis testing.
Strata are created by partitioning the frame and are generally defined so that the units within each stratum are relatively homogeneous.
Substitutions are done using matched pairs, in which the alternate member of the pair does not have an independent probability of selection.
A supplemental area frame can be created. This is often done by first, generating a frame of geographic units where all the geographic units are represented providing full geographic coverage. Next, a probability sample of the geographic units is selected. An intensive search procedure is carried out in each selected area. This generates a supplemental area frame for each selected area. Assuming no error in the search process, the supplemental area frame has complete coverage and the cases can be weighted to represent a national estimate. The data from both the main list frame and the supplemental area frame are then combined so that the weighted sample estimates provide complete coverage.
An individual survey is driven by one data collection form, such as the Private School Survey or the Academic Library Survey.
A survey system is a set of individual surveys that are interrelated components of a data collection, such as the Schools and Staffing Survey or the Integrated Postsecondary Education Data System.
The survey year is the year in which the data were collected.

The tail of the sampling distribution of the test statistic contains the rejection region for the hypothesis tested, H0.
The target population is the finite set of observable or measurable elements (i.e., sampling units) that will be studied.
Taylor-series linearization is an approximate variance method in which an estimate is linearized as a first step. The variance of the linearized estimate is then computed using either an exact or approximate variance formula appropriate for the sample design.
Total nonresponse reflects a combination of the overall unit nonresponse and item nonresponse for a specific item.
Type I error is made when the tested hypothesis, H0, is falsely rejected when in fact it is assumed true. The probability of making a Type I error is denoted by alpha (α). For example, with an alpha level of 0.05, the analyst will conclude that a difference is present in 5 percent of tests where the null hypothesis is true.
Type II error is made when the null hypothesis, H0, is not rejected when in fact a specific alternative hypothesis, H1, is assumed true. The probability of making a Type II error is denoted by beta (β). For example, with a beta level of 0.20, the analyst will conclude that no difference is present in 20 percent of all cases in which the specific hypothesized alternative, H1, is true.

Undercoverage errors occur when target population units are missed during frame construction.
Un-duplication involves the process of deleting units that are erroneously in the frame more than once to correct for overcoverage.
Unit nonresponse occurs when a respondent fails to respond to all required response items (i.e., fill out or return a data collection instrument).
A universe survey involves the collection of data covering all known units in a population (i.e., a census).

Validity is the extent to which a test or set of operations measures what it is supposed to measure. Validity refers to the appropriateness of inferences from test scores or other forms of assessment.
Variance is a measure of the error associated with nonobservation, that is, the error that occurs because all members of the frame population are not measured. It reflects the variation in samples drawn from the same frame population. The variance equals the square of the sampling error.

A wave is a round of data collection in a longitudinal survey (e.g., the base year and each successive follow-up are each waves of data collection).
A White person has origins in any of the original peoples of Europe, the Middle East, or North Africa.

National Center for Education Statistics, U.S. Department of Education