
Statistical Significance and Sample Size

When the National Center for Education Statistics (NCES) reports differences in results, those differences are statistically significant. Understanding statistical significance in large-scale assessments, how results are estimated, and the influence of sample size is important when interpreting NAEP data in The Nation's Report Card. This guide explains NAEP results and the use of statistical significance in NAEP data.

Statistical Significance

The differences between scale scores and between percentages discussed in the results take into account the standard errors associated with the estimates. Comparisons are based on statistical tests that consider both the magnitude of the difference between the group average scores or percentages and the standard errors of those statistics. Throughout the results, differences between scores or between percentages are discussed only when they are significant from a statistical perspective.

All differences reported are significant at the 0.05 level with appropriate adjustments for multiple comparisons. The term "significant" is not intended to imply a judgment about the absolute magnitude or the educational relevance of the differences. It is intended to identify statistically dependable population differences to help inform dialogue among policymakers, educators, and the public.
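
As an illustration of the kind of adjustment involved, the sketch below applies the Benjamini-Hochberg false discovery rate procedure, one common multiple-comparison adjustment, to a set of made-up p-values. It is a minimal sketch, not NAEP's exact procedure, and the p-values are hypothetical.

    # Sketch: flag which of several comparisons remain significant at the 0.05
    # level after a multiple-comparisons adjustment. Uses the Benjamini-Hochberg
    # procedure as one common example; NAEP's exact adjustment may differ, and
    # the p-values below are made up for illustration.
    def benjamini_hochberg(p_values, alpha=0.05):
        m = len(p_values)
        order = sorted(range(m), key=lambda i: p_values[i])
        significant = [False] * m
        # Find the largest rank k with p_(k) <= (k/m) * alpha; all smaller
        # ranks are significant as well.
        threshold_rank = 0
        for rank, i in enumerate(order, start=1):
            if p_values[i] <= rank / m * alpha:
                threshold_rank = rank
        for rank, i in enumerate(order, start=1):
            if rank <= threshold_rank:
                significant[i] = True
        return significant

    p_values = [0.001, 0.012, 0.030, 0.049, 0.200]   # hypothetical comparisons
    print(benjamini_hochberg(p_values))              # [True, True, True, False, False]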

Comparisons across states use a t-test to detect whether a difference is statistically significant. A t-test is the method most commonly used to evaluate the difference in means between two groups. There are four possible outcomes when comparing the average scores of jurisdictions A and B (a small sketch of such a test follows this list):

  • Jurisdiction A has a higher average score than jurisdiction B,
  • Jurisdiction A has a lower average score than jurisdiction B,
  • No difference in scores is detected between jurisdictions A and B, or
  • The sample does not permit a reliable statistical test. (This may occur when the sample size for a particular group is small.)
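
The sketch below, using hypothetical scores and standard errors, shows how such a t-test classifies a comparison into the first three outcomes. The 1.96 critical value assumes large samples and no multiple-comparison adjustment, so it is illustrative only.

    import math

    def compare_jurisdictions(mean_a, se_a, mean_b, se_b, critical=1.96):
        """Classify a pairwise comparison of two average scale scores (sketch).

        The t statistic divides the difference in means by the standard error
        of that difference; the 1.96 critical value assumes large samples and
        no multiple-comparison adjustment.
        """
        t = (mean_a - mean_b) / math.sqrt(se_a**2 + se_b**2)
        if t > critical:
            return "A higher than B"
        if t < -critical:
            return "A lower than B"
        return "no difference detected"

    # Hypothetical scores: jurisdiction A at 285 (SE 1.1), jurisdiction B at 282 (SE 1.0).
    print(compare_jurisdictions(285, 1.1, 282, 1.0))  # "A higher than B" (t is about 2.02)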

When comparing all jurisdictions to each other, the testing procedures are based on all pairwise combinations of the jurisdictions in a particular year or pair of years. A given state or jurisdiction may have a numerically higher average scale score than the nation or another state without the difference being statistically significant, while another state with the same average score may show a statistically significant difference compared to the nation or the other state. Such situations arise because standard errors vary across states/jurisdictions and estimates.
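
The sketch below uses hypothetical numbers to show how two states with the same average score can differ in significance solely because their standard errors differ.

    import math

    def t_stat(score_1, se_1, score_2, se_2):
        return (score_1 - score_2) / math.sqrt(se_1**2 + se_2**2)

    # Hypothetical: both states score 3 points above the nation, but state B's
    # estimate has a larger standard error, so its difference is not detected.
    nation  = (280.0, 0.3)   # (average score, standard error)
    state_a = (283.0, 1.0)   # t is about 2.87 -> significant at the 0.05 level
    state_b = (283.0, 1.8)   # t is about 1.64 -> not significant

    print(t_stat(*state_a, *nation))
    print(t_stat(*state_b, *nation))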

NAEP results should not be compared without considering statistical significance.

Results Are Estimates

The average scores and percentages presented are estimates because they are based on the achievement data of representative samples of students rather than on the entire population of students. Moreover, no single student takes the entire NAEP assessment. Each student answers only a sample of questions in up to two subject areas. As such, NAEP results are subject to a measure of uncertainty, reflected in the standard error of the estimates. The standard errors for the estimated scale scores and percentages in the figures and tables presented are available through the NAEP Data Explorer.
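
As a minimal illustration, assuming a simple random sample (NAEP's actual estimation is based on a far more complex sample design), an estimate and its standard error can be computed as follows.

    import math
    import random

    random.seed(1)

    # Hypothetical simple random sample of 2,000 student scores; the sample mean
    # is the estimate, and s / sqrt(n) is its standard error. NAEP's actual
    # estimation uses a complex sample design, so this is an illustration only.
    scores = [random.gauss(282, 35) for _ in range(2000)]
    n = len(scores)
    mean = sum(scores) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in scores) / (n - 1))
    se = sd / math.sqrt(n)
    print(f"estimate = {mean:.1f}, standard error = {se:.2f}")  # SE near 35/sqrt(2000), about 0.78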

Influence of Sample Size

Sampling Variance

Like any survey based on a sample, NAEP results are subject to uncertainty. This uncertainty is reflected by the standard error of NAEP estimates; the more precise the estimate, the smaller the standard error.

The first source of uncertainty arises from the fact that NAEP assesses only a sample of students, rather than every eligible student (a census). The sample consists of a number of randomly selected students. Carefully constructed surveys can yield very precise estimates of population quantities. However, a different, equally good sample of students could have been selected, and the results based on that second sample would be slightly different. Thus, the first component of the standard error is due to the sampling of students, termed "sampling variance."
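
The sketch below illustrates this idea with a synthetic population: several equally valid random samples yield slightly different means, and that sample-to-sample spread is the sampling variance.

    import random

    random.seed(2)

    # Illustration: draw several equally valid random samples from the same
    # (synthetic) population and note that each sample mean differs slightly.
    # That sample-to-sample spread is the sampling variance.
    population = [random.gauss(282, 35) for _ in range(500_000)]
    sample_means = []
    for _ in range(5):
        sample = random.sample(population, 2000)
        sample_means.append(sum(sample) / len(sample))

    print([round(m, 1) for m in sample_means])  # five means differing by roughly a point or two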

In a good sampling design, the sampling variance decreases as the number of students selected increases. Large groups will therefore tend to have smaller standard errors than smaller groups. A NAEP national assessment typically contains about 10,000 students. Some NAEP assessments include separate state-level samples of over 2,000 students per state, which are combined to produce national results. These state-national assessments result in total samples of approximately 140,000 students. Thus, results for the nation based on NAEP state-national assessments will have much smaller standard errors than results from NAEP national-only assessments.
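
Under a simple random-sampling approximation (NAEP's clustered design changes the exact factor), the standard error shrinks in proportion to one over the square root of the sample size, so the larger combined sample yields markedly smaller standard errors:

    import math

    # Under a simple random-sampling approximation, the standard error shrinks in
    # proportion to 1 / sqrt(n). NAEP's clustered design changes the exact factor,
    # but the direction of the effect is the same.
    n_national = 10_000
    n_state_national = 140_000
    print(math.sqrt(n_state_national / n_national))  # roughly 3.7 times smaller standard error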

Measurement Variance

The second source of uncertainty in NAEP results is due to "measurement." Measurement variance arises from the fact that a student's proficiency in a subject (e.g., how good the student is at mathematics) is not directly observed but has to be estimated from the answers that the student provides to the items on the assessment. It is possible that, were the assessment given on a different day, the student might provide slightly different answers. Similarly, a different version of the assessment, composed of different but equally valid items, would give slightly different estimates of students' proficiency. These two factors give rise to what is typically termed "measurement variance."
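
One way to illustrate measurement variance, though not necessarily NAEP's exact procedure, is to imagine several equally plausible proficiency estimates for each student and look at how much a group average varies across them:

    # Illustration of measurement variance: each row holds several equally
    # plausible proficiency estimates for one hypothetical student, reflecting
    # uncertainty from the particular items answered on a particular day. The
    # spread of the group average across those columns is a measurement-variance
    # component. This mirrors the idea only; it is not NAEP's exact procedure.
    plausible = [
        [281.4, 284.0, 279.9],
        [301.2, 298.7, 300.5],
        [265.0, 268.3, 266.1],
    ]
    column_means = [sum(row[j] for row in plausible) / len(plausible) for j in range(3)]
    mean_of_means = sum(column_means) / len(column_means)
    measurement_var = sum((m - mean_of_means) ** 2 for m in column_means) / (len(column_means) - 1)
    print(column_means, measurement_var)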

NAEP assessments contain a third, related source of measurement uncertainty, due to sampling of items. The contents of all NAEP assessments are created according to the specifications of a framework developed by the National Assessment Governing Board. NAEP frameworks are quite broad and multifaceted, and the resulting assessments are long. Taking the full assessment would require approximately 5-6 hours for each student, which is unreasonable to ask of students. To limit the burden on individual students, NAEP items are grouped into blocks requiring 25-30 minutes to complete. Each student receives a booklet of two blocks. The fact that students do not take the entire assessment is an additional source of measurement uncertainty.
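
Rough arithmetic implied by these figures shows how little of the full item pool any one student sees (the block counts below are inferred from the times given above, not official figures):

    # Rough arithmetic implied by the paragraph above: a 5-6 hour assessment split
    # into 25-30 minute blocks yields on the order of 10-14 blocks, and a student
    # who takes 2 of them sees only a small fraction of the full item pool.
    full_minutes = (5 * 60, 6 * 60)          # 300 to 360 minutes
    block_minutes = (25, 30)
    blocks_low = full_minutes[0] // block_minutes[1]   # 300 / 30 = 10 blocks
    blocks_high = full_minutes[1] // block_minutes[0]  # 360 / 25 = 14 blocks
    print(blocks_low, blocks_high, 2 / blocks_high, 2 / blocks_low)  # 10 14 ~0.14 ~0.2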

Linking Variance

A third source of uncertainty affects some NAEP comparisons. In 2017, NAEP began its transition from paper and pencil format to digital format. In the digital assessment, the items are presented, and students respond, using a tablet. The transition from paper mode to digital mode required a special study in which two parallel (randomly equivalent) groups of students took the assessment, one on paper and the other on tablet. Based upon the responses of these two groups of students, the scale of the digital assessment was "linked" to the existing paper and pencil NAEP scale. Because the linking is based on samples of students, the link could have been slightly different if different students were sampled for the study. This source of uncertainty is termed "linking variance."

Linking variance is relevant when results from a digital assessment (e.g., 2017 mathematics) are compared to paper-based results (e.g., 2013 mathematics). Standard errors for comparisons within the same mode of testing, such as paper-to-paper comparisons (2015 vs. 2013 mathematics) or digital-to-digital comparisons (2017 boys' proficiency compared to 2017 girls' proficiency), do not include linking variance.
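
The sketch below, with hypothetical standard errors and a hypothetical linking-variance value, shows how the standard error of a comparison would pick up an extra term only when the two results come from different modes:

    import math

    def comparison_se(se_1, se_2, same_mode, linking_var=0.0):
        """Standard error of a difference between two NAEP estimates (sketch).

        When the two results come from different assessment modes (paper vs.
        digital), an additional linking-variance term is included; the specific
        numbers used below are hypothetical.
        """
        var = se_1**2 + se_2**2
        if not same_mode:
            var += linking_var
        return math.sqrt(var)

    print(comparison_se(0.9, 1.0, same_mode=True))                    # paper vs. paper
    print(comparison_se(0.9, 1.0, same_mode=False, linking_var=0.5))  # digital vs. paper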

Reported Standard Errors

The final variance of a NAEP result is the sum of the applicable sources of variance: sampling variance, measurement variance, and, for cross-mode comparisons, linking variance. The standard error is the square root of that variance.
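
For example, with hypothetical variance components:

    import math

    # Sketch of the computation described above: add the (hypothetical)
    # variance components that apply to a given result, then take the square root.
    sampling_var = 0.81
    measurement_var = 0.36
    linking_var = 0.25          # included only for cross-mode comparisons
    total_var = sampling_var + measurement_var + linking_var
    print(math.sqrt(total_var))  # standard error of about 1.19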


Last updated 14 October 2021 (AA)