# Statistical Significance and Sample Size

When the National Center for Education Statistics (NCES) reports differences in results, these results reflect statistical significance. Understanding statistical significance, how results are estimated, and the influence of sample size are important when interpreting NAEP data.

## Statistical Significance

The differences between scale scores and between percentages discussed in the results take into account the standard errors associated with the estimates. Comparisons are based on statistical tests that consider both the magnitude of the difference between the group average scores or percentages and the standard errors of those statistics. Throughout the results, differences between scores or between percentages are discussed only when they are significant from a statistical perspective.

All differences reported are significant at the 0.05 level with appropriate adjustments for multiple comparisons. The term "significant" is not intended to imply a judgment about the absolute magnitude or the educational relevance of the differences. It is intended to identify statistically dependable population differences to help inform dialogue among policymakers, educators, and the public.

Comparisons across states use a t-test to determine whether a difference is statistically significant. The t-test is the method most commonly used to evaluate the difference between the means of two groups. There are four possible outcomes when comparing the average scores of jurisdictions A and B:

• Jurisdiction A has a higher average score than jurisdiction B,
• Jurisdiction A has a lower average score than jurisdiction B,
• No difference in scores is detected between jurisdictions A and B, or
• The sample does not permit a reliable statistical test. (This may occur when the sample size for a particular group is small.)
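The t-test described above can be sketched in a few lines. This is a simplified illustration, not NAEP's exact procedure (which involves complex sampling weights and adjustments), and the score values and standard errors below are hypothetical:

```python
import math

def t_statistic(mean_a, se_a, mean_b, se_b):
    """t statistic for the difference between two independent means,
    computed from the standard errors of the two estimates."""
    return (mean_a - mean_b) / math.sqrt(se_a**2 + se_b**2)

# Hypothetical average scale scores and standard errors for two jurisdictions
t = t_statistic(mean_a=285.0, se_a=1.1, mean_b=282.0, se_b=1.3)
print(round(t, 2))  # |t| greater than about 1.96 suggests significance at the 0.05 level
```

With large samples, the statistic can be compared against the standard normal critical value of about 1.96 for a two-sided test at the 0.05 level.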

When comparing all jurisdictions to each other, the testing procedures are based on all pairwise combinations of the jurisdictions in a particular year or pair of years. A given state or jurisdiction may have a numerically higher average scale score than the nation or another state without the difference being statistically significant, while another state with the same average score may show a statistically significant difference compared to the nation or the other state. These situations arise because standard errors vary across states, jurisdictions, and estimates.
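This situation can be illustrated with hypothetical numbers: two states each score 3 points above the nation, but only the state with the smaller standard error shows a significant difference. The sketch below uses an unadjusted two-sided test at the 0.05 level; NAEP's actual procedure also adjusts for multiple comparisons:

```python
import math

def is_significant(diff, se_a, se_b, critical=1.96):
    """Simplified two-sided significance test at the 0.05 level
    (no multiple-comparison adjustment; not NAEP's exact procedure)."""
    return abs(diff) / math.sqrt(se_a**2 + se_b**2) > critical

# Both states are 3 points above the national average (national SE = 0.3),
# but the states' own standard errors differ (all values hypothetical).
print(is_significant(3.0, 1.0, 0.3))  # True:  3 / sqrt(1.09) is about 2.87
print(is_significant(3.0, 2.0, 0.3))  # False: 3 / sqrt(4.09) is about 1.48
```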

## Influence of Sample Size

### Sampling Variance

Like any survey based on a sample, NAEP results are subject to uncertainty. This uncertainty is reflected by the standard error of NAEP estimates; the more precise the estimate, the smaller the standard error.

The first source of uncertainty arises from the fact that NAEP assesses only a sample of students, rather than every eligible student (a census). The sample consists of a number of randomly selected students. Carefully constructed surveys can yield very precise estimates of population quantities, but a different, equally good sample of students could have been selected, and the results based on that second sample would be slightly different. Thus, the first component of the standard error is due to the sampling of students, termed "sampling variance."

In a good sampling design, the sampling variance decreases as the number of students selected increases. Larger groups will tend to have smaller standard errors than smaller groups. A NAEP national assessment typically contains about 10,000 students. Some NAEP assessments include separate, state-level samples of over 2,000 students per state, which are combined to produce national results. These state-national assessments result in total samples of approximately 140,000 students. Thus, results for the nation based on NAEP state-national assessments will have much smaller standard errors than results from NAEP national-only assessments.
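Under simple random sampling, the standard error of a mean shrinks with the square root of the sample size, which is why the larger combined samples yield smaller standard errors. NAEP's actual design is more complex (clustered, weighted sampling), and the standard deviation below is hypothetical, but the sketch conveys the idea:

```python
import math

def se_of_mean(sd, n):
    """Standard error of a sample mean under simple random sampling."""
    return sd / math.sqrt(n)

# Hypothetical population standard deviation of 35 scale-score points
print(round(se_of_mean(35, 10_000), 2))   # national-only sample of 10,000
print(round(se_of_mean(35, 140_000), 2))  # combined state-national sample
```

A fourteenfold increase in sample size cuts the standard error by a factor of roughly sqrt(14), or about 3.7.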

### Measurement Variance

The second source of uncertainty in NAEP results is due to “measurement.” Measurement variance arises from the fact that a student’s proficiency in a subject (e.g., how good the student is at mathematics) is not directly observed, but has to be estimated based on the answers that the student provides to the items on the assessment. It is possible that, were the assessment given on a different day, the student might provide slightly different answers. Similarly, a different version of the assessment, composed of different but equally valid items, would give slightly different estimates of students’ proficiency. These two factors give rise to what is typically termed “measurement variance.”
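When the two components are treated as independent, their variances add, and the overall standard error is the square root of the sum. This is a sketch of the general principle; NAEP's actual computation uses replicate weights and plausible values, and the variance components below are hypothetical:

```python
import math

def total_se(sampling_var, measurement_var):
    """Total standard error when sampling variance and measurement
    variance are treated as independent components."""
    return math.sqrt(sampling_var + measurement_var)

# Hypothetical variance components for an average scale score
print(round(total_se(0.9, 0.3), 2))
```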

NAEP assessments contain a third, related source of measurement uncertainty, due to the sampling of items. The contents of all NAEP assessments are created according to the specifications of a framework developed by the National Assessment Governing Board. NAEP frameworks are quite broad and multifaceted, and the resulting assessments are long. Taking the full assessment would require approximately 5-6 hours for each student, which is unreasonable to ask of students. To limit the burden on individual students, NAEP items are grouped into blocks requiring 25-30 minutes to complete, and each student receives a booklet containing two blocks. The fact that no student takes the entire assessment is an additional source of measurement uncertainty.