
2.3 Statistical Testing

Bivariate comparisons drawn in the text of this report have been tested for statistical significance at the .05 level using t statistics to ensure that the differences are larger than those that might be expected due to sampling variation. In analyses using a large sample, such as the one used in this report, the standard errors accompanying estimates are often small, and thus small differences between groups are often found to be statistically significant. Because tests of statistical significance reveal whether a relationship between variables is statistically reliable but say little about the strength of that relationship, strength-of-effect measures were computed to accompany all statistical tests and were used as a second criterion for determining whether a result could be reported.
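For concreteness, a t test of the difference between two independent subgroup estimates can be sketched as below. This is a minimal illustration, assuming independent estimates, standard errors that already reflect the survey's sample design, and the conventional large-sample two-tailed critical value of 1.96 at the .05 level; the function name and the example values are illustrative assumptions, not the report's own computation.

    import math

    def t_statistic(estimate_a, se_a, estimate_b, se_b):
        # t statistic for the difference between two independent estimates,
        # using standard errors that already reflect the sample design.
        return (estimate_a - estimate_b) / math.sqrt(se_a**2 + se_b**2)

    # Hypothetical subgroup estimates and standard errors (illustrative values).
    t = t_statistic(51.3, 0.4, 49.8, 0.5)
    print(abs(t) > 1.96)  # two-tailed test at the .05 level, large sample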

One measure of strength of effect is the effect size. Effect size is the estimated difference between the mean of population A and the mean of population B, divided by the pooled standard deviation of the two populations. The effect size thus indicates the magnitude of the estimated difference in terms of the number of standard deviations separating the means of the two groups. (A standard deviation is a statistical measure of the extent to which values are spread around their mean.) The reporting criterion applied to differences in means was an effect size (Cohen's d) of 0.2, or one-fifth of a standard deviation (Cohen 1988). When effect sizes were evaluated, the proficiency probability scores, like the IRT-estimated number-right scores, were treated as means and were therefore subject to the 0.2 standard deviation criterion.

Tables in this report, however, supply estimated proportions as well as means. For comparisons involving percentage differences between subgroups, a separate strength-of-effect criterion was therefore set: a minimum difference of 5 percentage points.
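As an illustration of how the two reporting criteria operate, the sketch below computes Cohen's d from subgroup summary statistics and applies both thresholds. The conventional sample-size-weighted pooling formula and the function names are illustrative assumptions rather than the report's own procedure.

    import math

    def cohens_d(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
        # Cohen's d: difference between two means divided by the
        # pooled standard deviation of the two groups.
        pooled_sd = math.sqrt(((n_a - 1) * sd_a**2 + (n_b - 1) * sd_b**2)
                              / (n_a + n_b - 2))
        return (mean_a - mean_b) / pooled_sd

    def reportable_mean_difference(d):
        # Reporting criterion for means: an effect size of at least 0.2.
        return abs(d) >= 0.2

    def reportable_percentage_difference(pct_a, pct_b):
        # Reporting criterion for proportions: at least 5 percentage points.
        return abs(pct_a - pct_b) >= 5.0

Under these criteria, a statistically significant difference would be discussed in the text only if it also cleared the relevant threshold.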