Starting in 1998, the False Discovery Rate (FDR) procedure was used to increase the power of statistical tests in NAEP. Under FDR, the expected proportion of falsely rejected hypotheses is controlled. Hence, if an α of 0.05 is selected, about five percent of the rejected null hypotheses are expected to have been rejected incorrectly, while about 95 percent are expected to have been rejected correctly. Familywise procedures are considered conservative for large families of comparisons. Therefore, the FDR procedure is more suitable for multiple comparisons in NAEP than other procedures (Williams, Jones, and Tukey 1999). The FDR procedure used in NAEP is described in Benjamini and Hochberg (1995).
Frequently, groups (or families) of comparisons are made and presented as a single set. The appropriate text, usually a set of sentences or a paragraph, is selected for inclusion in a report based on the results for the entire set of comparisons. For example, some reports contain a section that compares average scale scores for a predetermined group, generally the majority group (in the case of race/ethnicity, for example, White students), with those obtained by the other groups. The entire set of tests was presented in the summary data tables. The t test used by NAEP and the certainty ascribed to intervals (e.g., a 95 percent confidence interval) are based on statistical theory that assumes that only one confidence interval or test of statistical significance is being performed. However, in some sections of a report, many different groups are compared (i.e., multiple sets of confidence intervals are being analyzed). For a set of confidence intervals, statistical theory indicates that the certainty associated with the entire set of intervals is less than that attributable to each individual comparison from the set. To hold the significance level for the set of comparisons at a particular level (e.g., 0.05), adjustments, called multiple comparison procedures, must be made to the methods. One such procedure, the FDR procedure (Benjamini and Hochberg 1995), was used to control the certainty level.
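To illustrate why the certainty for an entire set is lower than for any single comparison, the following minimal sketch computes the chance of at least one false rejection across m tests. It rests on simplifying assumptions that are not part of the NAEP description: the tests are treated as independent and every null hypothesis is taken to be true. The function name is hypothetical.

```python
def familywise_error_rate(alpha: float, m: int) -> float:
    """Probability of at least one false rejection among m tests, each run
    at level alpha, when every null hypothesis is true.

    Assumes the m tests are independent (an illustrative simplification).
    """
    return 1.0 - (1.0 - alpha) ** m


# The chance of at least one false rejection grows quickly with m.
for m in (1, 5, 10, 20):
    print(m, round(familywise_error_rate(0.05, m), 3))
```

Under these assumptions, ten unadjusted comparisons at the 0.05 level already carry roughly a 40 percent chance of at least one false rejection, which is why an adjustment is needed.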
The Benjamini and Hochberg application of the FDR criterion can be described as follows. Let q be the number of significance tests made and let P1 ≤ P2 ≤ . . . ≤ Pq be the ordered significance levels of the q tests, from lowest to highest probability. Let α be the combined significance level desired, usually 0.05 for one-tailed tests (or 0.025 for two-tailed tests). Working down from the largest value, the procedure compares Pq with α, Pq−1 with α(q − 1)/q, . . ., P1 with α/q, stopping the comparisons with the first j such that Pj ≤ α · j/q. All tests associated with P1, . . ., Pj are declared significant; all tests associated with Pj+1, . . ., Pq are declared nonsignificant.
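The step-up comparison described above can be sketched in a few lines of Python. This is an illustrative sketch, not NAEP's production code: the function name `fdr_significant` and the example p-values are hypothetical, and the p-values are assumed to be supplied already on the scale that matches α.

```python
from typing import List, Sequence


def fdr_significant(p_values: Sequence[float], alpha: float = 0.05) -> List[bool]:
    """Return, for each test in its original order, whether the step-up
    FDR procedure declares it significant."""
    q = len(p_values)
    # Indices of the tests sorted by ascending p-value (P1 <= ... <= Pq).
    order = sorted(range(q), key=lambda i: p_values[i])

    # Work down from the largest p-value and stop at the first rank j
    # (1-based) for which Pj <= alpha * j / q.
    cutoff = 0
    for j in range(q, 0, -1):
        if p_values[order[j - 1]] <= alpha * j / q:
            cutoff = j
            break

    # Tests ranked 1..cutoff are significant; the rest are not.
    significant = [False] * q
    for rank in range(cutoff):
        significant[order[rank]] = True
    return significant


# Hypothetical p-values for a family of five comparisons.
p = [0.041, 0.001, 0.30, 0.012, 0.027]
print(fdr_significant(p))  # [False, True, False, True, True]
```

In this made-up example the comparisons stop at j = 3, because 0.027 ≤ 0.05 · 3/5 = 0.03, so the three smallest p-values are declared significant. The value 0.041 is not, because it exceeds its threshold of 0.05 · 4/5 = 0.04.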
Unlike other multiple comparison procedures (e.g., the Bonferroni procedure) that control the familywise error rate (i.e., the probability of making even one false rejection in the set of comparisons), the FDR procedure controls the expected proportion of falsely rejected hypotheses. Furthermore, familywise procedures are considered conservative for large families of comparisons (Williams, Jones, and Tukey 1999). Therefore, the FDR procedure is more suitable for multiple comparisons in NAEP than other procedures.
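The difference in conservativeness can be seen by counting how many tests each approach declares significant for the same family. The sketch below is illustrative only: the function name `count_rejections` and the p-values are hypothetical, the Bonferroni rule is taken in its simplest form (every p-value compared with α/q), and the FDR count uses the largest rank j with Pj ≤ α · j/q.

```python
def count_rejections(p_values, alpha=0.05):
    """Return (bonferroni_count, fdr_count) for one family of tests."""
    q = len(p_values)
    ordered = sorted(p_values)

    # Bonferroni: every p-value faces the same threshold alpha / q.
    bonferroni = sum(1 for p in ordered if p <= alpha / q)

    # Step-up FDR: the largest 1-based rank j with Pj <= alpha * j / q;
    # all tests ranked 1..j are declared significant.
    fdr = max(
        (j for j in range(1, q + 1) if ordered[j - 1] <= alpha * j / q),
        default=0,
    )
    return bonferroni, fdr


# Hypothetical p-values for a family of five comparisons.
print(count_rejections([0.001, 0.012, 0.027, 0.041, 0.30]))  # (1, 3)
```

With these made-up values the Bonferroni threshold of 0.05/5 = 0.01 leaves one significant test, while the FDR procedure declares three significant. The FDR count can never be smaller than the Bonferroni count, since any p-value at or below α/q also satisfies Pj ≤ α · j/q.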