Three facts provide evidence of the validity of the constructs measured in NAEP assessments: the constructs measured by the NAEP scales are defined by the NAEP frameworks, NAEP items have been written specifically to fit those frameworks, and the items have been reviewed many times by content specialists. The studies of dimensionality that have been completed therefore focus not only on the number of dimensions that underlie the various NAEP instruments, but also on whether there is a sufficiently strong first dimension to support inferences about composite scales in NAEP, and on the usefulness of certain methodologies for studying dimensionality when data are collected using a balanced incomplete block (BIB) booklet design. If the items in the subscales that contribute to a composite score can reasonably be considered to measure the same construct, then each of the subscales containing those items would also be considered unidimensional. In addition to these studies of dimensionality for the group of items contributing to certain NAEP scales, each item is examined for fit to the NAEP scales by means of item fit statistics and comparisons of empirical and theoretical item response functions. In general, studies of dimensionality have shown that it is reasonable to treat the data for NAEP scales as unidimensional.
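The comparison of empirical and theoretical item response functions mentioned above can be illustrated with a short sketch. The Python example below is an assumption-laden illustration rather than the NAEP procedure: it assumes a three-parameter logistic item with the conventional scaling constant 1.7, treats each examinee's proficiency as known (in practice it must be estimated), and groups examinees into equal-sized proficiency bins. The function names and the choice of ten bins are hypothetical.

```python
import numpy as np


def irf_3pl(theta, a, b, c):
    """Three-parameter logistic item response function (scaling constant 1.7)."""
    return c + (1.0 - c) / (1.0 + np.exp(-1.7 * a * (theta - b)))


def empirical_vs_model(theta, scores, a, b, c, n_bins=10):
    """Compare observed and model-expected proportions correct by proficiency bin.

    theta  : proficiency values, assumed known here for illustration
    scores : 0/1 item scores for the same examinees
    Returns two arrays of length n_bins: observed and expected proportions.
    """
    # Quantile edges give bins with roughly equal numbers of examinees.
    edges = np.quantile(theta, np.linspace(0.0, 1.0, n_bins + 1))
    # Using only the interior edges yields bin indices 0 .. n_bins - 1.
    bins = np.digitize(theta, edges[1:-1])
    observed = np.array([scores[bins == k].mean() for k in range(n_bins)])
    expected = np.array(
        [irf_3pl(theta[bins == k], a, b, c).mean() for k in range(n_bins)]
    )
    return observed, expected


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    theta = rng.standard_normal(20000)
    # Simulated item with illustrative parameters; responses follow the model.
    scores = (rng.random(theta.size) < irf_3pl(theta, 1.0, 0.2, 0.2)).astype(float)
    obs, exp = empirical_vs_model(theta, scores, 1.0, 0.2, 0.2)
    for o, e in zip(obs, exp):
        print(f"observed {o:.3f}   expected {e:.3f}")
```

Large or systematic gaps between the observed and expected proportions would point to an item that does not fit the scale well; small, unsystematic gaps are what sampling error alone would produce.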
In an early study, the dimensionality of NAEP reading assessment data collected during the 1983-84 academic year was examined by Zwick (1986, 1987). Zwick also studied simulated data designed to mirror the NAEP reading item response data but having known dimensionality. Analysis of the simulated datasets allowed her to determine whether the BIB booklet design artificially increases dimensionality. Zwick found substantial agreement among the various statistical procedures, and she found that the results obtained under BIB booklet designs were similar to the results for complete datasets. Overall, she concluded that "it is not unreasonable to treat the data as unidimensional" (1987, p. 306).
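The logic of checking whether a BIB booklet design inflates apparent dimensionality can be sketched with simulated data. The Python example below is a minimal illustration and not a reconstruction of Zwick's analysis. It generates unidimensional responses from a two-parameter logistic model, withholds items according to a hypothetical seven-block, seven-booklet design in which every pair of blocks shares exactly one booklet, and compares the ratio of the first to the second eigenvalue of the inter-item correlation matrix for the complete and incomplete data. The sample size, item parameters, booklet layout, and use of Pearson correlations with pairwise deletion are all assumptions chosen for brevity.

```python
import numpy as np

# Hypothetical BIB layout: booklet k holds blocks k, k+1, k+3 (mod 7).
# With 7 blocks and 3 blocks per booklet, each pair of blocks
# appears together in exactly one booklet.
BOOKLETS = [(k % 7, (k + 1) % 7, (k + 3) % 7) for k in range(7)]


def simulate_responses(n_students, n_items, rng):
    """Simulate 0/1 responses from a unidimensional two-parameter logistic model."""
    theta = rng.standard_normal(n_students)
    a = rng.uniform(0.8, 1.6, n_items)    # illustrative discriminations
    b = rng.uniform(-1.5, 1.5, n_items)   # illustrative difficulties
    p = 1.0 / (1.0 + np.exp(-a * (theta[:, None] - b)))
    return (rng.random((n_students, n_items)) < p).astype(float)


def apply_bib(responses, items_per_block, rng):
    """Keep only the items in each student's randomly assigned booklet."""
    n_students, n_items = responses.shape
    block_of_item = np.arange(n_items) // items_per_block
    assigned = rng.integers(0, len(BOOKLETS), n_students)
    masked = np.full_like(responses, np.nan)
    for k, blocks in enumerate(BOOKLETS):
        rows = np.where(assigned == k)[0]
        cols = np.where(np.isin(block_of_item, blocks))[0]
        masked[np.ix_(rows, cols)] = responses[np.ix_(rows, cols)]
    return masked


def pairwise_corr(x):
    """Pearson correlations using, for each item pair, students who saw both."""
    n_items = x.shape[1]
    r = np.eye(n_items)
    for i in range(n_items):
        for j in range(i + 1, n_items):
            both = ~np.isnan(x[:, i]) & ~np.isnan(x[:, j])
            r[i, j] = r[j, i] = np.corrcoef(x[both, i], x[both, j])[0, 1]
    return r


def eigen_ratio(r):
    """Ratio of the largest to the second-largest eigenvalue."""
    ev = np.sort(np.linalg.eigvalsh(r))[::-1]
    return ev[0] / ev[1]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    full = simulate_responses(7000, 35, rng)
    bib = apply_bib(full, items_per_block=5, rng=rng)
    print("complete data ratio:", eigen_ratio(pairwise_corr(full)))
    print("BIB data ratio:     ", eigen_ratio(pairwise_corr(bib)))
```

Because the simulated data have one dimension by construction, a first eigenvalue that stays large relative to the second under both conditions would be consistent with a design that does not add dimensions by itself. The eigenvalue ratio is only one of many possible indicators and is used here for simplicity.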
Rock (1991) studied the dimensionality of the NAEP mathematics and science tests from the 1990 assessment using confirmatory factor analysis. His conclusion was that there was little evidence for "discriminant validity" except for the geometry scale at the eighth-grade level and that "we are doing little damage in using a composite score in mathematics and science" (p. 2).
A second-order factor model was used by Muthén (1991) in a further analysis of Rock's mathematics data to examine subgroup differences in dimensionality. Evidence of content-specific variation within subgroups was found, but the average (across seven booklets) percentages of such variation were very small, ranging from essentially 0 to 22 percent, and two-thirds of these percentages were smaller than 10 percent.
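As a rough illustration of what a percentage of content-specific variation means in a second-order factor model, suppose each content-area factor $\eta_k$ loads on a general factor $\xi$ with loading $\lambda_k$ and has a residual $\zeta_k$ that is uncorrelated with $\xi$. The notation is assumed here for illustration and is not taken from Muthén's paper.

```latex
\eta_k = \lambda_k \xi + \zeta_k,
\qquad
\text{percent specific}_k
  = 100 \times
    \frac{\operatorname{Var}(\zeta_k)}
         {\lambda_k^{2}\operatorname{Var}(\xi) + \operatorname{Var}(\zeta_k)}
```

Under this sketch, a value below 10 would mean that more than 90 percent of the variance in that content-area factor is shared with the general factor.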
Carlson and Jirele (1992) examined 1990 NAEP mathematics data. Analyses of simulated one-dimensional data were also conducted, and the fit to the simulated data was slightly better than the fit to the real NAEP data. Although there was some evidence suggesting more than one dimension in the NAEP data, the strength of the first dimension led the authors to conclude that the data "are sufficiently unidimensional to support the use of a composite scale for describing the NAEP mathematics data, but that there is evidence that two dimensions would better fit the data than one" (p. 31).
Carlson (1993) studied the dimensionality of the 1992 mathematics and reading assessments. The relative sizes of the fit statistics for simulated and actual data suggested that lack of fit may be due more to the BIB booklet design of NAEP than to the number of dimensions fitted. Kaplan (1995) similarly found that the chi-squared goodness-of-fit statistic in the "maximum likelihood factor analysis" model was inflated when data were generated using a BIB design. The fit statistics for the incomplete simulation conditions (a BIB design, as in the actual NAEP assessment) were more like those for the real data than were those for the simulation of a complete data matrix. Consistent with the findings of Zwick (1986, 1987), however, the incomplete design for data collection used in NAEP does not appear to artificially inflate the number of dimensions identified using these procedures.