To estimate the scale score distributions for populations and subgroups of students, NAEP uses population-structure models. These models are fit separately for each grade of an assessed subject and for each sample of students. In NAEP state assessments, the models are fit separately for each participating jurisdiction (state, territory, the District of Columbia, or Department of Defense educational unit). When a subject area has more than one scale, the models are defined for multivariate scale score distributions.
The population-structure model relates underlying performance, θ, as defined by Item Response Theory (IRT) models, to background membership, y, through the parameters Γ and Σ using the equation
θ = Γ′y + ε
where ε has a multivariate normal distribution with mean zero and variance-covariance matrix Σ.
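The structure of the model can be illustrated with a small simulation. The sketch below assumes the linear form θ = Γ′y + ε with two scales and two background variables; the function names and all numerical values are hypothetical and chosen only for illustration, not taken from any NAEP program.

```python
import math
import random

def conditional_mean(gamma, y):
    """Return Gamma'y. gamma has one row per background variable
    and one column per scale; y is one student's background vector."""
    n_scales = len(gamma[0])
    return [sum(gamma[k][j] * y[k] for k in range(len(y)))
            for j in range(n_scales)]

def cholesky_2x2(sigma):
    """Lower-triangular L with L L' = sigma for a 2x2 covariance matrix."""
    l11 = math.sqrt(sigma[0][0])
    l21 = sigma[0][1] / l11
    l22 = math.sqrt(sigma[1][1] - l21 * l21)
    return [[l11, 0.0], [l21, l22]]

def draw_theta(gamma, sigma, y, rng):
    """Draw one theta = Gamma'y + eps, with eps ~ N(0, sigma), for two scales."""
    mean = conditional_mean(gamma, y)
    L = cholesky_2x2(sigma)
    z = [rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)]
    # Correlated residuals are obtained as eps = L z.
    eps = [L[0][0] * z[0], L[1][0] * z[0] + L[1][1] * z[1]]
    return [m + e for m, e in zip(mean, eps)]

# Hypothetical values: an intercept plus one group indicator, two scales.
gamma = [[0.0, 0.1],   # intercept row
         [0.5, 0.4]]   # group-indicator row
sigma = [[1.0, 0.6],
         [0.6, 1.0]]
y = [1.0, 1.0]         # student belongs to the indicated group
theta = draw_theta(gamma, sigma, y, random.Random(0))
```

Here Γ shifts the mean of the performance distribution for students with a given background profile, while Σ describes the spread and correlation of performance that the background variables do not explain.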
Estimates of Γ and Σ are calculated using marginal maximum likelihood methods, which integrate out the underlying performance of each individual student. This allows Γ and Σ to be estimated directly for the entire sample of students. The marginal maximum likelihood methods used in NAEP are iterative procedures in which an initial distribution of scale scores is assumed for the sample of students. Based on this initial distribution and the item parameter estimates from the IRT models (which are treated as fixed), interim estimates of the population-structure model parameters are calculated. These interim population-structure parameter estimates are then used to calculate a new, improved interim distribution of scale scores, from which new interim estimates of the population-structure model parameters are calculated. This procedure is repeated until the numerical values of the population-structure model parameters and the scale score distributions converge on the estimates that best fit the population-structure model. Once estimated, the population-structure parameters are used in subsequent calculations to estimate the scale score distributions for groups of students.
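The iterative procedure described above can be sketched as an EM-style loop. This is a minimal illustration under strong simplifying assumptions, not the algorithm implemented in the NAEP programs: it uses a single scale, a single background variable plus an intercept, a two-parameter logistic IRT model with fixed item parameters, and a fixed grid of points to approximate the integral over θ. All function names and numerical settings are hypothetical.

```python
import math
import random

def p_correct(theta, a, b):
    """2PL probability of a correct response (an assumed IRT model)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def irt_likelihood(responses, items, grid):
    """Likelihood of one response pattern at each grid value of theta."""
    like = []
    for theta in grid:
        value = 1.0
        for x, (a, b) in zip(responses, items):
            p = p_correct(theta, a, b)
            value *= p if x == 1 else 1.0 - p
        like.append(value)
    return like

def fit_population_model(data, items, n_iter=30):
    """Estimate g0, g1, s2 in theta = g0 + g1*y + eps, eps ~ N(0, s2).
    data is a list of (y, responses); item parameters stay fixed."""
    grid = [-5.0 + 0.25 * q for q in range(41)]
    likes = [irt_likelihood(resp, items, grid) for _, resp in data]
    ys = [y for y, _ in data]
    n = len(data)
    g0, g1, s2 = 0.0, 0.0, 1.0  # initial assumed distribution
    for _ in range(n_iter):
        # E-step: posterior mean and variance of theta for each student,
        # combining the fixed IRT likelihood with the current population model.
        means, variances = [], []
        for y, like in zip(ys, likes):
            mu = g0 + g1 * y
            w = [l * math.exp(-0.5 * (t - mu) ** 2 / s2)
                 for l, t in zip(like, grid)]
            total = sum(w)
            m = sum(wi * t for wi, t in zip(w, grid)) / total
            v = sum(wi * (t - m) ** 2 for wi, t in zip(w, grid)) / total
            means.append(m)
            variances.append(v)
        # M-step: regress posterior means on y, then update the
        # residual variance using posterior variances plus squared residuals.
        y_bar = sum(ys) / n
        m_bar = sum(means) / n
        sxy = sum((y - y_bar) * (m - m_bar) for y, m in zip(ys, means))
        sxx = sum((y - y_bar) ** 2 for y in ys)
        g1 = sxy / sxx
        g0 = m_bar - g1 * y_bar
        s2 = sum(v + (m - g0 - g1 * y) ** 2
                 for y, m, v in zip(ys, means, variances)) / n
    return g0, g1, s2

def simulate(n, items, g0, g1, sd, rng):
    """Generate (y, responses) pairs from the assumed model, for testing."""
    data = []
    for i in range(n):
        y = i % 2  # alternate students between two groups
        theta = g0 + g1 * y + rng.gauss(0.0, sd)
        resp = [1 if rng.random() < p_correct(theta, a, b) else 0
                for a, b in items]
        data.append((y, resp))
    return data
```

As a usage example, calling `fit_population_model(simulate(600, items, 0.0, 1.0, 1.0, random.Random(1)), items)` with a list of `(a, b)` item parameter pairs returns estimates of the intercept, the group effect, and the residual variance. Note that no individual student score is ever treated as known: each student contributes only a posterior distribution over θ, which is what integrating out the underlying performance amounts to in this sketch.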
For NAEP assessments in which one IRT scale or two IRT subscales are created, a computer program called BGROUP estimates the parameters of the population-structure model (Thomas 1994). For subject areas with multiple scales, the calculations are more complex, and a program called CGROUP, which uses different estimation routines, is used instead (Thomas 1993a). For estimation of group means on a single scale, CGROUP and BGROUP results will be nearly identical (Thomas 1994). For more details on the estimation procedures currently used in NAEP, refer to Mazzeo, Donoghue, and Johnson (2002).