The data from the national main assessment samples prior to 2002 and the combined national and state samples from 2002 onward are scaled using Item Response Theory (IRT) models. The frameworks for the different subject areas dictate the number of IRT scales for each subject. For dichotomously scored items, two- and three-parameter logistic forms of the model are used; for polytomously scored items, the generalized partial credit model is used. Both item types and their models are combined in the NAEP scales. Item parameter estimates on a provisional scale are obtained using the NAEP BILOG (Mislevy & Bock, 1982)/PARSCALE (Muraki & Bock, 1997) program. The fit of the IRT model to the observed data is examined within each scale by comparing the empirical item response functions with the estimated theoretical curves. For items in trend assessments, plots of the empirical item response functions and estimated theoretical curves are compared across assessments. The differential item functioning (DIF) analyses also provide information related to model fit across subpopulations.
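The model forms named above can be written out explicitly. The following are the standard parameterizations from the IRT literature; the exact parameter conventions used in NAEP's operational scaling may differ in minor details. For a dichotomously scored item $i$, the three-parameter logistic (3PL) model gives the probability of a correct response as a function of the latent proficiency $\theta$:

$$P_i(\theta) = c_i + \frac{1 - c_i}{1 + \exp\!\left[-Da_i(\theta - b_i)\right]},$$

where $a_i$ is the discrimination parameter, $b_i$ the difficulty, $c_i$ the lower asymptote (guessing) parameter, and $D$ a scaling constant (commonly 1.7). The two-parameter logistic (2PL) model is the special case $c_i = 0$. For a polytomously scored item with score categories $k = 0, \ldots, m_i - 1$, the generalized partial credit model gives

$$P_{ik}(\theta) = \frac{\exp\!\left(\sum_{v=0}^{k} Da_i(\theta - b_i + d_{iv})\right)}{\sum_{c=0}^{m_i-1} \exp\!\left(\sum_{v=0}^{c} Da_i(\theta - b_i + d_{iv})\right)}, \qquad d_{i0} \equiv 0,$$

where the $d_{iv}$ are step (category) parameters.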
Three distinct scaling models, depending on item type and scoring procedure, are used in the analysis of NAEP data. Each model is a "latent variable" model based on IRT (Lord, 1980) and is defined separately for each scale. These models express a respondent's tendency to achieve certain scores (such as correct or incorrect) on the items contributing to a scale as a function of a parameter that is not directly observed.
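To illustrate the latent variable idea, the sketch below computes response probabilities as a function of the unobserved proficiency parameter, conventionally denoted theta. The function names, the parameter values in the examples, and the scaling constant D = 1.7 are illustrative choices, not taken from NAEP's operational code; the formulas are the standard 3PL and generalized partial credit parameterizations.

```python
import math


def p_3pl(theta, a, b, c, D=1.7):
    """Three-parameter logistic (3PL) probability of a correct response.

    theta -- latent proficiency (the parameter that is not directly observed)
    a, b, c -- discrimination, difficulty, and lower-asymptote (guessing)
               parameters for the item
    D -- scaling constant; 1.7 makes the logistic approximate the normal ogive
    """
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def p_gpcm(theta, a, b, steps, D=1.7):
    """Generalized partial credit model category probabilities.

    steps -- step parameters d_1..d_m for categories 1..m; the step for
             category 0 is fixed at 0 by convention
    Returns a list of probabilities for score categories 0..m, summing to 1.
    """
    # Cumulative sums of D*a*(theta - b + d_v) over steps; category 0 gets 0.
    z = [0.0]
    for d in steps:
        z.append(z[-1] + D * a * (theta - b + d))
    exps = [math.exp(zk) for zk in z]
    total = sum(exps)
    return [e / total for e in exps]


# A respondent of average proficiency (theta = 0) on an item of average
# difficulty (b = 0) with guessing parameter c = 0.2 answers correctly
# with probability 0.2 + 0.8 * 0.5 = 0.6.
print(p_3pl(0.0, a=1.0, b=0.0, c=0.2))  # -> 0.6

# A three-category polytomous item: probabilities over scores 0, 1, 2.
print(p_gpcm(0.5, a=1.0, b=0.0, steps=[-0.5, 0.5]))
```

Plotting either function over a grid of theta values reproduces the theoretical item response curves that, as described above, are compared against the empirical curves to check model fit.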