NAEP Technical Documentation

Estimation of IRT Item Parameters

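The probability that a student with an underlying performance level of θ_k on scale k gives response i to item j is P_ji(θ_k), where P_ji(θ_k) takes the form appropriate to the type of item (dichotomous or polytomous). After the vector of responses x is observed,

$$L(\theta_k \mid \mathbf{x}) = \prod_{j} \prod_{i=0}^{m_j - 1} P_{ji}(\theta_k)^{u_{ji}}$$

can be viewed as a likelihood function, where the product over j runs over the items the student answered, m_j is equal to 2 for dichotomous items or to the number of categories defined for polytomous items, and u_ji is an indicator variable defined by

$$u_{ji} = \begin{cases} 1 & \text{if the response to item } j \text{ is in category } i, \\ 0 & \text{otherwise.} \end{cases}$$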

The likelihood function relates a student's responses to the parameters of the items the student answered and to the student's underlying performance level. The item parameters for NAEP assessments are estimated using marginal maximum likelihood methods. These are iterative procedures that begin by assuming an initial distribution of scale scores for the sample of students. Given that distribution, interim estimates of the item parameters are calculated. These interim estimates are then used to calculate a revised interim distribution of scale scores, from which new interim estimates of the item parameters are calculated. The cycle is repeated until the item parameter estimates and the scale score distribution stop changing appreciably, that is, until they converge on the estimates that best fit the IRT model.

After they are estimated, the item parameters are treated as known in subsequent calculations. A likelihood function for a student's scale score θ_k is then induced by the student's vector of responses to the subset of calibrated items that the student answered. Because this likelihood involves only the items a student actually answered, it makes it possible to draw inferences about scale score distributions from samples in which no student is administered all of the items.
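The likelihood described above can be evaluated directly once item parameters are treated as known. The sketch below is a minimal illustration, not NAEP code: it assumes a three-parameter logistic (3PL) form for dichotomous items and a generalized partial credit form for polytomous items (this section gives neither formula), uses an assumed scaling constant D = 1.7 and hypothetical item parameters, and evaluates the product over items and categories on a grid of θ values. Items a student was not administered are simply skipped, which is what lets the likelihood work when no student sees every item.

```python
import math

# Assumed item response functions: the section says only that P_ji(theta)
# has "the form appropriate to the type of item". The 3PL (dichotomous) and
# generalized partial credit (polytomous) forms here are illustrative choices.
D = 1.7  # assumed logistic scaling constant

def p_3pl(theta, a, b, c):
    """[P(score 0), P(score 1)] for a dichotomous item (m_j = 2)."""
    p1 = c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))
    return [1.0 - p1, p1]

def p_gpcm(theta, a, b, d):
    """Probabilities for categories 0..m_j-1 of a polytomous item, m_j = len(d)."""
    cum, s = [], 0.0
    for dv in d:
        s += D * a * (theta - b + dv)
        cum.append(s)
    top = max(cum)  # subtract the max before exp() for numerical stability
    e = [math.exp(z - top) for z in cum]
    total = sum(e)
    return [v / total for v in e]

def likelihood(theta, items, responses):
    """prod_j prod_i P_ji(theta) ** u_ji, with u_ji = 1 if the response to
    item j is category i and 0 otherwise; items not administered (None) are skipped."""
    value = 1.0
    for prob_fn, x_j in zip(items, responses):
        if x_j is None:
            continue
        for i, p in enumerate(prob_fn(theta)):
            u_ji = 1 if x_j == i else 0
            value *= p ** u_ji
    return value

# Hypothetical calibrated items, treated as known (provisional 0, 1 scale)
items = [
    lambda t: p_3pl(t, a=1.0, b=0.0, c=0.2),
    lambda t: p_3pl(t, a=0.8, b=1.0, c=0.25),
    lambda t: p_gpcm(t, a=0.7, b=0.5, d=[0.0, 0.8, -0.8]),
]
responses = [0, None, 2]  # the second item was not administered to this student

grid = [g / 10 for g in range(-40, 41)]
values = [likelihood(t, items, responses) for t in grid]
theta_mode = grid[values.index(max(values))]
```

Evaluating on a grid keeps the example transparent; the grid maximum is only a crude stand-in for the mode of the likelihood.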

For the purposes of estimating and reporting item parameters and other intermediate estimates, the linear indeterminacy of the item response theory (IRT) models is resolved by an arbitrary choice of origin and unit size for a given scale. In most cases, a provisional scale is used on which the scale score distribution is standardized to have mean 0 and standard deviation 1, so the item parameters for NAEP are reported on this provisional (0, 1) scale. Final results for each subject area are then linearly transformed to the appropriate reporting scale.
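Because the origin and unit are arbitrary, changing them leaves the model's probabilities unchanged as long as the item parameters are re-expressed accordingly. A minimal sketch, assuming a 3PL form with scaling constant D = 1.7 (neither is specified in this section) and hypothetical transformation constants A and B: if θ* = Aθ + B, then a* = a/A, b* = Ab + B, and c is unchanged, because a*(θ* − b*) = a(θ − b).

```python
import math

D = 1.7  # assumed logistic scaling constant

def p_3pl(theta, a, b, c):
    """3PL probability of a correct response (assumed item form)."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def rescale_item(a, b, A, B):
    """Re-express slope and location for theta* = A * theta + B, with A > 0.
    The lower asymptote c does not depend on the theta metric."""
    return a / A, A * b + B

# Hypothetical item on the provisional (mean 0, SD 1) scale
a, b, c = 1.2, -0.4, 0.2
A, B = 35.0, 250.0  # hypothetical transformation constants
a_star, b_star = rescale_item(a, b, A, B)

# Probabilities agree at every theta: the model cannot tell the two metrics apart.
for theta in (-2.0, 0.0, 1.5):
    theta_star = A * theta + B
    assert math.isclose(p_3pl(theta, a, b, c), p_3pl(theta_star, a_star, b_star, c))
```

The same A and B carry the provisional mean 0 and standard deviation 1 to B and A on the transformed scale.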

In NAEP analyses of subject areas with multiple scales (i.e., national main geography, mathematics, reading, science, U.S. history, and music), the parameters of the items in each subscale are estimated independently of the parameters of the items in the other subscales.

Item parameter estimates were obtained using the NAEP BILOG/PARSCALE program, which combines Mislevy and Bock's (1982) BILOG and Muraki and Bock's (1991) PARSCALE computer programs and concurrently estimates parameters for all items, both dichotomous and polytomous. The NAEP BILOG/PARSCALE program has also been adapted to use student sampling weights.
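The marginal maximum likelihood cycle, with student sampling weights folded in, can be sketched as a toy expectation-maximization (EM) loop. This is a deliberate simplification and not the NAEP BILOG/PARSCALE algorithm: it handles only dichotomous items under a Rasch (one-parameter logistic) model, holds the scale score distribution fixed at a discretized N(0, 1) on the provisional scale, and multiplies each student's contribution to the expected counts by a sampling weight, which is one simple way to use weights that this section does not confirm. Each student's posterior over the quadrature nodes plays the role of the interim distribution of scale scores, and the M-step updates the interim item parameters. The function names are hypothetical.

```python
import math

def normal_quadrature(n_nodes=31, lo=-4.0, hi=4.0):
    """Equally spaced nodes with normalized N(0, 1) weights (the assumed prior)."""
    nodes = [lo + (hi - lo) * q / (n_nodes - 1) for q in range(n_nodes)]
    dens = [math.exp(-0.5 * x * x) for x in nodes]
    total = sum(dens)
    return nodes, [v / total for v in dens]

def p_rasch(theta, b):
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def em_rasch(responses, weights, n_items, n_cycles=40):
    """Weighted marginal ML (EM) for Rasch difficulties.
    responses[s][j] is 1, 0, or None (item not administered to student s);
    weights[s] is student s's sampling weight."""
    nodes, prior = normal_quadrature()
    b = [0.0] * n_items  # initial item parameters
    for _ in range(n_cycles):
        # E-step: each student's posterior over the nodes, given current b,
        # accumulated into weighted expected counts per item and node.
        n_bar = [[0.0] * len(nodes) for _ in range(n_items)]
        r_bar = [[0.0] * len(nodes) for _ in range(n_items)]
        for row, w in zip(responses, weights):
            post = []
            for X, A in zip(nodes, prior):
                lik = A
                for j, u in enumerate(row):
                    if u is not None:
                        p = p_rasch(X, b[j])
                        lik *= p if u == 1 else 1.0 - p
                post.append(lik)
            total = sum(post)
            for j, u in enumerate(row):
                if u is None:
                    continue
                for q, pq in enumerate(post):
                    n_bar[j][q] += w * pq / total
                    r_bar[j][q] += w * u * pq / total
        # M-step: one Newton step per item on the expected log-likelihood.
        for j in range(n_items):
            grad = info = 0.0
            for X, n, r in zip(nodes, n_bar[j], r_bar[j]):
                p = p_rasch(X, b[j])
                grad += n * p - r          # d(logL)/db
                info += n * p * (1.0 - p)  # -d2(logL)/db2
            if info > 0.0:
                b[j] += grad / info
    return b
```

Calling `em_rasch(data, wts, n_items=4)` on responses in which each student skipped one item returns estimated difficulties on the provisional scale. Fixing the prior at mean 0 and standard deviation 1 is what pins down the origin and unit here; a fuller sketch would also re-estimate the latent distribution between cycles and iterate the M-step to convergence.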

Last updated 15 January 2010 (GF)
