A reasonable question is why NAEP uses population-structure models when other assessment programs do not. Like NAEP, these other programs often use Item Response Theory (IRT) models to describe the relationship between student-level item responses and item characteristics. The basic difference between these programs and NAEP lies in the primary purpose of the assessment. In NAEP, the primary purpose is to provide information about what populations and groups of students know and can do. In other assessment programs, the purpose is to provide information about what individual students know and can do.
IRT was developed in the context of measuring individual examinees' abilities. In that setting, each individual is administered enough items (often 60 or more) to permit precise estimation of his or her true scale score. Because the uncertainty associated with each individual's scale score estimate is negligible, the distribution of scale scores for a group of students, or the joint distribution of scale scores with other variables, can then be approximated using individuals' estimated scale scores as if they were true scale score values.
This approach breaks down in the assessment setting when, in order to provide broader content coverage in limited testing time, each respondent is administered relatively few items in a subject area. A first problem is that the uncertainty associated with individual scale scores is too large to ignore, and estimates of the features of the scale score distribution for groups of students can be seriously biased when individuals' estimated scale scores are used as if they were true scale score values. (The failure of this approach was verified in early analyses of the 1984 NAEP reading survey; see Wingersky, Kaplan, and Beaton 1987.) A second problem, occurring even with test lengths of 60, arises when test forms vary across and within assessments as to the numbers, formats, and content of the test items. The measurement error distributions then differ even if the underlying true scale score distributions do not. As a result, estimated scale score distributions exhibit spurious changes, and apparent differences between population distributions, over time or across groups, may be greater than the actual differences. Although this latter problem is avoided in traditional standardized testing by presenting students with parallel test forms, controlled tightly across time and groups, the same constraints cannot be imposed in the design and data-collection phases of the present NAEP. The NAEP methodology, using population-structure models, was developed as a way to estimate key population features consistently, and to approximate others no worse than standard IRT procedures would, even when item booklet composition, format, and content balances change over time.
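The two problems described above can be illustrated with a small simulation. This is only a sketch under simplifying assumptions, not the NAEP procedure or an IRT estimation routine: true scale scores are drawn from a standard normal distribution, and each estimated score is the true score plus normal measurement error whose standard deviation shrinks with test length. The function names, the `info_per_item` value of 0.2, and the cutoff of 1.0 are all hypothetical choices made for illustration.

```python
import random
import statistics


def simulate_estimated_scores(n_students, n_items, seed=0, info_per_item=0.2):
    """Return (true_scores, estimated_scores) under an additive-error model.

    Assumption (illustrative only): measurement error is normal with
    SD = 1 / sqrt(n_items * info_per_item), a rough stand-in for the
    standard error of an individual score at a given test length.
    """
    rng = random.Random(seed)
    error_sd = 1.0 / (n_items * info_per_item) ** 0.5
    # True scores are drawn first, so the same seed yields the same students.
    true_scores = [rng.gauss(0.0, 1.0) for _ in range(n_students)]
    estimated = [t + rng.gauss(0.0, error_sd) for t in true_scores]
    return true_scores, estimated


def share_above(scores, cutoff):
    """Proportion of scores strictly above the cutoff."""
    return sum(s > cutoff for s in scores) / len(scores)


# Same population of students, measured with a long test and a short test.
true_long, est_long = simulate_estimated_scores(20000, n_items=60)
true_short, est_short = simulate_estimated_scores(20000, n_items=10)

for label, scores in [("true", true_long),
                      ("estimated, 60 items", est_long),
                      ("estimated, 10 items", est_short)]:
    print(label,
          "variance:", round(statistics.pvariance(scores), 3),
          "share above 1.0:", round(share_above(scores, 1.0), 3))
```

Under this model the variance of the estimated scores is approximately the true variance plus the error variance, so the short test overstates both the spread of the distribution and the share of students above the cutoff by much more than the long test does. Because the true scores are identical in the two runs, the gap between the 60-item and 10-item results is entirely an artifact of test length, which is the kind of spurious difference the second problem refers to.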