The usual practice in testing is to derive population statistics (such as an average score or the percent of students who surpass a standard) from individual test scores. This is a valid approach when the individual test scores are based on enough items to estimate each score precisely and all test forms are the same or parallel in form. In that case, the test scores are known first, and the population values are derived from them. In contrast, NAEP derives its population values directly from the responses to each question answered by a representative sample of students, without ever calculating individual test scores. For NAEP, the population values are known first. The plausible values can then be processed to retrieve the estimates of score distributions by population characteristics that were obtained in the marginal maximum likelihood analysis for population groups. Plausible values can be thought of as a mechanism for accounting for the fact that the true scale scores describing the underlying performance θ for each student are unknown.
The key idea lies in the contrast between the plausible values and the more familiar estimates of individual scale scores that are in some sense optimal for each examinee. Point estimates that are optimal for individual students have distributions that can produce decidedly non-optimal estimates of population characteristics (Little and Rubin 1983). Plausible values, on the other hand, are constructed explicitly to provide valid estimates of population effects. The twenty sets of plausible values are not test scores for individuals in the usual sense, not only because they represent a distribution of possible scores (rather than a single point), but also because they apply to students taken as representative of the measured population groups to which they belong (and thus reflect the performance of more students than only themselves). These distributional draws from the predictive conditional distributions are offered only as intermediary computations for calculating estimates of population characteristics. Using averages of the twenty plausible values attached to a student's file is inadequate to calculate group summary statistics such as proportions above a certain level or to determine whether group means differ from one another. For further discussion see Mislevy, Beaton, Kaplan, and Sheehan (1992).
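The contrast between the two ways of using plausible values can be illustrated with a small simulation. The sketch below is not NAEP's estimation procedure. It assumes a toy model in which θ is standard normal and each student has one noisy observed score, so the posterior distribution is normal and plausible values can be drawn directly. The sample size, cut score, and measurement-error value are hypothetical, and sampling weights are ignored.

```python
import random
import statistics

random.seed(0)

M = 20         # plausible values per student, as in the text
N = 10_000     # hypothetical number of sampled students
CUT = 1.0      # hypothetical cut score on the theta scale
ERR_SD = 1.0   # assumed measurement-error SD (a short test)

# Assumed toy model: theta ~ N(0, 1), observed x = theta + error.
# The posterior of theta given x is normal with mean shrink * x
# and standard deviation post_sd.
shrink = 1.0 / (1.0 + ERR_SD ** 2)
post_sd = (ERR_SD ** 2 * shrink) ** 0.5

theta = [random.gauss(0.0, 1.0) for _ in range(N)]
x = [t + random.gauss(0.0, ERR_SD) for t in theta]

# pv[i][m] is the m-th plausible value for student i.
pv = [[random.gauss(shrink * xi, post_sd) for _ in range(M)] for xi in x]


def share_above(values, cut):
    """Proportion of values strictly above the cut score."""
    return sum(v > cut for v in values) / len(values)


true_share = share_above(theta, CUT)

# Pattern described in the text: compute the statistic once per
# plausible-value set, then average the M results.
per_set = [share_above([row[m] for row in pv], CUT) for m in range(M)]
pv_share = statistics.fmean(per_set)

# Inadequate pattern: average each student's plausible values first,
# then compute the statistic on those averages.
student_means = [statistics.fmean(row) for row in pv]
avg_share = share_above(student_means, CUT)

print(f"share above cut, true theta:         {true_share:.3f}")
print(f"share above cut, per-set then mean:  {pv_share:.3f}")
print(f"share above cut, averaged PVs:       {avg_share:.3f}")
```

Under these assumptions the per-set estimate should land close to the share computed from the true θ values, while the estimate based on averaged plausible values falls noticeably below it. Averaging pulls each student's value toward the center of the distribution, so too few students appear above the cut score.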
The use of plausible values and the large number of student group variables that are included in the population-structure models in NAEP allow a large number of secondary analyses to be carried out with little or no bias, and mitigate biases in analyses of the marginal distributions of θ for variables not in the model (see Potential Bias in Analysis Results Using Variables Not Included in the Model).
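The role of the student group variables in the model can be sketched in the same spirit. The example below is an illustration under stated assumptions, not the NAEP population-structure model: it uses a normal toy model with known parameters, a single hypothetical two-level group variable, and a hypothetical true gap of 0.5 between group means. Plausible values are drawn twice, once from a posterior that conditions on the group variable and once from a posterior that omits it and treats the pooled population as normal.

```python
import random
import statistics

random.seed(1)

M = 20         # plausible values per student, as in the text
N = 10_000     # hypothetical number of sampled students
ERR_SD = 1.0   # assumed measurement-error SD
GAP = 0.5      # assumed true difference between the two group means


def draw_pvs(x, prior_mean, prior_var):
    """Draw M plausible values from the normal posterior of theta given x."""
    k = prior_var / (prior_var + ERR_SD ** 2)   # shrinkage factor
    post_mean = prior_mean + k * (x - prior_mean)
    post_sd = (k * ERR_SD ** 2) ** 0.5
    return [random.gauss(post_mean, post_sd) for _ in range(M)]


def mean_gap(pvs, groups):
    """Group-mean difference computed per plausible-value set, then averaged."""
    gaps = []
    for m in range(M):
        g1 = [row[m] for row, g in zip(pvs, groups) if g == 1]
        g0 = [row[m] for row, g in zip(pvs, groups) if g == 0]
        gaps.append(statistics.fmean(g1) - statistics.fmean(g0))
    return statistics.fmean(gaps)


groups = [random.randint(0, 1) for _ in range(N)]
theta = [GAP * g + random.gauss(0.0, 1.0) for g in groups]
x = [t + random.gauss(0.0, ERR_SD) for t in theta]

# Group variable included: the prior mean depends on the student's group.
pvs_in = [draw_pvs(xi, GAP * g, 1.0) for xi, g in zip(x, groups)]

# Group variable omitted: one pooled prior for everyone, using the
# mean and variance of the equal mixture of the two groups.
pooled_mean = GAP / 2
pooled_var = 1.0 + GAP ** 2 / 4
pvs_out = [draw_pvs(xi, pooled_mean, pooled_var) for xi in x]

gap_in = mean_gap(pvs_in, groups)
gap_out = mean_gap(pvs_out, groups)

print(f"assumed true gap:              {GAP:.3f}")
print(f"gap, group variable in model:  {gap_in:.3f}")
print(f"gap, group variable omitted:   {gap_out:.3f}")
```

In this toy setting the gap is recovered when the group variable is part of the model and is attenuated, by roughly the shrinkage factor, when it is omitted. The size of the attenuation here depends entirely on the assumed measurement error and should not be read as a statement about NAEP data.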