The usual practice in testing is to derive population statistics (such as an average score or the percentage of students who surpass a standard) from individual test scores. This is a valid approach when the individual test scores are based on enough items to estimate each score precisely and all test forms are identical or parallel. In that case, the test scores are known first, and the population values are derived from them. In contrast, NAEP derives its population values directly from the responses to each question answered by a representative sample of students, without ever calculating individual test scores. For NAEP, the population values are known first; the plausible values are then derived from them and used to calculate values of interest.
The five sets of plausible values are not test scores for individuals in the usual sense. These distributional draws are offered only as intermediary computations for calculating estimates of population characteristics.
When the underlying models are correctly specified, the plausible values will provide valid estimates of population characteristics, even though they are not generally valid estimates of the proficiencies of the individuals with whom they are associated. When the underlying models are incorrectly specified, the plausible values do not necessarily provide valid estimates of population characteristics. For instance, when a group-defining variable is not included in the population-structure model, mean scores for the groups defined by that variable, based on plausible values, may or may not be good estimates of the group means. Such an omission is very unusual in NAEP because the variables that define groups of interest are included in NAEP models.
The key idea lies in the contrast between the plausible values and the more familiar estimates of individual scale scores that are in some sense optimal for each examinee. Point estimates that are optimal for individual students have distributions that can produce decidedly nonoptimal estimates of population characteristics (Little and Rubin 1983). Plausible values, on the other hand, are constructed explicitly to provide valid estimates of population effects. Note that appropriate point estimates of individual scale scores cannot be calculated by averaging the five plausible values attached to a student's file. Nor are such per-student averages adequate for calculating group summary statistics, such as proportions above a certain level, or for determining whether group means differ from one another. For further discussion, see Mislevy, Beaton, Kaplan, and Sheehan (1992).
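The contrast between the two ways of using plausible values can be illustrated with a small simulation. The following Python sketch is not NAEP's estimation procedure. It uses made-up toy data in which each student's five plausible values are random draws around a student-specific center, and every function name, the scale values, and the cut score of 300 are assumptions chosen only for illustration. It compares computing a percentage once per set of plausible values and then averaging the five results with the approach the text warns against, which averages each student's five values first. Sampling weights and variance estimation are omitted for brevity.

```python
import random
import statistics


def simulate_plausible_values(n_students=2000, n_pv=5, seed=0):
    """Toy data (assumption): each student gets n_pv draws from a normal
    'posterior' with a student-specific center and a common spread."""
    rng = random.Random(seed)
    students = []
    for _ in range(n_students):
        center = rng.gauss(250.0, 25.0)  # arbitrary illustrative scale
        students.append([rng.gauss(center, 20.0) for _ in range(n_pv)])
    return students


def pct_at_or_above(scores, cut):
    """Percentage of values in `scores` at or above `cut`."""
    return 100.0 * sum(s >= cut for s in scores) / len(scores)


def estimate_per_set_then_average(students, cut):
    """Compute the statistic once per plausible-value set, then average
    the resulting estimates across the sets."""
    n_pv = len(students[0])
    per_set = [
        pct_at_or_above([pvs[m] for pvs in students], cut)
        for m in range(n_pv)
    ]
    return statistics.mean(per_set), per_set


def estimate_from_student_averages(students, cut):
    """The approach the text warns against: average each student's
    plausible values first, then compute the statistic."""
    return pct_at_or_above([statistics.mean(pvs) for pvs in students], cut)


if __name__ == "__main__":
    data = simulate_plausible_values()
    combined, per_set = estimate_per_set_then_average(data, cut=300.0)
    averaged_first = estimate_from_student_averages(data, cut=300.0)
    print("per-set estimates:", [round(p, 2) for p in per_set])
    print("average of per-set estimates:", round(combined, 2))
    print("from per-student averages:", round(averaged_first, 2))
```

In this toy setup, averaging within a student shrinks the spread of the resulting values, so the percentage at or above a high cut score tends to come out lower than the estimate obtained by treating each set of plausible values separately.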