2009 Spotlight

U.S. Performance Across International Assessments of Student Achievement

Technical Notes: A.6 Weighting and scaling

Before the data are analyzed, responses from students are assigned sampling weights to ensure that the representation of each subgroup of assessed students (e.g., public/private, census region, urban/suburban/rural, race/ethnicity) matches the actual percentage of that subgroup among the school population of the target grade or age. The use of sampling weights is necessary for the computation of sound, nationally representative estimates. The basic weight assigned to a student's responses is the inverse of the probability that the student would be selected for the sample. Adjustments to weights are also made by the international consortium for various situations (such as school and student nonresponse) because data cannot be assumed to be randomly missing. (NCES may conduct a nonresponse bias analysis after these adjustments are made to see how much bias still exists, compared with the original sample frame. For more details, see A.11.)
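The weighting logic described above can be illustrated with a minimal sketch. It assumes a single adjustment cell and a simple ratio adjustment for nonresponse; the function names and the sample values are hypothetical, and the adjustments applied by the international consortia are more elaborate than what is shown here.

```python
def base_weight(selection_probability):
    """Base sampling weight: the inverse of the selection probability."""
    if not 0 < selection_probability <= 1:
        raise ValueError("selection probability must be in (0, 1]")
    return 1.0 / selection_probability


def nonresponse_adjusted_weights(base_weights, responded):
    """Redistribute the weight of nonrespondents to respondents.

    Assumption: all students belong to one adjustment cell and a simple
    ratio adjustment is used. This is an illustration only, not the
    operational procedure of any of the studies.
    """
    total = sum(base_weights)
    responding_total = sum(w for w, r in zip(base_weights, responded) if r)
    if responding_total == 0:
        raise ValueError("no respondents in the adjustment cell")
    factor = total / responding_total
    return [w * factor if r else 0.0 for w, r in zip(base_weights, responded)]


# Hypothetical cell of four sampled students; the second did not respond.
selection_probabilities = [0.01, 0.01, 0.02, 0.02]
weights = [base_weight(p) for p in selection_probabilities]
adjusted = nonresponse_adjusted_weights(weights, [True, False, True, True])
print(weights)
print(adjusted)
```

In this sketch the adjusted weights of the three respondents sum to the same total as the base weights of all four sampled students, so the cell continues to represent the same number of students in the population.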

Once these sampling weights are in place, item response theory (IRT) procedures are used to estimate the difficulty of each item from the pattern of correct and incorrect responses, that is, from how likely students are to answer that item correctly relative to other items. Once the difficulty of each item is determined, the items are assigned a value on a standardized logit scale of item difficulty. Scaling items in this way makes it possible for the ability of groups of students to be estimated or scored, even though not all students were administered the same items.
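To illustrate why a common logit scale allows comparison across different sets of items, the following sketch uses a one-parameter (Rasch) IRT model. This choice is an assumption made for simplicity; the operational models differ by study and may include additional item parameters. The booklet contents and difficulty values are hypothetical.

```python
import math


def p_correct(ability, difficulty):
    """Rasch model: probability of a correct response.

    Both ability and difficulty are expressed in logits.
    """
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def expected_number_correct(ability, item_difficulties):
    """Expected raw score on a set of items for a given ability."""
    return sum(p_correct(ability, b) for b in item_difficulties)


# Two hypothetical booklets with different items; booklet B is harder.
booklet_a = [-1.0, 0.0, 1.0]
booklet_b = [0.5, 1.0, 1.5]

ability = 0.0  # the same ability, in logits, for both booklets
expected_a = expected_number_correct(ability, booklet_a)
expected_b = expected_number_correct(ability, booklet_b)
print(expected_a, expected_b)
```

The same ability yields a lower expected raw score on the harder booklet. Because item difficulties are on a common scale, the model can account for this difference, which is why raw scores from different booklets need not be compared directly.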

Scale scores

In order to make the estimated scores more meaningful and to facilitate their interpretation, the scores are transformed to a new scale with a mean of 500 and a standard deviation of 100. These scale scores are the scores reported in PIRLS, PISA, and TIMSS reports and throughout this special analysis. Strictly speaking, scale scores are specific to a given assessment and cannot be compared across assessments even within the same study. However, statistical equating procedures are commonly employed to allow comparisons over time between assessments within a study.
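The transformation to a scale with a mean of 500 and a standard deviation of 100 is a linear one, and can be sketched as follows. The sketch assumes that the mean and standard deviation are computed, with sampling weights, from the same group of scores being transformed; in practice the transformation constants are fixed from a reference population when a scale is first established. All names and values are hypothetical.

```python
import math


def weighted_mean_sd(values, weights):
    """Weighted mean and weighted (population) standard deviation."""
    total = sum(weights)
    mean = sum(w * v for v, w in zip(values, weights)) / total
    variance = sum(w * (v - mean) ** 2 for v, w in zip(values, weights)) / total
    return mean, math.sqrt(variance)


def to_scale_scores(logit_scores, weights, target_mean=500.0, target_sd=100.0):
    """Linearly transform logit scores to the reporting scale.

    Assumption: the reference mean and standard deviation are taken from
    the scores passed in, rather than from a fixed base-year population.
    """
    mean, sd = weighted_mean_sd(logit_scores, weights)
    if sd == 0:
        raise ValueError("scores have no spread; cannot standardize")
    return [target_mean + target_sd * (x - mean) / sd for x in logit_scores]


# Hypothetical logit scores and sampling weights for four students.
logit_scores = [-1.2, -0.3, 0.4, 1.5]
weights = [150.0, 75.0, 75.0, 100.0]
scale_scores = to_scale_scores(logit_scores, weights)
print(scale_scores)
```

Because the transformation is linear, it changes only the units of the scale; the ordering of scores and their relative distances are preserved.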

For example, the scales from TIMSS 1999 (the scales established for each subject and grade in 1999) were statistically equated with the scales from TIMSS 1995 (the scales established for each subject and grade in 1995) so that the TIMSS 1999 results could be placed on the TIMSS 1995 scales. The scales of each subsequent TIMSS assessment, in turn, have been statistically equated with the 1995 scale for the respective subject and grade. Thus, a TIMSS 8th-grade mathematics score of 500 in 1995, for instance, is equivalent to a TIMSS 8th-grade mathematics score of 500 in 2007.

In PISA, the three subject matter scales were developed successively in the year that each subject was first assessed in depth as the major subject matter domain (i.e., reading in 2000, mathematics in 2003, and science in 2006), and all subsequent assessment scales have been statistically equated with those scales. This is to say, PISA established a reading scale in 2000 and placed PISA 2003 and PISA 2006 reading results on the same scale; PISA established a mathematics scale in 2003 and placed PISA 2006 mathematics results on the same scale; and PISA established a science scale in 2006 and will place future PISA science results on the same scale. Thus, a PISA reading score of 500 in 2000, for instance, is equivalent to a PISA reading score of 500 in 2006, but a PISA mathematics score from 2000 cannot be equated with a PISA mathematics score from 2003 or 2006.

It is also important to keep in mind that the procedures used to determine scale scores were developed to produce accurate assessment results for groups of students while limiting the testing burden on individual students. They are not intended to produce assessment results for individual students. However, the procedures to determine scale scores provide data that can be readily used in secondary analyses conducted at the student level.

Specifically, during the scaling process, plausible values are estimated to characterize the performance of individual students participating in the assessment. Plausible values are imputed values and not test scores for individuals in the usual sense. In fact, they are biased estimates of the proficiencies of individual students. Plausible values do, however, provide unbiased estimates of population characteristics. Plausible values represent what the true performance of an individual might have been, had it been observed. They are estimated as random draws (usually five) from an empirically derived distribution of score values based on the student's observed responses to assessment items and on background variables. Each random draw from the distribution is considered a representative value from the distribution of potential scale scores for all students in the sample who have similar characteristics and identical patterns of item responses. Differences between the plausible values quantify the degree of precision (the width of the spread) in the underlying distribution of possible scale scores that could have caused the observed performances.
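As an illustration of how plausible values are typically used to estimate a population characteristic, the sketch below computes a weighted mean separately for each plausible value and then averages the results. It also computes the variance component that arises from differences between the plausible values, following the standard multiple-imputation combining rules. This is a simplified sketch under stated assumptions: the sampling variance component, which the studies estimate with replication methods, is omitted, and all names and data are hypothetical.

```python
def weighted_mean(values, weights):
    """Weighted mean of one set of plausible values."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def combine_plausible_values(pv_columns, weights):
    """Combine a statistic across plausible values.

    pv_columns: one list per plausible value, each holding one value
    per student. Returns the point estimate and the imputation variance.
    Assumption: the sampling variance is not computed here; the total
    variance would be the sampling variance plus the value returned.
    """
    m = len(pv_columns)
    if m < 2:
        raise ValueError("at least two plausible values are required")
    estimates = [weighted_mean(column, weights) for column in pv_columns]
    point_estimate = sum(estimates) / m
    between = sum((e - point_estimate) ** 2 for e in estimates) / (m - 1)
    imputation_variance = (1 + 1 / m) * between
    return point_estimate, imputation_variance


# Hypothetical data: two students, five plausible values each.
weights = [1.0, 1.0]
pv_columns = [
    [490.0, 510.0],
    [500.0, 520.0],
    [480.0, 500.0],
    [495.0, 515.0],
    [485.0, 505.0],
]
estimate, imputation_variance = combine_plausible_values(pv_columns, weights)
print(estimate, imputation_variance)
```

The statistic is computed once per plausible value and the results are then averaged. Averaging the plausible values for each student first and then analyzing that average would understate the variability of the resulting estimates.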

An accessible treatment of the derivation and use of plausible values can be found in Beaton and González (1995). A more technical treatment can be found in the TIMSS 2007 Technical Report (Olson, Martin, and Mullis 2008).