Weighting, scaling, and plausible values


Before the data were analyzed, responses from the groups of students assessed were assigned sampling weights (as described in the next section) to ensure that their representation in the TIMSS and TIMSS Advanced 2015 results matched their actual percentage of the school population in the grade assessed. With these sampling weights in place, the analyses of TIMSS 2015 data proceeded in two phases: scaling and estimation. During the scaling phase, item response theory (IRT) procedures were used to estimate the measurement characteristics of each assessment question. During the estimation phase, the results of the scaling were used to produce estimates of student achievement. Subsequent conditioning procedures used the background variables collected by TIMSS and TIMSS Advanced in order to limit bias in the achievement results.

Responses from the groups of students were assigned sampling weights to adjust for over- or under-representation of a particular group in the sample. Sampling weights are necessary for computing sound, nationally representative estimates. The weight assigned to a student's responses is the inverse of the probability that the student is selected for the sample. When responses are weighted, none are discarded, and each contributes to the results for the total number of students represented by the individual student assessed. Weighting also adjusts for various situations (such as school and student nonresponse) because data cannot be assumed to be randomly missing. The international weighting procedures do not include a poststratification adjustment. The school nonresponse adjustment cells are a cross-classification of each country's explicit stratification variables. The student nonresponse adjustment cells are the students' classrooms. All TIMSS 1995, 1999, 2003, 2007, 2011, and 2015 analyses are conducted using sampling weights, as are all TIMSS Advanced 1995 and 2015 analyses. A detailed description of this process is provided in Chapter 3 of Methods and Procedures in TIMSS 2015.

For 2015, although the national and Florida samples share schools, the two school samples are not identical; weights are therefore estimated separately for the national and Florida samples.
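The weighting logic can be illustrated with a simplified, single-stage sketch. It is not the operational TIMSS procedure documented in Methods and Procedures in TIMSS 2015; the function names and the one-cell nonresponse adjustment are assumptions made only for illustration.

```python
def base_weight(selection_probability):
    """Sampling weight: the inverse of the probability of selection."""
    if not 0 < selection_probability <= 1:
        raise ValueError("selection probability must be in (0, 1]")
    return 1.0 / selection_probability


def nonresponse_adjusted_weights(weights, responded):
    """Within one adjustment cell, inflate respondents' weights so that
    they also represent the cell's nonrespondents (illustrative only)."""
    total = sum(weights)
    responding_total = sum(w for w, r in zip(weights, responded) if r)
    factor = total / responding_total
    # Only respondents are returned; their weights now sum to the cell total.
    return [w * factor for w, r in zip(weights, responded) if r]


def weighted_mean(values, weights):
    """Weighted mean: each response counts for the students it represents."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)
```

For example, a student selected with probability 0.25 represents four students, so that student's score counts four times as much as the score of a student selected with certainty.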

In TIMSS, the propensity of students to answer questions correctly was estimated with

  • a two-parameter IRT model for dichotomous constructed response items,
  • a three-parameter IRT model for multiple choice response items, and
  • a generalized partial credit IRT model for polytomous constructed response items.
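The three model families listed above can be sketched as response-probability functions. The textbook logistic forms are used here; the symbols (theta for ability, a for discrimination, b for difficulty, c for the lower asymptote, d for category step parameters) and the omission of any scaling constant are assumptions of this sketch, not the exact TIMSS parameterization.

```python
import math


def p_2pl(theta, a, b):
    """Two-parameter logistic model: probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def p_3pl(theta, a, b, c):
    """Three-parameter logistic model: adds a lower asymptote c that
    allows for guessing on multiple choice items."""
    return c + (1.0 - c) * p_2pl(theta, a, b)


def p_gpcm(theta, a, b, steps):
    """Generalized partial credit model.

    `steps` holds one step parameter per score category above zero.
    Returns a list of probabilities for score categories 0..len(steps).
    """
    numerators = [1.0]  # category 0: exp(0)
    cumulative = 0.0
    for d in steps:
        cumulative += a * (theta - b + d)
        numerators.append(math.exp(cumulative))
    total = sum(numerators)
    return [n / total for n in numerators]
```

With a single step parameter of zero, the generalized partial credit model reduces to the two-parameter model, which is why dichotomous and polytomous constructed response items can share one scale.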

The scale scores assigned to each student were estimated using a procedure described below in the “Plausible values” section, with input from the IRT results.

With IRT, the difficulty of each item, or item category, is deduced using information about how likely it is for students to get some items correct (or to get a higher rating on a constructed response item) versus other items. Once the parameters of each item are determined, the ability of each student can be estimated even when different students have been administered different items. At this point in the estimation process, achievement scores are expressed on a standardized logit scale that ranges from -4 to +4. To make the scores more meaningful and easier to interpret, the scores for the first year (1995) were transformed to a scale with a mean of 500 and a standard deviation of 100. Subsequent waves of assessment are linked to this metric (as described below).
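The move from the logit scale to the reporting metric is a linear rescaling. A minimal sketch follows, assuming the mean and standard deviation are computed with sampling weights over the reference population; the function name and arguments are hypothetical.

```python
def to_reporting_scale(logits, weights, target_mean=500.0, target_sd=100.0):
    """Linearly rescale logit-scale scores so that their weighted mean and
    weighted standard deviation equal the target values."""
    total = sum(weights)
    mean = sum(w * x for x, w in zip(logits, weights)) / total
    variance = sum(w * (x - mean) ** 2 for x, w in zip(logits, weights)) / total
    sd = variance ** 0.5
    return [target_mean + target_sd * (x - mean) / sd for x in logits]
```

Because the transformation is linear, it changes only the units of the scale; the ordering of students and the relative distances between them are unchanged.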

To make scores from the second (1999) wave of TIMSS data comparable to the first (1995) wave, two steps were necessary. First, the 1995 and 1999 data for countries and education systems that participated in both years were scaled together to estimate item parameters. Ability estimates for all students (those assessed in 1995 and those assessed in 1999) based on the new item parameters were then estimated. To put these jointly calibrated 1995 and 1999 scores on the 1995 metric, a linear transformation was applied such that the jointly calibrated 1995 scores have the same mean and standard deviation as the original 1995 scores. Such a transformation also preserves any differences in average scores between the 1995 and 1999 waves of assessment.
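The second step, placing jointly calibrated scores back on the earlier metric, can be sketched as follows. The function names are hypothetical, and the sketch assumes weighted means and standard deviations; the same slope and intercept are applied to both waves, which is what keeps the between-wave difference intact.

```python
def weighted_mean_sd(values, weights):
    """Weighted mean and (population-form) weighted standard deviation."""
    total = sum(weights)
    mean = sum(w * v for v, w in zip(values, weights)) / total
    variance = sum(w * (v - mean) ** 2 for v, w in zip(values, weights)) / total
    return mean, variance ** 0.5


def linking_constants(joint_prev, weights_prev, original_mean, original_sd):
    """Slope and intercept that give the jointly calibrated scores of the
    earlier wave the same mean and SD as that wave's original scores."""
    joint_mean, joint_sd = weighted_mean_sd(joint_prev, weights_prev)
    slope = original_sd / joint_sd
    intercept = original_mean - slope * joint_mean
    return slope, intercept


def apply_link(values, slope, intercept):
    """Apply the same linear transformation to any wave's joint scores."""
    return [slope * v + intercept for v in values]
```

In use, the constants would be derived from the earlier wave's jointly calibrated scores and then applied unchanged to the later wave's scores.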

In order for scores resulting from subsequent waves of assessment (2003, 2007, 2011, and 2015) to be made comparable to 1995 scores (and to each other), the two steps above are applied sequentially for each pair of adjacent waves of data: two adjacent years of data are jointly scaled, then the resulting ability estimates are linearly transformed so that the mean and standard deviation of the prior year are preserved. As a result, the transformed 2015 scores are comparable to all previous waves of the assessment, and longitudinal comparisons between all waves of data are meaningful.

To facilitate the joint calibration of scores from adjacent years of assessment, common test items are included in successive administrations. This also enables the comparison of item parameters (difficulty and discrimination) across administrations. If an item's parameters change dramatically across administrations, the item is dropped from the current assessment so that scales can be more accurately linked across years. In this way, even if the average ability levels of students in countries and education systems participating in TIMSS change over time, the scales can still be linked across administrations.

Scaling for TIMSS Advanced follows a similar process, using data from the 1995, 2008, and 2015 administrations.

Plausible values

To keep student burden to a minimum, TIMSS and TIMSS Advanced purposefully administered a limited number of assessment items to each student—too few to produce accurate individual content-related scale scores for each student. The number of assessment items administered to each student, however, is sufficient to produce accurate group content-related scale scores for subgroups of the population. These scores are transformed during the scaling process into plausible values to characterize students participating in the assessment, given their background characteristics. Plausible values are imputed values and not test scores for individuals in the usual sense. If used individually, they provide biased estimates of the proficiencies of individual students. However, when grouped as intended, plausible values provide unbiased estimates of population characteristics (e.g., means and variances for groups).

Plausible values represent what the performance of an individual on the entire assessment might have been, had it been observed. They are estimated as random draws (usually five) from an empirically derived distribution of score values based on the student's observed responses to assessment items and on background variables. Each random draw from the distribution is considered a representative value from the distribution of potential scale scores for all students in the sample who have similar background characteristics and similar patterns of item responses. Differences between plausible values drawn for a single individual quantify the degree of error (the width of the spread) in the underlying distribution of possible scale scores that could have caused the observed performances.
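One common way to use plausible values "as intended" is to compute the statistic of interest once per plausible value and then combine the results. The sketch below does this for a weighted mean, following the standard multiple-imputation combining rules. It is an illustration rather than the TIMSS analysis software: the function names are hypothetical, and the sampling-variance component, which a full standard error also requires, is deliberately left out.

```python
def weighted_mean(values, weights):
    """Weighted mean of one plausible value across students."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def combine_plausible_values(pv_columns, weights):
    """Combine a weighted mean across M plausible values.

    pv_columns: list of M lists; each inner list holds one plausible
                value for every student, in the same student order.
    Returns (point_estimate, imputation_variance).
    """
    m = len(pv_columns)
    if m < 2:
        raise ValueError("at least two plausible values are required")
    estimates = [weighted_mean(column, weights) for column in pv_columns]
    # Point estimate: the average of the M per-value estimates.
    point = sum(estimates) / m
    # Between-imputation variance of the M estimates.
    between = sum((e - point) ** 2 for e in estimates) / (m - 1)
    # Imputation variance component, inflated by (1 + 1/M).
    return point, (1.0 + 1.0 / m) * between
```

Averaging the plausible values within each student first and then analyzing that single average would understate the variability, which is why the statistic is computed separately for each plausible value.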

An accessible treatment of the derivation and use of plausible values can be found in Beaton and González (1995)¹⁰. More detailed information can be found in Methods and Procedures in TIMSS 2015 and Methods and Procedures in TIMSS Advanced 2015.

10 Beaton, A.E., and González, E. (1995). The NAEP Primer. Chestnut Hill, MA: Boston College.