Weighting, Scaling, and Plausible Values

Before the data were analyzed, responses from the groups of students assessed were assigned sampling weights (as described in the next section) to ensure that their representation in the PIRLS and ePIRLS 2016 results matched their actual percentage of the school population in grade 4. With these sampling weights in place, the analyses of PIRLS and ePIRLS 2016 achievement data proceeded in two phases: scaling and estimation. During the scaling phase, item response theory (IRT) procedures were used to estimate the measurement characteristics of each assessment item. During the estimation phase, the results of the scaling were used to produce estimates of student achievement. Subsequent conditioning procedures used the background variables collected by the questionnaires in order to limit bias in the achievement results. Achievement scales are represented as multiple plausible values in the PIRLS and ePIRLS data files, as described further below.

Weighting

Responses from the groups of students were assigned sampling weights to adjust for over- or under-representation during the sampling of a particular group. The use of sampling weights is necessary for the computation of sound, nationally representative estimates. The weight assigned to a student's responses is the inverse of the probability that the student is selected for the sample. When responses are weighted, none are discarded, and each contributes to the results for the total number of students represented by the individual student assessed. Weighting also adjusts for various situations (such as school and student nonresponse) because data cannot be assumed to be randomly missing. The international weighting procedures do not include a poststratification adjustment. The school nonresponse adjustment cells are a cross-classification of each country's explicit stratification variables. The student nonresponse adjustment cells are the students' classrooms. All PIRLS and ePIRLS analyses are conducted using sampling weights. A detailed description of the process of creating weights is provided in Methods and Procedures in PIRLS 2016 at https://timssandpirls.bc.edu/publications/pirls/2016-methods.html.
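
The inverse-probability rule described above can be illustrated with a minimal sketch. The function names, scores, and selection probabilities are hypothetical and chosen only for illustration; the operational PIRLS weights also include the nonresponse adjustments, which are omitted here.

```python
def sampling_weight(selection_probability):
    """Base weight: the inverse of the student's probability of selection."""
    if not 0.0 < selection_probability <= 1.0:
        raise ValueError("selection probability must be in (0, 1]")
    return 1.0 / selection_probability


def weighted_mean(values, weights):
    """Mean in which each response counts for the students it represents."""
    total_weight = sum(weights)
    return sum(v * w for v, w in zip(values, weights)) / total_weight


# Hypothetical scores and selection probabilities for three students.
scores = [480.0, 520.0, 560.0]
selection_probabilities = [0.01, 0.02, 0.05]

weights = [sampling_weight(p) for p in selection_probabilities]
estimate = weighted_mean(scores, weights)
unweighted = sum(scores) / len(scores)
```

In this sketch the student with the smallest selection probability stands in for the most students, so the weighted estimate is pulled toward that student's score relative to the unweighted mean.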

Scaling

In PIRLS and ePIRLS, the propensity of students to answer questions correctly was estimated with

a three-parameter IRT model for multiple choice response items,
a two-parameter IRT model for dichotomous constructed response items (i.e., those with only two response options, scored as either correct or incorrect), and
a partial credit IRT model for polytomous constructed response items (i.e., those with more than two response options, scored as earned points).
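
The three model families listed above can be sketched as response-probability functions. This is a minimal illustration using one common textbook parameterization; the symbols (theta for ability, a for discrimination, b for difficulty, c for the lower asymptote, and per-step difficulties) and the exact functional forms are assumptions, not taken from the PIRLS documentation.

```python
import math


def p_3pl(theta, a, b, c):
    """Three-parameter model: probability of a correct multiple choice response.

    c is the lower asymptote (the chance of success at very low ability).
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


def p_2pl(theta, a, b):
    """Two-parameter model: the three-parameter model with no lower asymptote."""
    return p_3pl(theta, a, b, 0.0)


def p_partial_credit(theta, a, steps):
    """Partial credit model: probabilities of earning 0..len(steps) points.

    steps[j] is the (assumed) difficulty of moving from j points to j + 1.
    """
    # The exponent for k points is the sum of a * (theta - step) over the
    # first k steps; the 0-point category has an empty sum.
    exponents = [0.0]
    for step in steps:
        exponents.append(exponents[-1] + a * (theta - step))
    largest = max(exponents)  # subtract the maximum for numerical stability
    numerators = [math.exp(e - largest) for e in exponents]
    total = sum(numerators)
    return [n / total for n in numerators]


# Hypothetical parameters for a three-category (0, 1, or 2 point) item.
category_probabilities = p_partial_credit(theta=0.3, a=1.1, steps=[-0.5, 0.7])
```

With a single step, the partial credit function reduces to the two-parameter model, which is one way to see that the three models belong to the same family.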

The scale scores assigned to each student were estimated using a procedure described below in the "Plausible values" section, with input from the IRT results.

With IRT, the difficulty of each item, or item category, is deduced using information about how likely it is for students to get some items correct (or to get a higher rating on a constructed response item) versus other items. Once the parameters of each item are determined, the ability of each student can be estimated even when different students have been administered different items. At this point in the estimation process, achievement scores are expressed in a standardized logit scale that ranges from -4 to +4. In order to make the scores more meaningful and to facilitate their interpretation, the scores for the first administration of PIRLS (in 2001) were transformed to a scale with a mean of 500 and a standard deviation of 100. Subsequent waves of the assessment are linked to this scale.
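
The point that ability can be estimated even when students take different items can be illustrated with a small sketch. It uses a two-parameter model and a simple grid search for the maximum-likelihood ability on the -4 to +4 logit range. The item identifiers, item parameters, and responses are hypothetical, and this is not the operational PIRLS procedure, which reports plausible values rather than individual point estimates.

```python
import math


def p_correct(theta, a, b):
    """Two-parameter probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def log_likelihood(theta, responses, items):
    """Log-likelihood of one student's scored responses at ability theta."""
    total = 0.0
    for item_id, score in responses.items():
        a, b = items[item_id]
        p = p_correct(theta, a, b)
        total += math.log(p) if score == 1 else math.log(1.0 - p)
    return total


def estimate_ability(responses, items, low=-4.0, high=4.0, points=801):
    """Grid-search maximum-likelihood ability estimate on [low, high]."""
    grid = [low + (high - low) * i / (points - 1) for i in range(points)]
    return max(grid, key=lambda t: log_likelihood(t, responses, items))


# Hypothetical calibrated items: item_id -> (discrimination, difficulty).
items = {"A": (1.0, -1.0), "B": (1.2, 0.0), "C": (0.8, 1.0), "D": (1.0, 0.5)}

# The two students were administered different, overlapping sets of items.
student_1 = {"A": 1, "B": 1, "C": 0}
student_2 = {"B": 0, "C": 0, "D": 1}

theta_1 = estimate_ability(student_1, items)
theta_2 = estimate_ability(student_2, items)
```

Because every item has already been placed on a common scale, the two estimates are directly comparable even though the students did not answer the same questions.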

To make scores from the second (2006) wave of PIRLS data comparable to the first (2001) wave, two steps were necessary. First, the 2001 and 2006 data for countries and education systems that participated in both years were scaled together to estimate item parameters, and ability estimates for all students (those assessed in 2001 and those assessed in 2006) were then computed from the new item parameters. Second, to put these jointly calibrated scores on the 2001 metric, a linear transformation was applied such that the jointly calibrated 2001 scores have the same mean and standard deviation as the original 2001 scores. Because the same transformation is applied to both waves, it also preserves any differences in average scores between the 2001 and 2006 waves of assessment.
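
The linear transformation step can be sketched as follows. The scores are hypothetical, the target mean and standard deviation of 500 and 100 are used only as illustrative values for the original 2001 metric, and sampling weights are left out for brevity; an operational version would use weighted means and standard deviations.

```python
from statistics import fmean, pstdev


def linking_constants(joint_prior_scores, original_mean, original_sd):
    """Slope and intercept mapping jointly calibrated scores to the prior metric.

    The constants are chosen so that the prior-wave scores from the joint
    calibration reproduce the prior wave's original mean and standard deviation.
    """
    slope = original_sd / pstdev(joint_prior_scores)
    intercept = original_mean - slope * fmean(joint_prior_scores)
    return slope, intercept


def apply_link(scores, slope, intercept):
    """Apply the same linear transformation to any wave's scores."""
    return [slope * s + intercept for s in scores]


# Hypothetical jointly calibrated ability estimates (logit scale).
joint_2001 = [-1.2, -0.3, 0.1, 0.6, 1.4]
joint_2006 = [-0.9, 0.0, 0.4, 0.9, 1.6]

slope, intercept = linking_constants(joint_2001, original_mean=500.0, original_sd=100.0)
linked_2001 = apply_link(joint_2001, slope, intercept)
linked_2006 = apply_link(joint_2006, slope, intercept)
```

Since both waves pass through the same slope and intercept, the gap between the 2006 and 2001 means is carried over to the reporting metric, rescaled only by the slope.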

In order for scores resulting from subsequent waves of assessment (2011 and 2016) to be made comparable to 2001 scores (and to each other), the two steps above are applied sequentially for each pair of adjacent waves of data: two adjacent years of data are jointly scaled, then the resulting ability estimates are linearly transformed so that the mean and standard deviation of the prior year are preserved. As a result, the transformed 2016 scores are comparable to all previous waves of the assessment and longitudinal comparisons between all waves of data are meaningful.

To facilitate the joint calibration of scores from adjacent years of assessment, common test items are included in successive administrations. This also enables the comparison of item parameters (difficulty and discrimination) across administrations. Items whose parameters change dramatically across administrations are dropped from the assessment so that scales can be more accurately linked across years. In this way, even if the average ability levels of students in countries and education systems participating in PIRLS change over time, the scales still can be linked across administrations.
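
The comparison of common-item parameters across administrations can be sketched as a simple screening step. The item identifiers, parameter values, and the 0.5 threshold are all hypothetical; the text does not say how large a change counts as dramatic, so the cutoff here is only a placeholder.

```python
def flag_drifting_items(params_prev, params_curr, max_shift=0.5):
    """Return common items whose discrimination or difficulty moved too far.

    params_prev and params_curr map item_id -> (discrimination, difficulty).
    max_shift is an assumed, illustrative threshold, not a PIRLS criterion.
    """
    flagged = []
    # Only items administered in both years can be compared.
    for item_id in sorted(params_prev.keys() & params_curr.keys()):
        a_prev, b_prev = params_prev[item_id]
        a_curr, b_curr = params_curr[item_id]
        if abs(a_curr - a_prev) > max_shift or abs(b_curr - b_prev) > max_shift:
            flagged.append(item_id)
    return flagged


# Hypothetical item parameters from two adjacent administrations.
params_2011 = {"R01": (1.0, -0.2), "R02": (0.9, 0.4), "R03": (1.1, 1.0)}
params_2016 = {"R01": (1.05, -0.1), "R02": (0.9, 1.3), "R04": (1.0, 0.0)}

drifting = flag_drifting_items(params_2011, params_2016)
```

Items that appear in only one administration are ignored by the comparison, since there is no earlier or later parameter to compare against.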

Plausible values

To keep student burden to a minimum, PIRLS and ePIRLS purposefully administered only a subset of the assessment items to each student, too few to produce accurate individual content-related scale scores for each student. The number of assessment items administered to each student, however, is sufficient to produce accurate group scale scores for subgroups of the population. These scores are transformed during the scaling process into plausible values (multiple values for each student) to characterize students participating in the assessment, given their background characteristics. Plausible values are imputed values and not test scores for individuals in the usual sense. If used individually, they provide biased estimates of the proficiencies of individual students. However, when grouped as intended, plausible values provide unbiased estimates of population characteristics (e.g., means and variances for groups).

Plausible values represent what the performance of an individual on the entire assessment might have been, had it been observed. They are estimated as random draws (usually five) from an empirically derived distribution of score values based on the student's observed responses to assessment items and on background variables. Each random draw from the distribution is considered a representative value from the distribution of potential scale scores for all students in the sample who have similar background characteristics and similar patterns of item responses. Differences between plausible values drawn for a single individual quantify the degree of error (the width of the spread) in the underlying distribution of possible scale scores that could have caused the observed performances.
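
A minimal sketch of how plausible values are typically used for a group estimate follows. It assumes the standard multiple-imputation convention (compute the statistic once per plausible value, average the results, and use the spread across them as the imputation variance), which this section does not spell out. The data are hypothetical, and the sampling variance component is omitted.

```python
from statistics import fmean, variance


def weighted_mean(values, weights):
    """Weighted mean of one plausible value across students."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def combine_plausible_values(pv_columns, weights):
    """Group mean and imputation variance from several plausible values.

    pv_columns holds one list of student values per plausible value.
    """
    # Compute the statistic separately for each plausible value.
    estimates = [weighted_mean(column, weights) for column in pv_columns]
    m = len(estimates)
    point_estimate = fmean(estimates)
    # Variation between the per-draw estimates reflects measurement uncertainty.
    imputation_variance = (1.0 + 1.0 / m) * variance(estimates)
    return point_estimate, imputation_variance


# Hypothetical data: five plausible values for each of three students.
pv_columns = [
    [510.0, 470.0, 530.0],
    [505.0, 480.0, 525.0],
    [515.0, 465.0, 535.0],
    [500.0, 475.0, 540.0],
    [520.0, 460.0, 520.0],
]
weights = [100.0, 50.0, 20.0]

point_estimate, imputation_variance = combine_plausible_values(pv_columns, weights)
```

Averaging the five plausible values for each student first and then analyzing the averages would understate the variability, which is why the statistic is computed once per draw in this sketch.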

An accessible treatment of the derivation and use of plausible values can be found in Beaton and González (1995)¹⁰. More detailed information can be found in Chapter 11 of Methods and Procedures in PIRLS 2016 at https://timssandpirls.bc.edu/publications/pirls/2016-methods.html.

¹⁰ Beaton, A.E., and González, E. (1995). The NAEP Primer. Chestnut Hill, MA: Boston College.