Each test booklet or computerized version had a different subset of items. The fact that each student completed only a subset of items means that classical test scores, such as the percentage correct, are not accurate measures of student performance. Instead, scaling techniques were used to establish a common scale for all students. For PISA 2012, item response theory (IRT) was used to estimate average scores for mathematics, science, and reading literacy for each education system, as well as for three mathematics process and four mathematics content scales. For education systems participating in the financial literacy assessment and the computer-based assessment, these assessments will be scaled separately and assigned separate scores.
IRT identifies patterns of response and uses statistical models to predict the probability that a student answers an item correctly as a function of that student's proficiency in answering other questions. With this method, the performance of a sample of students in a subject area or subarea can be summarized on a simple scale or series of scales, even when students are administered different items.
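As a rough illustration of why a model-based scale is needed, the sketch below uses a one-parameter logistic (Rasch-type) item response function. The text above does not name the specific IRT model used, so this form is an assumption chosen for simplicity, and the function names, booklets, and item difficulties are hypothetical. It shows that a student with the same proficiency would be expected to earn a different percentage correct depending on which booklet was administered, while the proficiency value itself stays on one common scale.

```python
import math

def p_correct(theta: float, difficulty: float) -> float:
    """Probability of a correct response under a one-parameter logistic
    (Rasch-type) model. ASSUMPTION: this model form is illustrative only."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))

def expected_percent_correct(theta: float, difficulties: list[float]) -> float:
    """Expected percentage correct for a student with proficiency `theta`
    on a booklet made up of items with the given difficulties."""
    probs = [p_correct(theta, b) for b in difficulties]
    return 100.0 * sum(probs) / len(probs)

# Hypothetical item difficulties for two booklets (not PISA values).
booklet_a = [-1.0, 0.0, 0.5]   # easier subset of items
booklet_b = [0.2, 1.0, 1.5]    # harder subset of items

theta = 0.5  # one student's proficiency on the common scale
print(expected_percent_correct(theta, booklet_a))
print(expected_percent_correct(theta, booklet_b))
```

The two printed percentages differ even though the proficiency is identical, which is why percentage correct is not comparable across booklets but an IRT-based scale score is.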
Scores for students are estimated as plausible values because each student completed only a subset of items. Five plausible values were estimated for each student for each scale. These values represent the distribution of potential scores for all students in the population with similar characteristics and identical patterns of item response. Statistics describing performance on the PISA reading, mathematics, and science literacy scales are based on plausible values.
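A common way to compute a statistic from plausible values is to calculate it once per plausible value and then combine the results with the standard multiple-imputation rules. The sketch below assumes that approach. The function name and the input numbers are hypothetical, and the per-value sampling variances are taken as given rather than derived from a survey's replicate weights.

```python
from statistics import mean, variance

def combine_plausible_values(estimates, sampling_variances):
    """Combine one statistic computed separately on each plausible value.

    estimates: the statistic (e.g., a mean score) from each plausible value.
    sampling_variances: the sampling variance of each of those estimates.
    Returns (point_estimate, standard_error).
    """
    m = len(estimates)
    point = mean(estimates)                 # average over plausible values
    within = mean(sampling_variances)       # average sampling variance
    between = variance(estimates)           # variance across plausible values (divisor m - 1)
    total = within + (1 + 1 / m) * between  # standard multiple-imputation combining rule
    return point, total ** 0.5

# Hypothetical mean scores from five plausible values (not PISA results).
pv_means = [481.0, 482.5, 480.2, 483.1, 481.7]
pv_sampling_variances = [9.0, 9.2, 8.8, 9.1, 9.0]

estimate, std_error = combine_plausible_values(pv_means, pv_sampling_variances)
print(estimate, std_error)
```

Averaging the five per-value statistics, rather than averaging the five plausible values for each student first, keeps the measurement uncertainty in the reported standard error.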