
Scaling of Student Test Data


Each test form contained a different subset of items. Because each student completed only a subset of all possible items, classical test scores, such as the percentage correct, are not accurate measures of student performance. Instead, scaling techniques were used to establish a common scale for all students. For PISA 2015, item response theory (IRT) was used to estimate average scores in science, reading, and mathematics literacy for each education system, as well as scores on three science process and three science content subscales. The financial literacy and collaborative problem solving assessments were scaled separately, and the education systems that participated in them received separate scores for each.

IRT identifies patterns of response and uses statistical models to predict the probability that a student answers an item correctly as a function of his or her proficiency, as demonstrated in responses to other items. With this method, the performance of a sample of students in a subject area or subarea can be summarized on a single scale or series of scales, even when students are administered different items.
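As an illustration only, the sketch below shows a two-parameter logistic (2PL) item response function, one common form of IRT model. The item parameters and proficiency values are invented for the example and are not actual PISA item parameters or the operational PISA scaling procedure.

```python
import numpy as np

def irt_2pl_probability(theta, difficulty, discrimination=1.0):
    """Probability of a correct response under a 2PL IRT model.

    theta: student proficiency on the latent scale
    difficulty: item difficulty parameter (b)
    discrimination: item discrimination parameter (a)
    """
    return 1.0 / (1.0 + np.exp(-discrimination * (theta - difficulty)))

# Illustrative (not actual PISA) item parameters: one harder, more
# discriminating item and one easier, less discriminating item.
items = [
    {"difficulty": 1.2, "discrimination": 1.5},
    {"difficulty": -0.5, "discrimination": 0.8},
]

# A more proficient student has a higher probability of answering each item
# correctly, regardless of which subset of items appears on their test form.
for theta in (-1.0, 0.0, 1.0):
    probs = [irt_2pl_probability(theta, **item) for item in items]
    print(f"theta={theta:+.1f}: " + ", ".join(f"{p:.2f}" for p in probs))
```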

Because each student completed only a subset of items, scores for students were estimated as plausible values. Ten plausible values were estimated for each student for each scale. These values represent the distribution of potential scores for all students in the population with similar characteristics and identical patterns of item response. Statistics describing performance on the PISA science, reading, and mathematics scales are based on these plausible values. In PISA, the science, mathematics, and reading literacy scales range from 0 to 1,000.
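The sketch below illustrates, under simplified assumptions, how a statistic such as a group mean is typically computed from plausible values: the statistic is calculated once for each plausible value and the results are averaged, with the variability across plausible values contributing a measurement component to the error variance. The data are invented, and the sampling variance component that PISA derives from replicate weights is omitted for brevity.

```python
import numpy as np

# Invented example data: 10 plausible values (columns) for 5 students (rows).
# In the public PISA 2015 student data files these appear as variables such as
# PV1SCIE through PV10SCIE.
rng = np.random.default_rng(0)
plausible_values = rng.normal(loc=500, scale=90, size=(5, 10))

# Compute the statistic of interest (here, the mean) once per plausible value,
# then average the results to obtain the reported estimate.
means = plausible_values.mean(axis=0)   # one estimate per plausible value
estimate = means.mean()                 # final estimate: average over the 10

# The variability across plausible values supplies the measurement (imputation)
# component of the error variance. In practice this is combined with a sampling
# variance estimated from replicate weights, which is omitted here.
m = means.size
imputation_variance = (1 + 1 / m) * means.var(ddof=1)

print(f"estimated mean: {estimate:.1f}")
print(f"imputation variance: {imputation_variance:.1f}")
```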