Comparisons made in the text of this report have been tested for statistical significance. For example, in the commonly made comparison of OECD averages to U.S. averages, tests of statistical significance were used to establish whether or not the observed differences from the U.S. average were statistically significant.
In almost all instances, the tests for significance used were standard t tests. These fell into three categories according to the nature of the comparison being made: comparisons of independent samples, comparisons of nonindependent samples, and comparisons of performance over time. In PISA, education system groups are independent. We judge a difference to be "significant" if the probability associated with the t test is less than .05. A significant test implies that the difference between the observed sample means represents a real difference in the population.6 No adjustments were made for multiple comparisons.
In simple comparisons of independent averages, such as the average score of education system 1 with that of education system 2, the following formula was used to compute the t statistic:
t = (est1 – est2) / SQRT[se1² + se2²]
where est1 and est2 are the estimates being compared (e.g., averages of education system 1 and education system 2) and se1² and se2² are the corresponding squared standard errors of these averages. The PISA 2015 data are hierarchical and include school and student data from the participating schools. The standard errors for each education system take into account the clustered nature of the sampled data. These standard errors are not adjusted for correlations between groups since groups are independent.
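The independent-samples comparison described above can be sketched in a few lines of Python. This is an illustration only: the function name t_independent and the input values are hypothetical, and the 1.96 cutoff is an assumed large-sample (normal approximation) critical value for a two-sided test at the .05 level, which this section does not state explicitly.

```python
import math


def t_independent(est1, se1, est2, se2):
    """t statistic for two independent estimates and their standard errors."""
    return (est1 - est2) / math.sqrt(se1 ** 2 + se2 ** 2)


# Hypothetical averages and standard errors, for illustration only.
t = t_independent(500.0, 3.0, 490.0, 4.0)

# Assumed cutoff: 1.96 approximates the two-sided .05 critical value
# when the degrees of freedom are large.
significant = abs(t) > 1.96

print(t, significant)
```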
The second type of comparison occurs when evaluating differences between nonindependent groups within the education system. Because of the sampling design, in which schools and students within schools are randomly sampled, the data within the education system from mutually exclusive sets of students (for example, males and females) are not independent. For example, determining whether the performance of females differs from that of males would require estimating the correlation between females' and males' scores. A BRR procedure, mentioned above, was used to estimate the standard errors of differences between nonindependent samples within the United States. Use of the BRR procedure implicitly accounts for the correlation between groups when calculating the standard errors.
To test comparisons between nonindependent groups, the following t statistic formula was used:
t = (estgrp1 – estgrp2) / se(grp1-grp2)
where estgrp1 and estgrp2 are the nonindependent group estimates being compared and se(grp1-grp2) is the standard error of the difference calculated using BRR to account for the correlation between the estimates for the two nonindependent groups.
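A minimal sketch of how a BRR standard error of a difference might be computed follows. It rests on assumptions not stated in this section: replicate estimates for each group are already available, and the variance uses Fay's variant of BRR with a factor of 0.5. The function names and the four replicate values are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
import math


def brr_se_of_difference(full_diff, replicate_diffs, fay_k=0.5):
    """Standard error of a difference from replicate differences.

    Assumes Fay's variant of BRR: the sum of squared deviations of the
    replicate differences from the full-sample difference, divided by
    G * (1 - k)^2, where G is the number of replicates.
    """
    g = len(replicate_diffs)
    sum_sq = sum((d - full_diff) ** 2 for d in replicate_diffs)
    return math.sqrt(sum_sq / (g * (1 - fay_k) ** 2))


def t_nonindependent(est_grp1, est_grp2, se_diff):
    """t statistic using the standard error of the difference itself."""
    return (est_grp1 - est_grp2) / se_diff


# Hypothetical full-sample and replicate estimates for two groups.
female_full, male_full = 500.0, 490.0
female_reps = [499.0, 501.0, 499.0, 501.0]
male_reps = [491.0, 489.0, 491.0, 489.0]

# The difference is recomputed within each replicate, so any correlation
# between the two group estimates is carried into the standard error.
rep_diffs = [f - m for f, m in zip(female_reps, male_reps)]
se_diff = brr_se_of_difference(female_full - male_full, rep_diffs)
t = t_nonindependent(female_full, male_full, se_diff)

print(se_diff, t)
```

Computing the difference replicate by replicate, rather than combining two separately estimated standard errors, is what lets the procedure account for the correlation between the groups without estimating it directly.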
A third type of comparison—the addition of a standard error term to the standard t test shown above for simple comparisons of independent averages—was also used when analyzing change in performance over time. The transformation that was performed to equate the 2015 data with previous data depends upon the change in difficulty of each of the individual link items and as a consequence the sample of link items that have been chosen will influence the choice of transformation. This means that if an alternative set of link items had been chosen the resulting transformation would be slightly different. The consequence is an uncertainty in the transformation due to the sampling of the link items, just as there is an uncertainty in values such as country means due to the use of a sample of students. This uncertainty that results from the link item sampling is referred to as "linking error," and this error must be taken into account when making certain comparisons between previous rounds of PISA (2003, 2006, 2009, and 2012) and PISA 2015 results. Just as with the error that is introduced through the process of sampling students, the exact magnitude of this linking error cannot be determined. We can, however, estimate the likely range of magnitudes for this error and take this error into account when interpreting PISA results. As with sampling errors, the likely range of magnitude for the errors is represented as a standard error. The standard errors of linking for the various PISA rounds and subjects are:
† Not applicable. Science trend comparisons can only be made as far back as 2006 due to a change in the framework.
In PISA, in each of the three subject matter areas, a common transformation was estimated from the link items, and this transformation was applied to all participating education systems when comparing achievement scores over time. It follows that any uncertainty introduced through the linking is common to all students and all education systems. Thus, for example, suppose the unknown linking error (between PISA 2012 and PISA 2015) in reading literacy resulted in an over-estimation of student scores by five and one-fourth points on the PISA 2012 scale. Every student's score would then be over-estimated by five and one-fourth score points. This over-estimation will affect some, but not all, summary statistics computed from the PISA 2015 data. For example, consider the following:
In general terms, the linking error need only be considered when comparisons are being made between PISA 2012 and PISA 2015 results, and then usually only when group means are being compared. Because the linking error need only be used in a limited range of situations, we have chosen not to report the linking error in the tables included in this report. The general formula is given by:
t = (est1 – est2) / SQRT[se1² + se2² + selink²]
where selink is the standard error of linking for the subject and pair of PISA rounds being compared.
The most obvious example of a situation where there is a need to use linking error is in the comparison of the mean performance for a single education system between PISA 2012 and PISA 2015. For example, let us consider a comparison between 2012 and 2015 of the performance of the United States in reading. The mean performance of the United States in 2012 was 498 with a standard error of 3.7, while in 2015 the mean was 497 with a standard error of 3.4. Using rounded mean values, the standardized difference in the U.S. means is 0.138, which is computed as follows:
0.138 = (498 – 497) / SQRT[3.7² + 3.4² + 5.2535²]
and is not statistically significant.
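The worked example above can be checked with a short Python sketch. The function name t_over_time is hypothetical; the inputs are the U.S. reading means, standard errors, and linking error quoted in the text. The 1.96 cutoff is an assumed large-sample critical value for a two-sided test at the .05 level.

```python
import math


def t_over_time(est1, se1, est2, se2, linking_se):
    """t statistic for a change over time, with the linking error added
    as a third variance term in the denominator."""
    return (est1 - est2) / math.sqrt(se1 ** 2 + se2 ** 2 + linking_se ** 2)


# U.S. reading: 2012 mean 498 (SE 3.7), 2015 mean 497 (SE 3.4),
# linking error 5.2535, all as given in the text.
t = t_over_time(498, 3.7, 497, 3.4, 5.2535)

print(round(t, 3))     # 0.138
print(abs(t) > 1.96)   # False: not significant under the assumed cutoff
```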
6 A .05 probability implies that the t statistic is among the 5 percent most extreme values one would expect if there were no difference between the means. The decision rule is that when a t statistic is this extreme, the samples are concluded to come from populations in which there is a difference between the means.