Comparisons made in the text of this report have been tested for statistical significance. For example, in the commonly made comparison of OECD averages to U.S. averages, tests were used to establish whether the observed differences from the U.S. average were statistically significant.
In almost all instances, the tests for significance used were standard t tests. These fell into two categories according to the nature of the comparison being made: comparisons of independent samples and comparisons of nonindependent samples. In PISA, education system groups are independent. We judge that a difference is “significant” if the probability associated with the t test is less than .05. If a test is significant, this implies that the difference in the observed means in the sample represents a real difference in the population.1 No adjustments were made for multiple comparisons.
In simple comparisons of independent averages, such as the average score of education system 1 with that of education system 2, the following formula was used to compute the t statistic:

$$ t = \frac{est_1 - est_2}{\sqrt{se_1^2 + se_2^2}} $$

where $est_1$ and $est_2$ are the estimates being compared (e.g., averages of education system 1 and education system 2) and $se_1^2$ and $se_2^2$ are the corresponding squared standard errors of these averages. The PISA 2012 data are hierarchical and include school and student data from the participating schools. The standard errors for each education system take into account the clustered nature of the sampled data. These standard errors are not adjusted for correlations between groups since groups are independent.
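The independent-samples comparison can be sketched in a few lines of Python. This is a minimal illustration, not the report's actual software: the function names and the input values are hypothetical, and the two-sided probability is computed from a standard normal reference distribution, which is an assumption because this section does not state the degrees of freedom used.

```python
import math

def independent_t(est1, se1, est2, se2):
    """t statistic for two independent estimates, given their standard errors."""
    # The standard error of the difference is the root of the summed squared SEs.
    return (est1 - est2) / math.sqrt(se1 ** 2 + se2 ** 2)

def two_sided_p_normal(t):
    """Two-sided probability under a standard normal reference (an assumption)."""
    return math.erfc(abs(t) / math.sqrt(2))

# Hypothetical averages and standard errors, not values from the report.
t = independent_t(est1=500.0, se1=3.0, est2=490.0, se2=4.0)
p = two_sided_p_normal(t)
significant = p < 0.05  # the .05 threshold stated in the text
```

With these made-up inputs the difference of 10 points has a standard error of 5, so t is 2.0 and the difference would be judged significant at the .05 level.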
The second type of comparison occurs when evaluating differences between nonindependent groups within the education system. Because of the sampling design in which schools and students within schools are randomly sampled, the data within the education system from mutually exclusive sets of students (for example, males and females) are not independent. As a result, to determine whether the performance of females differs from the performance of males, for example, the standard error of the difference must be estimated in a way that takes into account the correlation between females’ scores and males’ scores. A BRR procedure, described above, was used to estimate the standard errors of differences between nonindependent samples within the United States. Use of the BRR procedure implicitly accounts for the correlation between groups when calculating the standard errors.
To test comparisons between nonindependent groups the following t statistic formula was used:

$$ t = \frac{est_{grp1} - est_{grp2}}{se_{(grp1 - grp2)}} $$

where $est_{grp1}$ and $est_{grp2}$ are the nonindependent group estimates being compared and $se_{(grp1 - grp2)}$ is the standard error of the difference calculated using BRR to account for the correlation between the estimates for the two nonindependent groups.
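The nonindependent comparison can be sketched as follows. The key point is that the group difference is recomputed within each replicate, so any correlation between the two groups is carried into the variance of the difference. This is a hedged illustration only: the function names, the four replicate values, and the Fay adjustment factor of 0.5 are assumptions made for the example and are not specified in this section; a real analysis would use the full set of replicate weights supplied with the data.

```python
import math

def brr_se_of_difference(full_diff, replicate_diffs, fay_factor=0.5):
    """BRR standard error of a difference (Fay factor is an assumed setting)."""
    g = len(replicate_diffs)
    scale = 1.0 / (g * (1.0 - fay_factor) ** 2)
    variance = scale * sum((d - full_diff) ** 2 for d in replicate_diffs)
    return math.sqrt(variance)

def dependent_t(est_grp1, est_grp2, se_diff):
    """t statistic for two nonindependent group estimates."""
    return (est_grp1 - est_grp2) / se_diff

# Hypothetical full-sample and replicate estimates, not values from the report.
est_grp1, est_grp2 = 505.0, 495.0
grp1_reps = [506.0, 504.0, 506.0, 504.0]
grp2_reps = [494.0, 496.0, 495.0, 495.0]

# Difference within each replicate, which preserves the between-group correlation.
rep_diffs = [a - b for a, b in zip(grp1_reps, grp2_reps)]
se_diff = brr_se_of_difference(est_grp1 - est_grp2, rep_diffs)
t = dependent_t(est_grp1, est_grp2, se_diff)
```

Computing the two group standard errors separately and combining them as in the independent formula would ignore that correlation, which is why the difference is formed replicate by replicate here.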