
Statistical Procedures

Tests of significance

The comparisons presented in the PIRLS and ePIRLS Results from 2016 pages and in the PIRLS and ePIRLS 2016 First Look report have been tested for statistical significance. For example, in the commonly made comparison of international averages to U.S. averages, tests of statistical significance were used to establish whether or not the observed differences from the U.S. average were statistically significant. The tests for significance used were standard t tests. These tests assess whether a difference between two values (e.g., means or percentages) is larger than would be expected from the sampling variance induced by the study design.

In simple comparisons of independent averages, such as the U.S. average with other education systems' averages, the following formula was used to compute the t statistic:

$$t = \frac{E_1 - E_2}{\sqrt{se_1^2 + se_2^2}}$$

E1 and E2 are the two estimates being compared (e.g., the U.S. average and the average of another education system), and se1 and se2 are the corresponding standard errors of these averages. Whether a difference is considered statistically significant is determined by comparing this t value or "test statistic" with tables of t values or "critical values" and their corresponding alpha levels. The alpha level is an a priori statement of the probability of inferring that a difference exists when, in fact, it does not. The alpha level used in all tests is .05; differences discussed in the First Look report or marked in a table or figure have been tested and found significant at this level. Two-tailed tests were performed, and no adjustments for multiple comparisons were made.
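The independent-groups test described above can be sketched in a few lines of Python. This is an illustration only: the function names and the input numbers are hypothetical, and the critical value is taken from the standard normal distribution as a large-sample stand-in for the t tables, because the degrees of freedom are not stated here.

```python
from math import sqrt
from statistics import NormalDist


def independent_t(e1, se1, e2, se2):
    """t statistic for two independent estimates: (E1 - E2) / sqrt(se1^2 + se2^2)."""
    return (e1 - e2) / sqrt(se1 ** 2 + se2 ** 2)


def is_significant(t, alpha=0.05):
    """Two-tailed test with no adjustment for multiple comparisons.

    Assumption: a standard normal critical value is used in place of
    a t-table lookup.
    """
    critical = NormalDist().inv_cdf(1 - alpha / 2)
    return abs(t) > critical


# Hypothetical averages and standard errors, for illustration only.
t = independent_t(550.0, 3.0, 540.0, 4.0)
print(t, is_significant(t))
```

With real data, E1, E2, se1, and se2 would come from the published estimates and their standard errors.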

When a country is compared to the international average, there is an overlap between the samples in the sense that the country is part of the international group. Similarly, when a subgroup (such as males or females) is compared to the overall average (such as the overall U.S. average), the subgroup is part of the overall group. These are referred to as part-whole comparisons. In such comparisons, the following formula was used to compute the t statistic:

$$t = \frac{Est_1 - Est_2}{se(Est_1 - Est_2)}$$

where Est1 and Est2 are the non-independent group estimates being compared, and se(Est1 - Est2) is the standard error of the difference between the two estimates. As with the test for independent groups, two-tailed tests were used for part-whole comparisons (non-independent groups), and no adjustments for multiple comparisons were made.
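As a rough illustration of the part-whole statistic, the Python sketch below divides the difference by a standard error of the difference supplied by the caller. The helper `se_of_difference` uses the general identity Var(A - B) = Var(A) + Var(B) - 2 Cov(A, B) to show why overlapping samples change the denominator. It is not the procedure used for the published results, and all names and numbers are hypothetical.

```python
from math import sqrt


def se_of_difference(se1, se2, cov):
    """Standard error of (Est1 - Est2) from the two standard errors and their covariance.

    Illustrative only: it shows how a positive covariance between
    overlapping groups shrinks the standard error of the difference.
    """
    return sqrt(se1 ** 2 + se2 ** 2 - 2 * cov)


def part_whole_t(est1, est2, se_diff):
    """t statistic for non-independent estimates: (Est1 - Est2) / se(Est1 - Est2)."""
    if se_diff <= 0:
        raise ValueError("se_diff must be positive")
    return (est1 - est2) / se_diff


# Hypothetical subgroup and overall averages, for illustration only.
se_diff = se_of_difference(3.0, 4.0, 4.5)
t = part_whole_t(550.0, 542.0, se_diff)
print(se_diff, t)
```

Setting `cov` to zero reduces the denominator to the independent-groups form, sqrt(se1^2 + se2^2).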

For both formulas, the estimation of the standard errors is complicated by the complex sample and assessment design, both of which generate error variance. Estimating the standard errors correctly therefore requires statistically complex procedures: a replication method called jackknife repeated replication (JRR), which uses the supplied replicate weights, was used. For the test for non-independent groups, use of the replicate weights implicitly accounts for the covariance between the two estimates (e.g., means or percentages) as part of the estimate of the standard error on the difference. For the test for independent groups, the expected value of this covariance will be zero.
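The following Python sketch shows, under simplifying assumptions, how replicate weights can yield a standard error for a difference that already reflects the covariance between overlapping groups. The difference is recomputed under each set of replicate weights, and the squared deviations from the full-sample difference are summed. The `scale` factor, the toy data, and all function names are hypothetical. The sketch covers only the sampling component and ignores the error variance contributed by the assessment design.

```python
from math import sqrt


def weighted_mean(values, weights):
    """Weighted mean of values."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def group_difference(values, weights, in_group1, in_group2):
    """Difference between the weighted means of two (possibly overlapping) groups."""
    def group_mean(mask):
        vals = [v for v, m in zip(values, mask) if m]
        wts = [w for w, m in zip(weights, mask) if m]
        return weighted_mean(vals, wts)
    return group_mean(in_group1) - group_mean(in_group2)


def jrr_se_of_difference(values, full_weights, replicate_weights,
                         in_group1, in_group2, scale=1.0):
    """Replication-based standard error of a difference between group means.

    Assumption: variance = scale * sum of squared deviations of each
    replicate estimate from the full-sample estimate.
    """
    full = group_difference(values, full_weights, in_group1, in_group2)
    squared_devs = [
        (group_difference(values, rw, in_group1, in_group2) - full) ** 2
        for rw in replicate_weights
    ]
    return sqrt(scale * sum(squared_devs))


# Toy data: four students, whole group versus a subgroup (a part-whole comparison).
scores = [500.0, 520.0, 540.0, 560.0]
weights = [1.0, 1.0, 1.0, 1.0]
replicates = [[2, 0, 1, 1], [0, 2, 1, 1], [1, 1, 2, 0], [1, 1, 0, 2]]
whole = [True, True, True, True]
part = [True, True, False, False]
se_diff = jrr_se_of_difference(scores, weights, replicates, whole, part)
print(se_diff)
```

Because both group means are recomputed from the same replicate weights, any dependence between them carries through to the replicate-to-replicate variation of the difference, so no separate covariance term is needed.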