Progress in International Reading Literacy Study (PIRLS)

5. Data Quality and Comparability

Comparisons made in PIRLS (e.g., education systems' averages compared to the U.S. average) are tested for statistical significance, which requires the estimation of standard errors. However, the estimation of correct standard errors is complicated by the complex sample and assessment designs of PIRLS: both designs generate error variance and require a set of statistically complex procedures. Estimates produced using PIRLS data are subject to two types of error: nonsampling error and sampling error. Nonsampling error can result from errors made in collecting and processing data. Sampling error occurs because the data were collected from a sample rather than from a complete census of the population.

Sampling Error

Sampling errors arise when a sample of the population, rather than the whole population, is used to estimate a statistic. Different samples from the same population would likely produce somewhat different estimates of the statistic in question. This means that there is a degree of uncertainty associated with statistics estimated from a sample. This uncertainty, or sampling variance, is usually expressed as the standard error of a statistic estimated from sample data. For PIRLS, there is the additional complexity of the multi-stage cluster and assessment matrix sampling designs, which result in estimated standard errors containing both a sampling variance component—estimated by a jackknife repeated replication (JRR) procedure—and an additional imputation variance component arising from the assessment design.
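As a rough illustration of how the two variance components might be combined, the following Python sketch adds a jackknife sampling variance to an imputation variance computed across plausible values. The function name, argument names, and numbers are hypothetical. The JRR term is simplified here to a plain sum of squared deviations of replicate estimates from the full-sample estimate, and the operational PIRLS procedure may differ in detail.

```python
import math
from statistics import variance


def combined_standard_error(pv_estimates, replicate_estimates, full_sample_estimate):
    """Return a standard error that includes sampling and imputation variance.

    pv_estimates: the statistic computed once per plausible value.
    replicate_estimates: the statistic recomputed for each jackknife replicate.
    full_sample_estimate: the statistic computed from the full sample.
    All three inputs are hypothetical names used only for this sketch.
    """
    m = len(pv_estimates)
    # Sampling variance (simplified JRR): squared deviations of the replicate
    # estimates from the full-sample estimate, summed over replicates.
    sampling_var = sum((r - full_sample_estimate) ** 2 for r in replicate_estimates)
    # Imputation variance: variance among the plausible-value estimates,
    # inflated by (1 + 1/M) because only M plausible values are available.
    imputation_var = (1 + 1 / m) * variance(pv_estimates)
    return math.sqrt(sampling_var + imputation_var)


# Made-up numbers, for illustration only.
pv_estimates = [548.0, 549.5, 549.0, 550.0, 548.5]
replicate_estimates = [550.0, 548.0, 549.5, 547.5]
se = combined_standard_error(pv_estimates, replicate_estimates, 549.0)
print(round(se, 2))
```

In this sketch the sampling component (4.5) is larger than the imputation component (0.75), but both contribute to the reported standard error.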

The matrix sampling design assigns each student a single assessment booklet containing only a portion of the PIRLS assessment. Using the scaling techniques described above, results are aggregated across all booklets to provide results for the entire assessment, with plausible values generated as estimates of student performance on the assessment as a whole. The variability among these plausible values is combined with the sampling error for that variable to provide a standard error that incorporates both error components. The correctly estimated standard errors are then used to construct confidence intervals and to conduct t-tests that compare, for example, other education systems' averages to the U.S. average.
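A comparison of this kind can be sketched in Python as follows. The sketch assumes the two averages come from independent samples and uses a normal-approximation critical value of 1.96 with no adjustment for multiple comparisons. The U.S. values are the 2016 figures reported in this section; the second education system and its values are made up for illustration.

```python
import math


def t_statistic(mean_a, se_a, mean_b, se_b):
    """t statistic for the difference between two independent estimates."""
    return (mean_a - mean_b) / math.sqrt(se_a ** 2 + se_b ** 2)


US_MEAN, US_SE = 549.0, 3.1        # U.S. 2016 average and standard error
OTHER_MEAN, OTHER_SE = 540.0, 2.5  # hypothetical education system (made up)

t = t_statistic(OTHER_MEAN, OTHER_SE, US_MEAN, US_SE)
CRITICAL_VALUE = 1.96  # two-sided test at the .05 level, normal approximation
is_significant = abs(t) > CRITICAL_VALUE
print(round(t, 2), is_significant)
```

Because each standard error already combines sampling and imputation variance, the test needs only the two averages and their standard errors.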

Confidence intervals provide a way to make inferences about population statistics in a manner that reflects the sampling error associated with the statistic. Assuming a normal distribution, the population value of a statistic can be inferred to lie within the 95 percent confidence interval in 95 out of 100 replications of the measurement on different samples drawn from the same population. The 95 percent confidence interval extends 1.96 standard errors on either side of the estimate. For example, the average reading score for U.S. fourth-grade students was 549 in 2016, and this statistic had a standard error of 3.1, so the interval is 549 ± (1.96 × 3.1), or 549 ± 6.1. Therefore, it can be stated with 95 percent confidence that the actual average of U.S. fourth-grade students in 2016 was between 543 and 555.
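The arithmetic behind the U.S. example can be checked with a few lines of Python. This is a minimal sketch that assumes a normal approximation with a multiplier of 1.96; the function name is hypothetical.

```python
def confidence_interval(estimate, standard_error, z=1.96):
    """Return the (lower, upper) bounds of a normal-approximation interval."""
    half_width = z * standard_error
    return estimate - half_width, estimate + half_width


# U.S. fourth-grade average and standard error for 2016, as cited in the text.
lower, upper = confidence_interval(549, 3.1)
print(round(lower), round(upper))  # 543 555
```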

Nonsampling Error

Nonsampling error is a term used to describe variations in the estimates that may be caused by population coverage limitations, nonresponse bias, and measurement error, as well as by data collection, processing, and reporting procedures. The sources of nonsampling error are typically problems such as unit and item nonresponse, differences in respondents' interpretations of the meaning of the survey questions, response differences related to the particular time the survey was conducted, and mistakes in data preparation.

One strategy implemented by PIRLS to reduce nonresponse bias is the a priori identification of replacement schools. Ideally, response rates to study samples should always be 100 percent, and although the PIRLS administrators worked hard to achieve this goal, it was anticipated that a 100 percent participation rate would not be possible in all countries. To avoid sample size losses, the PIRLS sampling plan identified, a priori, replacement schools for each sampled school. Therefore, if an originally selected school refused to participate in the study, it was possible to replace it with a school that already was identified prior to school sampling. Replacement schools always belonged to the same explicit stratum, although they could come from different implicit strata if the originally selected school was either the first or last school of an implicit stratum. Although the use of replacement schools did not eliminate the risk of nonresponse bias, employing implicit stratification and ordering the school sampling frame by size increased the chances that any sampled school's replacements would have similar characteristics. This approach maintains the desired sample size while restricting replacement schools to strata where nonresponse occurred.

Participation (response rate) standards developed by the IEA are then applied. These standards were set using composites of response rates at the school, classroom, student, and teacher levels, and response rates were calculated both with and without the inclusion of replacement (substitute) schools. For 2016, the standards took the following two forms: Category 1, in which the education system met the standards, having minimum school and student participation rates of 85 percent and classroom participation rates of 95 percent; and Category 2, in which the education system met the standards only after substitution. Countries satisfying the Category 1 standard are included in the international tabular presentations without annotation. Those able to satisfy only the Category 2 standard are included as well but are annotated to indicate their response rate status. The data from education systems failing to meet either standard (identified as Category 3 in previous PIRLS administrations) are presented separately in the international tabular presentations. Table PIRLS-1 displays response rates for the U.S. for the 2001, 2006, 2011, and 2016 administrations of PIRLS and ePIRLS.
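The category logic described above can be sketched in Python as follows. This is a simplified illustration that uses only the thresholds stated in the text (85 percent for schools and students, 95 percent for classrooms). The function name, argument names, and example rates are hypothetical, and the full IEA rules may include conditions not modeled here.

```python
SCHOOL_MIN = 85.0     # minimum school participation rate, in percent
STUDENT_MIN = 85.0    # minimum student participation rate, in percent
CLASSROOM_MIN = 95.0  # classroom participation rate, in percent


def participation_category(school_before, school_after, classroom, student):
    """Classify an education system by its participation rates (percent).

    school_before: school rate before replacement schools are included.
    school_after: school rate after replacement schools are included.
    """
    other_rates_ok = classroom >= CLASSROOM_MIN and student >= STUDENT_MIN
    if other_rates_ok and school_before >= SCHOOL_MIN:
        return "Category 1"  # met the standards without substitution
    if other_rates_ok and school_after >= SCHOOL_MIN:
        return "Category 2"  # met the standards only after substitution
    return "Did not meet standards"


# Made-up rates, for illustration only.
print(participation_category(88.0, 93.0, 99.0, 95.0))
print(participation_category(78.0, 90.0, 99.0, 95.0))
print(participation_category(60.0, 70.0, 99.0, 95.0))
```

With these made-up inputs, the three calls return Category 1, Category 2, and the failing status, in that order.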

Data Comparability

From its inception, PIRLS was designed to measure trends in reading literacy achievement. Many of the countries participating in PIRLS 2016 also participated in the previous study cycles in 2001, 2006, and 2011. As a result, these countries have the opportunity to measure progress in reading achievement across four time points: 2001, 2006, 2011, and 2016. In order to ensure comparability of the data across participating education systems, the IEA provides detailed international requirements on the various aspects of data collection, and implements quality control procedures. Participating countries are obliged to follow these requirements, which pertain to target populations, sampling design, sample size, exclusions, and defining participation rates.

In the United States, data used by NCES on fourth-grade students' reading achievement come primarily from two sources: NAEP and PIRLS. There are distinct differences between the two assessments. A comparative study of PIRLS 2011 and NAEP 2009/2011 suggested overall that the NAEP 2011 reading assessment may be more cognitively challenging than PIRLS 2011 for U.S. fourth-grade students and that caution should be exercised when attempting to compare fourth-grade students' performance on PIRLS 2011 with their performance on the NAEP 2011 reading assessment.

For more information on the similarities and differences between PIRLS and NAEP, see A Content Comparison of the NAEP and PIRLS Fourth-Grade Reading Assessments (Binkley and Kelly 2003) and Comparing PIRLS and PISA with NAEP in Reading, Mathematics, and Science (Stephens and Coleman 2007).

Table PIRLS-1. Weighted U.S. response rates for PIRLS assessments: 2001, 2006, 2011, and 2016
Year                    School response rate    Student response rate    Overall response rate
2001                    86                      96                       83
2006                    86                      95                       82
2011                    85                      96                       81
2016 main assessment    92                      94                       86
2016 ePIRLS             89                      90                       80
NOTE: All weighted response rates refer to final adjusted weights. Response rates were calculated using the formula developed by the IEA for PIRLS. The standard NCES formula for computing response rates would result in a lower school response rate. Response rates are after replacement.
SOURCE: PIRLS methodology reports; available at