Fourth-grade student population. The target population for PIRLS 2001 was defined as all students enrolled in the upper of the two adjacent grades that contain the largest proportion of 9-year-olds at the time of testing. This target grade was usually the fourth grade of primary school. Because fourth grade generally signals the completion of formal reading instruction, countries for which the target grade would have been the third grade were permitted to retain the fourth grade as their target grade. The PIRLS 2001 target population was derived from that used by TIMSS in 1995 and was identical to that used by TIMSS 2003 at the primary school level.
For PIRLS 2006, the target population was defined as all students enrolled in the fourth grade of formal schooling, counting from the first year of primary school as defined by the United Nations Educational, Scientific, and Cultural Organization (UNESCO) International Standard Classification of Education (ISCED). Accordingly, the fourth year of formal schooling was the fourth grade in most countries.
The target population for PIRLS 2011 was all students enrolled in the grade that represents four years of schooling, counting from the first year of ISCED Level 1. ISCED provides an international standard for describing levels of schooling across countries. The ISCED system describes the full range of schooling, from preprimary (Level 0) to the second stage of tertiary education (Level 6). ISCED Level 1 corresponds to primary education or the first stage of basic education, and the fourth year of ISCED Level 1 is therefore the PIRLS target grade, which is the fourth grade in most countries. However, given the linguistic and cognitive demands of reading, PIRLS is designed to avoid assessing very young children. Countries were therefore encouraged to assess the next higher grade (i.e., fifth grade) if the average age of fourth-grade students at the time of testing would be less than 9.5 years.
For most countries participating in PIRLS 2011, the target grade was fourth grade. However, in England, Malta, New Zealand, and Trinidad and Tobago, children begin primary school at a younger age. These countries therefore assessed students in their fifth year of schooling, and even so their students were still among the youngest in PIRLS 2011.
Several new initiatives introduced in 2011 affected the target population in some countries. One was prePIRLS, developed as a less difficult version of PIRLS to provide more assessment options for developing countries where students may not be prepared for the demands of PIRLS. prePIRLS was based on the same view of reading comprehension as PIRLS but was designed to assess basic reading skills that are a prerequisite for success on PIRLS. Botswana, Colombia, and South Africa administered prePIRLS to their fourth-grade students. Colombia also administered PIRLS to the same fourth-grade students, providing a basis for linking the PIRLS and prePIRLS scales. Also in 2011, countries where the assessment might be too difficult for fourth-grade students were permitted to administer PIRLS in the fifth or sixth grade. Accordingly, Botswana, Honduras, Kuwait, and Morocco chose to administer PIRLS in both the sixth and fourth grades.
The target population for PIRLS 2016 was the same as for 2011. The PIRLS 2016 cycle also included PIRLS Literacy, a new, less difficult reading literacy assessment, and ePIRLS, an extension of PIRLS with a focus on online informational reading.
Teacher population. The target teacher population consists of all teachers linked to the selected students. Note that these teachers are therefore not a representative sample of teachers within an education system. Rather, they are the teachers who teach a representative sample of students in grade 4 within the education system.
School population. The target school population consists of all eligible schools containing one or more fourth-grade classrooms.
PIRLS uses a two-stage stratified cluster sample design. The first stage consists of a sample of schools, which may be stratified; the second stage consists of a sample of one or more classrooms from the target grade in sampled schools.
First-stage sampling selects individual schools using a probability proportional to size (PPS) approach, meaning that a school's selection probability is proportional to the estimated number of students enrolled in the target grade. Substitution schools are also selected to replace any originally sampled schools that refuse to participate; the original and substitution schools are selected simultaneously. In the second stage of sampling, one or two fourth-grade classes are randomly sampled in each participating school.
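To illustrate the first-stage selection, the sketch below implements a basic systematic probability proportional to size draw in Python. It is a simplified illustration only, assuming a flat school frame: the function name and data layout are invented for this example, and the operational PIRLS design also handles explicit strata, sorting for implicit stratification, and the simultaneous identification of substitution schools.

```python
import random

def pps_systematic_sample(schools, n_sample, seed=12345):
    """Select schools with probability proportional to estimated target-grade
    enrollment, using a systematic draw along the cumulated measure of size.

    schools : list of (school_id, estimated_grade4_enrollment) tuples
    n_sample: number of schools to select (e.g., 150)
    """
    rng = random.Random(seed)
    total_size = sum(size for _, size in schools)
    interval = total_size / n_sample            # sampling interval
    start = rng.uniform(0, interval)            # single random start
    points = [start + i * interval for i in range(n_sample)]

    selected, cumulative, j = [], 0.0, 0
    for school_id, size in schools:
        cumulative += size
        # every selection point falling within this school's size range hits it
        while j < n_sample and points[j] <= cumulative:
            selected.append(school_id)
            j += 1
    return selected
```

Under this scheme, a school's chance of selection is approximately n_sample times its share of total enrollment (reaching certainty for very large schools), which is the PPS property described above.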
PIRLS guidelines call for a minimum of 150 schools to be sampled in each education system, with a minimum of 4,000 students assessed. A sample of 150 schools yields 95 percent confidence limits for school-level and classroom-level mean estimates that are precise to within 16 percent of their standard deviations. Countries with small class sizes or fewer than 30 students per school are directed to consider sampling more schools, more classrooms per school, or both, to meet the minimum target of 4,000 tested students. For countries choosing to participate in both PIRLS and PIRLS Literacy, the required student sample size is doubled, to around 8,000 sampled students; countries could choose to select more schools or more classes within sampled schools to achieve the required sample size. Because ePIRLS is designed to be administered to students also taking PIRLS, the PIRLS sample size requirement remains the same for countries choosing also to participate in ePIRLS.
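As a rough check on the stated precision, if the 150 sampled schools are treated as a simple random sample and design effects are ignored, the half-width of a 95 percent confidence interval for a school-level mean is approximately

$$1.96 \times \frac{s}{\sqrt{150}} \approx 0.16\,s,$$

that is, about 16 percent of the standard deviation s of the school-level means, consistent with the figure cited above. The actual PIRLS precision calculations also account for stratification and clustering.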
In the United States, the PIRLS 2001 sample consisted of 3,763 fourth-grade students from 174 schools (after substitution). In 2006, the U.S. sample consisted of 5,190 fourth-grade students from 183 schools (after substitution).
For the 2011 data collection, the U.S. sample consisted of 12,726 fourth-grade students from 370 schools (after substitution). The larger sample in 2011 was due to the concurrent administration of the Trends in International Mathematics and Science Study (TIMSS). To accommodate this concurrent administration, schools with at least two grade 4 classrooms were asked to participate in both studies, with one classroom randomly assigned to TIMSS and the other to PIRLS.
In the United States, one sample was drawn to represent the nation at grade 4 for PIRLS 2011. In addition to this national sample, a state public school sample was drawn at grade 4 for Florida, which chose to participate in PIRLS separately from the nation in order to benchmark its students' performance internationally. The sampling frame for public schools in the United States was based on the 2011 National Assessment of Educational Progress (NAEP) sampling frame, which in turn was based on the 2007–08 Common Core of Data (CCD). The PIRLS 2011 data for private schools were drawn from the 2007–08 Private School Universe Survey (PSS). Any school containing at least one grade 4 class was included in the school sampling frame.
The U.S. PIRLS 2016 national school sample consisted of 176 schools, which was higher than the international sampling minimum of 150 to offset anticipated school nonresponse and ineligibility. A total of 158 U.S. schools agreed to participate in PIRLS 2016, including 131 from the original sample and 27 sampled as replacements for nonparticipating schools from the original sample. Of the 158 U.S. schools that participated in PIRLS, 153 also participated in ePIRLS. In total, 4,425 U.S. students participated in PIRLS and 4,090 of these students also participated in ePIRLS.
The U.S. sampling frame was explicitly stratified by three categorical variables: poverty status (high or low, defined by the percentage of students eligible for free or reduced-price lunch); type of school (public or private); and region of the country (Northeast, Central, West, Southeast). The U.S. sample was implicitly stratified (that is, sorted for sampling) by two categorical variables: locality (four levels) and minority status (above or below 15 percent of the student population).
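The sketch below, in Python, illustrates how explicit and implicit stratification can be combined before systematic PPS selection: schools are first grouped by the explicit stratification variables and then sorted within each group by the implicit stratification variables. The dictionary keys used here (poverty, sector, region, locale, minority) are illustrative labels, not the actual frame variable names.

```python
from itertools import groupby

def build_stratified_frame(schools):
    """Group schools by explicit strata, then sort within each stratum by the
    implicit stratification variables so that a systematic PPS draw spreads
    the sample across those categories. Illustrative sketch only."""
    explicit = lambda s: (s["poverty"], s["sector"], s["region"])
    implicit = lambda s: (s["locale"], s["minority"])

    frame = {}
    for key, group in groupby(sorted(schools, key=explicit), key=explicit):
        frame[key] = sorted(group, key=implicit)   # "sorted for sampling"
    return frame
```

A separate systematic PPS draw, such as the one sketched earlier, would then be run within each explicit stratum.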
PIRLS is sponsored by the International Association for the Evaluation of Educational Achievement (IEA) and carried out under a contract with the TIMSS & PIRLS International Study Center and the data collection contractor. The National Center for Education Statistics, in the Institute of Education Sciences at the U.S. Department of Education, is responsible for the implementation of PIRLS in the United States. PIRLS emphasizes the use of standardized procedures in all participating education systems: each education system collects its own data based on comprehensive manuals and training materials that explain the survey's implementation, including precise instructions for the work of school coordinators and scripts for test administrators to use in testing sessions. The International Study Center monitors compliance with the standardized procedures.
The PIRLS 2001 instruments were translated into 35 languages. The PIRLS 2006 instruments were again prepared in English and then translated into 45 languages. Although most countries administer the assessment in just one language, there have been some exceptions. For example, in 2006, nine countries plus the five Canadian provinces administered PIRLS in two languages, Spain administered the assessment in its five official languages, and South Africa administered the assessment in eleven languages. To ensure comparability among translated instruments, the International Study Center established guidelines and reviewed and approved all adaptations. For PIRLS 2011, the assessment was translated into 45 different languages.
The PIRLS 2016 assessment instruments were translated into 40 different languages across 50 participating countries and 6 benchmarking entities; the PIRLS Literacy assessment instruments were translated into 10 languages across 6 countries; and the ePIRLS assessment instruments were translated into 14 languages across 14 countries and 2 benchmarking entities. Of these participants, 24 countries and 4 benchmarking entities administered the instruments in more than one language.
The IEA provides overall support in coordinating PIRLS. The Secretariat, located in Amsterdam, has particular responsibility for membership, translation verification, and hiring the quality control monitors. The Data Processing and Research Center, located in Hamburg, is responsible for the accuracy and consistency of the PIRLS database within and across countries.
Reference dates. PIRLS is administered near the end of the school year in each education system. For PIRLS 2001, in education systems in the Northern Hemisphere where the school year typically ends in May or June, the assessment was conducted in April, May, or June 2001. In the Southern Hemisphere where the school year typically ends in November or December, the assessment was conducted in October or November 2001.
For PIRLS 2006, education systems in the Northern Hemisphere conducted the assessment between March and May 2006. In the United States, data collection began slightly earlier and ended in early June. In the Southern Hemisphere the assessment was conducted in October 2005.
For PIRLS 2011, the education systems in the Southern Hemisphere conducted the study between October and December 2010. Education systems in the Northern Hemisphere conducted the assessment between March and June 2011.
For PIRLS 2016, the education systems in the Southern Hemisphere conducted the study between October and December 2015. Education systems in the Northern Hemisphere conducted the assessment between March and June 2016.
Data collection and cleaning. Each country was responsible for carrying out all aspects of the data collection by using standardized procedures developed for the study. Manuals provided explicit instructions to the National Research Coordinators (NRCs) and their staff members on all aspects of the data collection, from contacting sampled schools to packing and shipping materials to the IEA Data Processing Center for processing and verification.
The International Study Center monitored compliance with the standardized procedures. NRCs were asked to nominate one or more persons unconnected with their national center, such as retired school teachers, to serve as quality control monitors for their education systems. The International Study Center developed manuals for the monitors and briefed them in 2-day training sessions about PIRLS, the responsibilities of the national centers in conducting the study, and their own roles and responsibilities. For the PIRLS 2001 test administration, 15 schools in each country were observed. For 2006, 10 percent of schools' test administrations were visited by monitors, and for PIRLS 2011, some 30 of the 370 schools in the sample were visited by monitors. For PIRLS 2016, International Quality Control Monitors observed 814 PIRLS/PIRLS Literacy testing sessions and 209 ePIRLS testing sessions.
The NRC in each education system was responsible for the scoring and coding of data in that education system, following established guidelines. The NRC and, sometimes, additional staff attended scoring training sessions held by the International Study Center. The training sessions focused on the scoring rubrics and coding system employed in PIRLS. Participants in these training sessions were provided extensive practice in scoring example items over several days. Information on within-education-system agreement among coders was collected and documented by the International Study Center. Information on scoring and coding reliability was also used to calculate cross-education-system agreement among coders.
The NRC from each education system was responsible for data entry. In the United States, the data collection contractor collected data for PIRLS 2016 and entered the data into data files with a pre-specified, common international format. IEA-supplied data-entry software (WinDEM) facilitated the checking and correction of data by providing various data consistency checks. The data were then sent to the IEA Data Processing Center (DPC) in Hamburg, Germany, for cleaning. The DPC checked that the international data structure was followed; checked the identification system within and between files; corrected single case problems manually; and applied standard cleaning procedures to questionnaire files. Results of the data cleaning process were documented by the DPC. This documentation was then sent to the NRC along with any remaining questions about the data. The NRC then provided the DPC with revisions to coding or solutions for anomalies. The DPC subsequently compiled background univariate statistics and preliminary test scores based on classical item analysis and item response theory (IRT).
Before the collected data are analyzed, student records are assigned sampling weights to ensure that student representation in the PIRLS analysis closely matches the prevalence of groups in the student population for the grade assessed. Under the PIRLS sample design, schools and students have unequal but known probabilities of selection; as a consequence, the sampling weights supplied in the data files must be applied in all analyses in order for the results to generalize to the population.
After sample weights are assigned, scaling and estimation can be conducted. During the scaling phase, IRT procedures are used to estimate the measurement characteristics of each assessment question. During the estimation phase, the results of the scaling are used to produce estimates of student achievement. Subsequent analyses relate the achievement results to the background variables collected by PIRLS.
Weighting. Students are assigned sampling weights to adjust for over- or under-representation of particular groups in the final sample. When students are weighted, none of the data are discarded and each student contributes to the results for the total number of students represented. The weight assigned to a student is therefore the inverse of the probability that the student was selected for the sample. The use of sampling weights is necessary for the computation of sound, nationally representative estimates. Weighting also adjusts for various situations, such as school and student nonresponse, because data cannot be assumed to be missing at random. All PIRLS 2001, 2006, 2011, and 2016 analyses are conducted using sampling weights, which are calculated according to a three-step procedure involving selection probabilities for schools, classrooms, and students.
School weight. The first step consists of calculating a school weight, which also incorporates weighting factors from any additional front-end sampling stages, such as regions. A school-level participation adjustment is then made to the school weight to compensate for any sampled schools that did not participate and were not replaced. This adjustment is calculated independently for each explicit stratum.
Classroom weight. In the second step, a classroom weight is calculated, reflecting the probability that the sampled classroom(s) would be selected from among all the classrooms in the school at the target grade level. This weight is calculated independently for each participating school. If a sampled classroom in a school did not participate, or if the participation rate among students in a classroom fell below 50 percent, a classroom-level participation adjustment is made to the classroom weight. The classroom participation adjustment can occur only within participating schools (a school is considered a participating school if and only if at least one sampled classroom has at least 50 percent of its students participating in the study). If one of at least two selected classrooms in a school did not participate, the classroom participation adjustment is computed at the explicit stratum level, rather than at the school level, to reduce the risk of bias.
Student weight. The third and final step consists of calculating a student weight. For most PIRLS participants, intact classrooms are sampled, so each student in the sampled classrooms is certain of selection, making the student weight 1.0. When students are further sampled within classrooms, a student weight reflecting the probability of the sampled students being selected within the classroom is calculated. A nonparticipation adjustment is then made to adjust for sampled students who did not take part in the testing. This adjustment is calculated independently for each sampled classroom.
Overall (basic) sampling weight. The overall student sampling weight is the product of the three weights just described and includes any nonparticipation adjustments that were made.
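In simplified form, the overall student weight is the product of the three component weights, each carrying its own nonparticipation adjustment. The following Python sketch uses illustrative names rather than the official PIRLS weighting variables.

```python
def overall_student_weight(school_base_weight, school_participation_adj,
                           class_base_weight, class_participation_adj,
                           student_base_weight, student_participation_adj):
    """Overall (basic) sampling weight: the product of the school, classroom,
    and student weights, including their nonparticipation adjustments."""
    return ((school_base_weight * school_participation_adj)
            * (class_base_weight * class_participation_adj)
            * (student_base_weight * student_participation_adj))

# Example: a school sampled with probability 1/50, one of two classrooms
# sampled, and an intact classroom (student base weight 1.0) with a small
# adjustment for two absent students out of twenty.
w = overall_student_weight(50.0, 1.0, 2.0, 1.0, 1.0, 20 / 18)
```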
Scaling. The primary approach to reporting PIRLS achievement data is based on IRT scaling methods. The IRT analysis provides a common scale on which performance can be compared across countries. Student reading achievement is summarized using a family of IRT models. The IRT methodology is preferred for developing comparable estimates of performance for all students, since students respond to different passages and items depending upon which of the test booklets they receive. This methodology produces a score for each student based on that student's pattern of item responses, taking into account the difficulty and discriminating power of each item. To enable comparisons across PIRLS assessments, common test items are included in successive administrations, and any item parameters that change dramatically are treated as unique items.
The propensity of students to answer questions correctly is estimated for PIRLS using a two-parameter IRT model for dichotomous constructed-response items, a three-parameter IRT model for multiple-choice items, and a generalized partial credit IRT model for polytomous constructed-response items. The scale scores assigned to each student are estimated using a plausible values procedure, with input from the IRT results. With IRT, the difficulty of each item, or item category, is deduced using information about how likely students are to answer some items correctly (or to receive a higher rating on a constructed-response item) versus other items. Once the parameters of each item are determined, the ability of each student can be estimated even when different students have been administered different items. At this point in the estimation process, achievement scores are expressed on a standardized logit scale. To make the scores more meaningful and to facilitate their interpretation, the scores for the PIRLS 2001 assessment were transformed to a scale with a mean of 500 and a standard deviation of 100.
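The item response functions behind these three models can be written compactly. The Python sketch below shows one common parameterization (including the 1.7 scaling constant); the exact parameterization used operationally may differ, so treat this as an illustration of the model forms rather than the PIRLS production code.

```python
import math

def p_3pl(theta, a, b, c):
    """Three-parameter logistic model for a multiple-choice item:
    discrimination a, difficulty b, and pseudo-guessing parameter c."""
    return c + (1.0 - c) / (1.0 + math.exp(-1.7 * a * (theta - b)))

def p_2pl(theta, a, b):
    """Two-parameter model for a dichotomous constructed-response item (c = 0)."""
    return p_3pl(theta, a, b, 0.0)

def p_gpcm(theta, a, b, d):
    """Generalized partial credit model for a polytomous item with location b
    and step parameters d; returns the probability of each score category."""
    steps = [0.0] + [1.7 * a * (theta - b + dk) for dk in d]
    numerators = [math.exp(sum(steps[:k + 1])) for k in range(len(steps))]
    total = sum(numerators)
    return [n / total for n in numerators]
```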
To make PIRLS 2006 scores comparable to 2001 scores, the 2001 and 2006 data for countries that participated in both years were first scaled together to estimate item parameters. Ability estimates for all students in the 2001 and 2006 assessments were then computed based on the new item parameters. A linear transformation was then applied to put these estimates on the 2001 metric, so that the jointly calibrated 2001 scores have the same mean and standard deviation as the original 2001 scores. This transformation preserves any differences in average scores between the 2001 and 2006 waves of assessment.
To make PIRLS 2011 scores comparable to 2001, these steps were repeated for the 2006 and 2011 data: the two adjacent waves were jointly scaled, and the resulting ability estimates were linearly transformed so that the mean and standard deviation of the prior wave were preserved. As a result, the transformed 2011 scores are comparable to all previous waves of the assessment, and longitudinal comparisons across all waves of data are meaningful.
To provide results for the PIRLS 2016 assessment on the PIRLS achievement scales, the 2016 proficiency scores (plausible values) for overall reading had to be transformed to the PIRLS reporting metric. This was accomplished through a set of linear transformations as part of the concurrent calibration approach. The linear transformation constants were obtained by first computing the international means and standard deviations of the proficiency scores for the overall reading scale using the plausible values produced in 2011, based on the 2011 item calibrations for the trend countries; these were the plausible values published in 2011. Next, the same calculations were done using the plausible values from the re-scaled PIRLS 2011 assessment data based on the 2016 concurrent item calibration for the same set of countries. There are five sets of transformation constants for the PIRLS reading scale, one for each plausible value. The trend countries contributed equally in the calculation of these transformation constants. These linear transformation constants were applied to the overall reading proficiency scores for all participating countries and benchmarking participants. This provided student achievement scores for the PIRLS 2016 assessment that are directly comparable to the scores from all previous assessments.
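The linking step described above amounts to choosing, for each plausible value, constants A and B so that the re-scaled trend-country scores reproduce the previously published mean and standard deviation. The sketch below shows the idea; the function and variable names are illustrative, not those used in the operational scaling.

```python
import statistics

def linking_constants(published_pvs, rescaled_pvs):
    """Compute linear transformation constants (A, B) so that the re-scaled
    plausible values, once transformed, match the mean and standard deviation
    of the previously published plausible values for the trend countries."""
    mu_pub, sd_pub = statistics.mean(published_pvs), statistics.pstdev(published_pvs)
    mu_new, sd_new = statistics.mean(rescaled_pvs), statistics.pstdev(rescaled_pvs)
    B = sd_pub / sd_new
    A = mu_pub - B * mu_new
    return A, B

def to_reporting_metric(score, A, B):
    """Apply the linear transformation to place a proficiency score on the
    established reporting metric."""
    return A + B * score
```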
Much like the regular PIRLS scaling procedure, the PIRLS Literacy scaling approach involved the same four tasks: calibrating the achievement items, creating principal components for conditioning, generating proficiency scores, and placing these proficiency scores on the PIRLS reading reporting scale.
The ePIRLS scaling methodology adopted the same four steps of calibration, conditioning, generating proficiency scores, and placing those scores on the PIRLS reading scale.
In the PIRLS 2001 analysis, achievement scales were produced for each of the two reading purposes (reading for literary experience and reading for information) as well as for reading overall. The PIRLS 2006 reading achievement scales were designed to provide reliable measures of student achievement common to both the 2001 and 2006 assessments, based on the metric established originally in 2001.
Plausible values, estimation, multiple imputation. Most cognitive skills testing is concerned with accurately assessing the performance of individual respondents for the purposes of diagnosis, selection, or placement. Regardless of the measurement model used, whether classical test theory or item response theory, the accuracy of these measurements can be improved (i.e., the amount of measurement error can be reduced) by increasing the number of items given to the individual. Thus, it is common to see achievement tests designed to provide information on individual students that contain more than 70 items. For estimating the distribution of proficiencies in large populations, however, more efficient estimates can be obtained from a matrix sampling design like that used in PIRLS. This design solicits relatively few responses from each sampled respondent while maintaining a wide range of content representation when responses are aggregated across all respondents. With this approach, however, the advantage of estimating population characteristics is offset by the inability to make precise statements about individuals: the uncertainty associated with individual estimates becomes too large to be ignored, and aggregations of individual student scores can lead to seriously biased estimates of population characteristics.
Plausible values methodology is a way to address this issue by using all available data to estimate directly the characteristics of student populations and subpopulations and then to generate multiple imputed scores (plausible values) from these distributions, which can be used in analyses with standard statistical software. For PIRLS, plausible values are estimated to characterize students participating in the assessment, given their background characteristics.
As mentioned, plausible values are imputed values and are not test scores for individuals in the usual sense. In fact, they are biased estimates of the proficiencies of individual students. Plausible values do, however, provide unbiased estimates of population characteristics (e.g., means and variances of demographic subgroups), and represent what the performance of an individual on the entire assessment might have been, had it been observed. Plausible values are estimated as random draws (usually five) from an empirically derived distribution of score values based on the student's observed responses to assessment items and on background variables. Each random draw from the distribution is considered a representative value from the distribution of potential scale scores for all students in the sample who have similar characteristics and identical patterns of item responses. Differences between plausible values drawn for a single individual quantify the degree of error (the width of the spread) in the underlying distribution of possible scale scores that could have caused the observed performances.
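In practice, an analyst computes each statistic once per plausible value and then combines the results. The Python sketch below illustrates this for a weighted mean, using the standard multiple-imputation combining rule for the variance between plausible values; the sampling variance component, which PIRLS estimates with jackknife replication, is omitted here for brevity, and the function names are illustrative.

```python
import statistics

def weighted_mean(values, weights):
    """Weighted mean of student-level values using overall sampling weights."""
    return sum(w * v for v, w in zip(values, weights)) / sum(weights)

def combine_plausible_values(pv_sets, weights):
    """Combine estimates across plausible values: average the per-PV estimates
    and compute the between-imputation variance component, (1 + 1/M) * var."""
    estimates = [weighted_mean(pvs, weights) for pvs in pv_sets]
    m = len(estimates)
    point_estimate = statistics.mean(estimates)
    imputation_variance = (1 + 1 / m) * statistics.variance(estimates)
    return point_estimate, imputation_variance
```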
There have been several important changes to the PIRLS assessment since 2001.
The next administration is scheduled for 2021. PIRLS 2021 will introduce digitalPIRLS, a new assessment that incorporates PIRLS and ePIRLS and will be administered through a completely computer-based delivery system.