Program for International Student Assessment (PISA)



5. Data Quality and Comparability

A comprehensive program of continuous quality monitoring was central to ensuring full, valid implementation of the PISA procedures and the recording of deviations from these procedures. Quality monitors from the PISA Consortium visited a sample of schools in every jurisdiction to ensure that testing procedures were carried out in a consistent manner. The purpose of quality monitoring was to observe and record the implementation of the prescribed procedures; the field operations manuals therefore provided the foundation for all quality monitoring procedures.

The manuals that formed the basis for the quality monitoring procedures were the PISA Consortium data collection manual and the PISA data management manual. In addition, the PISA data were verified at several points starting at the time of data entry.

Despite the efforts taken to minimize error, as with any study, PISA has limitations that researchers should take into consideration. This section contains a discussion of two possible sources of error in PISA: sampling and nonsampling errors.

Sampling Error

Sampling errors occur when a discrepancy between a population characteristic and the sample estimate arises because not all members of the target population are sampled for the survey. The size of the sample relative to the population and the variability of the population characteristics both influence the magnitude of sampling error. The particular sample of 15-year-old students from the 2017–18 school year was just one of many possible samples that could have been selected. Therefore, estimates produced from the PISA 2018 sample may differ from estimates that would have been produced had another sample of students been selected. This type of variability is called sampling error because it arises from using a sample of 15-year-old students rather than all 15-year-old students in that year.

The standard error is a measure of the variability owing to sampling when estimating a statistic. The approach used for calculating sampling variances in PISA is Fay's method of balanced repeated replication (BRR). This method of producing standard errors uses information about the sample design to produce more accurate standard errors than would be produced using simple random sample (SRS) assumptions for non-SRS data. Thus, the standard errors reported in PISA can be used as a measure of the precision expected from this particular sample.
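To make the calculation concrete, the sketch below shows how a Fay-adjusted BRR standard error can be computed for a weighted mean. It assumes a data layout like that of the PISA data files, with one final student weight and a set of replicate weights (80 in PISA 2018, with a Fay factor of 0.5); the argument names are illustrative rather than the official file variable names.

```python
import numpy as np

def fay_brr_se(values, full_weight, replicate_weights, fay_k=0.5):
    """Standard error of a weighted mean under Fay's method of BRR.

    values            -- outcome for each student
    full_weight       -- final student weight (one per student)
    replicate_weights -- 2-D array with one column per replicate weight
                         (PISA 2018 provides 80 replicate weights)
    fay_k             -- Fay factor; PISA uses 0.5
    """
    values = np.asarray(values, dtype=float)
    full_weight = np.asarray(full_weight, dtype=float)
    replicate_weights = np.asarray(replicate_weights, dtype=float)

    # Full-sample estimate of the statistic (a weighted mean here).
    theta = np.average(values, weights=full_weight)

    # One re-estimate of the statistic per replicate weight.
    replicates = np.array([
        np.average(values, weights=replicate_weights[:, g])
        for g in range(replicate_weights.shape[1])
    ])

    # Fay's BRR variance: (1 / (G * (1 - k)^2)) * sum((theta_g - theta)^2)
    g_count = replicate_weights.shape[1]
    variance = ((replicates - theta) ** 2).sum() / (g_count * (1.0 - fay_k) ** 2)
    return theta, float(np.sqrt(variance))
```

With 80 replicates and a Fay factor of 0.5, the scaling term G(1 - k)^2 works out to 20, so the variance is one-twentieth of the summed squared deviations of the replicate estimates from the full-sample estimate.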


Nonsampling Error

Nonsampling error is a term used to describe variations in the estimates that may be caused by population coverage limitations, nonresponse bias, and measurement error, as well as data collection, processing, and reporting procedures. For example, the sampling frame in the United States was limited to regular public and private schools in the 50 states and the District of Columbia and cannot be used to represent Puerto Rico or other jurisdictions (e.g., other U.S. territories and DoD schools overseas). The sources of nonsampling errors are typically problems such as unit and item nonresponse, the differences in respondents' interpretations of the meaning of survey questions, response differences related to the particular time the survey was conducted, and mistakes in data preparation.

In general, it is difficult to identify and estimate either the amount of nonsampling error or how much bias it causes. In PISA 2015, efforts were made to prevent such errors from occurring and to compensate for them when possible. For example, the design phase entailed a field test that evaluated items as well as the implementation procedures for the survey. One type of nonsampling error that may be present in PISA is respondent bias, which occurs when respondents systematically misreport (intentionally or not) information in a study; a potential source of respondent bias in this survey was social desirability bias. For example, students may overstate their parents' educational attainment or occupational status. If there were no systematic differences among specific groups under study in their tendency to give socially desirable responses, then comparisons of the different groups would accurately reflect differences among groups. Readers should be aware that respondent bias may be present in this survey as in any survey; however, it is not possible to state precisely how such bias may affect the results.

Coverage error. Every National Project Manager (NPM) was required to define and describe their jurisdiction's national desired target population and explain how and why it might deviate from the international target population. Any difficulties in achieving complete coverage were specified, discussed, and approved (or not) in advance. Where the national desired target population deviated from full national coverage of all eligible students, the deviations were described, and enrollment data were provided to measure the resulting reduction in coverage. School-level and within-school exclusions from the national desired target population resulted in a national defined target population corresponding to the population of students recorded in each jurisdiction's school sampling frame.

In PISA 2012, the United States reported that 95 percent coverage of the national desired target population was achieved. For PISA 2015, the United States reported 83.5 percent coverage of the 15-year-old population and 96.7 percent coverage of the national desired target population. With a 3.3 percent overall exclusion rate, the United States was below the internationally acceptable exclusion rate of 5 percent. In PISA 2018, the United States reported 86.1 percent coverage of the 15-year-old population and 96.2 percent coverage of the national desired target population, with a 3.8 percent overall exclusion rate.


Nonresponse error. Nonresponse error results from the nonparticipation of schools and students. School nonresponse, without replacement schools, will lead to the underrepresentation of students from the type of school that did not participate, unless weighting adjustments are made. It is also possible that only part of the eligible population in a school (such as 15-year-olds in a single grade) was represented by the school's student sample; this also requires weighting to compensate for the missing data from the omitted grades. Student nonresponse within participating schools occurred to varying extents. Students who could not be given achievement test scores but were not excluded for linguistic or disability reasons will be underrepresented in the data unless weighting adjustments are made.

Unit nonresponse. Of the 257 original sampled schools in the PISA 2018 United States national sample, 162 agreed to participate. The weighted school response rate before replacement was 65 percent for the United States, below the 85 percent standard, which required NCES to conduct a nonresponse bias analysis; that analysis was used by the PISA Consortium and the OECD to evaluate the quality of the final United States sample.

Table PISA-1. U.S. weighted school and student response rates: PISA 2018

Response rate                      Weighted response rate (percent)
School, before replacement                                     65.0
School, after replacement                                      76.4
Student                                                        84.8

SOURCE: Organization for Economic Cooperation and Development (OECD), Program for International Student Assessment (PISA), 2018.

A total of 162 schools participated in the PISA 2018 national administration: 136 participating schools sampled as part of the original sample and 26 schools sampled as replacements for nonparticipating original schools. The overall weighted school response rate after replacement was 76.4 percent. For the United States as a whole, the weighted student response rate was 84.8 percent and the student exclusion rate was 3.8 percent.
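As an illustration, the weighted response rates in Table PISA-1 are, in essence, base-weighted participation shares: the weights of the eligible sampled schools form the denominator, and the weights of the participating schools form the numerator. The sketch below is a simplified version of this calculation (PISA's actual computation also factors in school enrollment, and the after-replacement rate credits a participating substitute toward the original school it replaced); the record layout here is hypothetical.

```python
def weighted_response_rate(schools):
    """Weighted school response rate: the base-weighted share of
    eligible sampled schools that participated.

    `schools` is a list of dicts with keys 'base_weight' and
    'participated' -- a simplified stand-in for the sampling-frame
    records used in the actual PISA computation.
    """
    eligible = sum(s["base_weight"] for s in schools)
    responding = sum(s["base_weight"] for s in schools if s["participated"])
    return 100.0 * responding / eligible

# Toy illustration (numbers are made up, not PISA data):
sample = [
    {"base_weight": 120.0, "participated": True},
    {"base_weight": 80.0,  "participated": False},
    {"base_weight": 100.0, "participated": True},
]
print(f"{weighted_response_rate(sample):.1f} percent")  # 73.3 percent
```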

For PISA 2015, a bias analysis was conducted in the United States to address potential problems in the data owing to school nonresponse. To compare PISA participating schools with the total eligible sample of schools, the sample of schools was matched to the sampling frame to identify as many characteristics as possible that might provide information about the presence of nonresponse bias. Frame characteristics were taken from the 2012–13 Common Core of Data for public schools and from the 2011–12 Private School Universe Survey for private schools. The available school characteristics included affiliation (public or private), locale (central city, suburb, town, rural), Census region, number of age-eligible students, total number of students, and percentage of various racial/ethnic groups (White, non-Hispanic; Black, non-Hispanic; Hispanic; Asian; American Indian or Alaska Native; Hawaiian/Pacific Islander; and two or more races). The percentage of students eligible for free or reduced-price lunch was available for public schools only.

For the United States original sample schools, schools in the Northeast were underrepresented among participating schools relative to eligible schools (12.6 vs. 17.1 percent, respectively), while schools in the South were overrepresented (43.3 vs. 37.8 percent, respectively). Participating schools had a lower mean percentage of White, non-Hispanic students than the eligible sample (49.1 vs. 53.1 percent, respectively) and a higher mean percentage of Hispanic students (27.4 vs. 24.6 percent, respectively). Additionally, the absolute value of the relative bias for private schools and schools in towns was greater than 10 percent, which indicates potential bias even though no statistically significant relationship was detected. When all factors were considered simultaneously in a logistic regression analysis, none of the parameter estimates were significant predictors of participation. The percentage of students eligible for free or reduced-price lunch was not included in the logistic regression analysis because public and private schools were modeled together using only the variables available for all schools.
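The relative bias statistic cited in these analyses is the difference between the participating-sample mean and the eligible-sample mean of a frame characteristic, expressed as a percentage of the eligible-sample mean, with values above 10 percent in absolute value flagged as potential bias. A minimal sketch, using the Northeast figures quoted above:

```python
def relative_bias(eligible_mean, participating_mean):
    """Relative bias of the participating sample for one frame
    characteristic, as a percentage of the eligible-sample mean.
    Values with absolute value above 10 percent are flagged.
    """
    return 100.0 * (participating_mean - eligible_mean) / eligible_mean

# Northeast share of schools: participating 12.6 percent vs. eligible 17.1 percent.
print(f"{relative_bias(17.1, 12.6):.1f} percent")  # about -26.3 percent
```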

For the United States final sample schools (with substitutes), there were no statistically significant relationships between participation status and any of the characteristics studied. However, the absolute value of the relative bias for private schools, schools in towns, and schools in the Northeast region was greater than 10 percent, which indicates potential bias even though no statistically significant relationships were detected. When all factors were considered simultaneously in a logistic regression analysis (again with free or reduced-price lunch eligibility omitted), no variables were statistically significant predictors of participation.

For the United States final sample schools (with substitutes), when school nonresponse adjusted weights were used for the participating schools, there were no statistically significant relationships between participation status and any of the characteristics studied. We therefore conclude that there is little evidence of potential bias in the final sample. The multivariate regression analysis cannot be conducted after the school nonresponse adjustments are applied to the weights: nonresponse adjusted weights are not defined for nonresponding units, so an analysis comparing respondents with nonrespondents using those weights is not possible.

In sum, the investigation into nonresponse bias at the school level in the United States in PISA 2015 provides evidence that there is little potential for nonresponse bias in the PISA participating sample based on the characteristics studied. It also suggests that the use of substitute schools substantially reduced the potential for bias. Moreover, after the application of school nonresponse adjustments, there is no evidence of resulting potential bias in the final sample.

For PISA 2018, nonresponse bias analyses were again conducted at the school level in the U.S. sample because the weighted school response rate was below 85 percent. The general approach involved an analysis in three parts:

1. Analysis of the participating original sample: the distribution of the participating original school sample was compared with that of the total eligible original school sample.

2. Analysis of the participating final school sample with substitutes: the distribution of the participating final school sample, which included participating substitutes used as replacements for nonresponding schools from the eligible original sample, was compared with that of the total eligible final school sample.

3. Analysis of the nonresponse adjusted final sample with substitutes: the same sets of schools were compared as in the second analysis, but when analyzing the participating final schools alone, school nonresponse adjustments were applied to the size-adjusted school base weights. The international weighting procedures form nonresponse adjustment classes by cross-classifying the explicit and implicit stratification variables (a sketch of this adjustment appears below). The eligible sample was again weighted by its size-adjusted school base weights.
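The following is a minimal sketch of such a school nonresponse adjustment, under the assumption that each sampled school carries its stratum identifiers, a size-adjusted base weight, and a participation flag (the field names are illustrative): classes are formed by cross-classifying the stratification variables, and within each class the participating schools' weights are inflated by the inverse of the class's weighted response rate.

```python
from collections import defaultdict

def nonresponse_adjusted_weights(schools):
    """Sketch of a school nonresponse adjustment.

    Each school dict carries 'explicit_stratum', 'implicit_stratum',
    'base_weight' (size-adjusted), and 'participated'. Adjustment
    classes are formed by cross-classifying the stratification
    variables; within each class, respondents' weights are scaled up
    so they also carry the weight of the class's nonrespondents.
    """
    eligible = defaultdict(float)
    responding = defaultdict(float)
    for s in schools:
        cell = (s["explicit_stratum"], s["implicit_stratum"])
        eligible[cell] += s["base_weight"]
        if s["participated"]:
            responding[cell] += s["base_weight"]

    adjusted = []
    for s in schools:
        if not s["participated"]:
            continue  # nonresponding schools receive no adjusted weight
        cell = (s["explicit_stratum"], s["implicit_stratum"])
        factor = eligible[cell] / responding[cell]
        adjusted.append({**s, "adjusted_weight": s["base_weight"] * factor})
    return adjusted
```

Because nonrespondents receive no adjusted weight, no respondent-versus-nonrespondent comparison can be run on the adjusted weights, which is why the multivariate analysis described earlier cannot be repeated at this stage.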

In addition to these tests, logistic regression models were used to provide a multivariate analysis that examined the conditional independence of these school characteristics as predictors of participation. The logistic regression compared frame characteristics for participating schools with those for nonparticipating schools. Multivariate analysis can provide additional insights beyond those gained through the bivariate analysis. It may be that only one or two variables were actually related to participation status; however, if those variables were also related to the other variables examined, then variables that were not themselves related to participation status could appear significant in simple bivariate tables. Multivariate analysis, in contrast, examined the conditional relationships with participation after controlling for the other predictor variables, thereby testing the robustness of the relationships between school characteristics and participation.
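The sketch below illustrates this kind of participation model using statsmodels. The data frame and its column names are fabricated stand-ins for the school sampling frame, not the actual PISA frame variables; only the modeling pattern is the point.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical frame of sampled schools with a 0/1 participation flag
# and a few illustrative frame characteristics.
rng = np.random.default_rng(0)
n = 250
frame = pd.DataFrame({
    "participated": rng.integers(0, 2, n),
    "private": rng.integers(0, 2, n),
    "pct_hispanic": rng.uniform(0, 60, n),
    "total_enrollment": rng.uniform(100, 3000, n),
})

# Fit a logistic regression of participation on the frame characteristics.
predictors = ["private", "pct_hispanic", "total_enrollment"]
X = sm.add_constant(frame[predictors])
model = sm.Logit(frame["participated"], X).fit(disp=False)

# The coefficient tests indicate whether each characteristic predicts
# participation after controlling for the others.
print(model.summary())
```

The tests on the fitted coefficients play the role described above: a characteristic that looks significant in a bivariate table may drop out once the other predictors are controlled for.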

For original sample schools (not including substitute schools), nine variables were found to be statistically significantly related to participation in the bivariate analysis: school control; Census region; poverty level; total and age-eligible enrollment; the percentages of White, non-Hispanic students, Black, non-Hispanic students, and Hispanic students; and the percentage of students eligible for free or reduced-price lunch. Additionally, the absolute value of the relative bias for small and large schools, American Indian or Alaska Native students, and Hawaiian/Pacific Islander students was greater than 10 percent, which indicated potential bias even though no statistically significant relationship was detected.

For the final sample of schools (with substitute schools) and with school nonresponse adjustments applied to the weights, no variables were found to be statistically significantly related to participation in the bivariate analysis. However, the absolute value of the relative bias for small schools and for the percentage of Hawaiian/Pacific Islander students was greater than 10 percent.

In sum, the investigation into nonresponse bias at the school level in the U.S. PISA 2018 data provides evidence that there is some potential for nonresponse bias in the participating original sample based on the characteristics studied. It also suggests that, while the use of substitute schools reduced the potential for bias somewhat, it did not reduce it substantially. However, after the application of school nonresponse adjustments, there is little evidence of potential bias in the available frame variables and correlated variables in the final sample.

Measurement error. Measurement error is introduced into a survey when its test instruments do not accurately measure the knowledge or aptitude they are intended to assess.


Data Comparability

A number of international comparative studies already exist to measure achievement in mathematics, science, and reading, including the Trends in International Mathematics and Science Study (TIMSS) and the Progress in International Reading Literacy Study (PIRLS). The Adult Literacy and Lifeskills Survey (ALL) was last conducted in 2003 and measured the literacy and numeracy skills of adults. A new study, the Program for the International Assessment of Adult Competencies (PIAAC), was administered in 2012 and 2014, and assessed the level and distribution of adult skills required for successful participation in the economy of participating jurisdictions. In addition, the United States has been conducting its own national surveys of student achievement for more than 35 years through the National Assessment of Educational Progress (NAEP). PISA differs from these studies in several ways.

Content. PISA is designed to measure literacy broadly, whereas studies such as TIMSS and NAEP have a stronger link to curricular frameworks and seek to measure students' mastery of specific knowledge, skills, and concepts. The content of PISA is drawn from broad content areas (e.g., space and shape in mathematics) in contrast to more specific curriculum-based content, such as geometry or algebra. For example, with regard to the reading assessment, PISA must contain passages applicable to a wide range of cultures and languages, making it unlikely that the passages will be intact, existing texts.


Tasks. PISA also differs from other assessments in that it emphasizes the application of reading, mathematics, and science literacy to everyday situations by asking students to perform tasks that involve interpretation of real-world materials as much as possible. A study comparing the PISA, NAEP, and TIMSS mathematics assessments found that the topics addressed by each assessment are similar, although PISA places greater emphasis on data analysis and less on algebra than does either NAEP or TIMSS. It is in how that content is presented, however, that PISA differs. PISA uses multiple-choice items less frequently than NAEP or TIMSS, and it contains a higher proportion of items reflecting moderate to high mathematical complexity than do those two assessments.

An earlier comparative analysis of the PISA, TIMSS, and NAEP mathematics and science assessments also found differences between PISA and the other two studies. In science, it found that more items in PISA built connections to practical situations and required students to demonstrate multistep reasoning, and fewer items used a multiple-choice format, than in NAEP or TIMSS. In mathematics, it found that more items in PISA than in NAEP or TIMSS were set in real-life situations or scenarios, required multistep reasoning, and required interpretation of figures and other graphical data. These tasks reflect the underlying assumption of PISA: as 15-year-olds begin to make the transition to adult life, they need to know not only how to read and use particular mathematical formulas and scientific concepts but also how to apply this knowledge and these skills in the many different situations they will encounter in their lives.

Age-based sample. In contrast with TIMSS and PIRLS, which are grade-based assessments, PISA's sample is based on age. TIMSS assesses fourth- and eighth-graders, while PIRLS assesses only fourth-graders. The PISA sample, however, is drawn from 15-year-old students, regardless of grade level. The goal of PISA is to represent outcomes of learning rather than outcomes of schooling. By placing the emphasis on age, PISA intends to show not only what 15-year-olds have learned in school in a particular grade but also what they have learned outside of school and over the years. PISA thus seeks to show the overall yield of an education system and the cumulative effects of all learning experience. Focusing on age 15 provides an opportunity to measure broad learning outcomes while all students are still required to be in school across the many participating jurisdictions. Finally, because years of education vary among jurisdictions, choosing an age-based sample makes comparisons across jurisdictions somewhat easier.
