This section describes features of the Program for International Student Assessment (PISA) 2018 methodology, including sample design, test design, and scoring, with a focus on U.S. implementation. For further details about the assessment and any of the topics discussed here, see the Organization for Economic Cooperation and Development's (OECD) PISA 2018 Technical Report.
The OECD required all participating education systems (countries and subnational regions) to adhere to the PISA 2018 technical standards (OECD 2015), which provided detailed information about the target population, sampling, response rates, translation and adaptation, assessment administration, and data submission. According to the standards, the international desired population in each education system consisted of 15-year-olds attending publicly and privately controlled schools in grade 7 and higher. To provide valid estimates of student achievement and characteristics, the sample of PISA students had to be selected in a way that represented the full population of 15-year-old students in each education system. The sample design for PISA 2018 was a stratified systematic sample, with sampling probabilities proportional to the estimated number of 15-year-old students in the school based on grade enrollments. Samples were drawn using a two-stage sampling process. The first stage was a sample of schools, and the second stage was a sample of students within schools. The PISA international contractors responsible for the design and implementation of PISA internationally (hereafter referred to as the PISA consortium) drew the sample of schools for each education system.
Sample Size. Each country that planned to administer computer-based assessments was required to assess a minimum of 42 students in each of at least 150 participating schools, for a total sample of at least 6,300 assessed students.1 Following the PISA consortium guidelines, replacement schools were identified at the time the PISA sample was selected by designating the two schools neighboring each sampled school in the frame as replacements. For countries administering financial literacy, an additional sample of students was selected. In the United States, up to 52 students were sampled within each school. Students were selected with equal probability unless fewer than 52 age-15 students were available, in which case all 15-year-old students were selected.
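The stratified systematic selection with probability proportional to size can be sketched as follows. This is a minimal illustration only: the function name and frame representation are hypothetical, and certainty schools (those larger than the sampling interval) are not handled.

```python
import random

def pps_systematic_sample(frame_sizes, n):
    """Systematic probability-proportional-to-size (PPS) selection.

    frame_sizes: estimated 15-year-old enrollment for each school on
    the sorted sampling frame. Returns indices of the n sampled
    schools. Certainty schools (size larger than the sampling
    interval) are not handled in this sketch.
    """
    total = sum(frame_sizes)
    step = total / n                    # sampling interval
    start = random.uniform(0, step)     # random start
    points = [start + i * step for i in range(n)]
    sampled, cum, j = [], 0.0, 0
    for p in points:
        # advance to the school whose cumulative size interval holds p
        while cum + frame_sizes[j] < p:
            cum += frame_sizes[j]
            j += 1
        sampled.append(j)
        # per the design, frame neighbors j - 1 and j + 1 would be
        # designated as this school's replacement schools
    return sampled
```

Because the frame is sorted by the implicit stratification variables before selection, the fixed-step traversal spreads the sample across those strata.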
Age Guidelines. Each education system collected its own data, following international guidelines and specifications. The technical standards required that students in the sample be 15 years and 3 months to 16 years and 2 months old at the beginning of the testing period (hereafter referred to as “15-year-olds” or “15-year-old students”). The testing period could be no longer than eight consecutive weeks for computer-based testing participants and no longer than six consecutive weeks for paper-based testing participants. Most education systems conducted testing from March through August 2018.2
Response Rates. International guidelines were set for both school-level and student-level response rates.
Exclusion Rate. PISA 2018 was designed to be as inclusive as possible. The guidelines allowed schools to be excluded for approved reasons (for example, schools in remote regions, very small schools, or special education-only schools). The following international guidelines governed student exclusions:
Students with functional disabilities. These are students with a moderate to severe permanent physical disability such that they cannot perform in the PISA testing environment.
Students with intellectual disabilities. These are students with a cognitive, behavioral, or emotional disability, confirmed by qualified staff, such that they cannot take the PISA test; they are unable to follow even the general instructions of the assessment.
Students with insufficient language experience. These are students who meet all three of the following criteria: they are not native speakers of the assessment language, they have limited proficiency in the assessment language, and they have received less than 1 year of instruction in the assessment language.
Students could also be excluded if no materials were available in the language in which they were taught, or if they could not be assessed for some other agreed-upon reason.
Overall estimated exclusions (including both school and student exclusions) were to be under 5 percent of the PISA target population. To keep PISA as inclusive as possible and to keep the exclusion rate down, the United States used the UH (‘Une Heure’) instrument designed for students with special education needs. See the description of the UH instrument in the next section.
1 Nine countries – Argentina, Jordan, Lebanon, the Republic of Moldova, the Republic of North Macedonia, Romania, Saudi Arabia, Ukraine and Vietnam – assessed their students’ knowledge and skills in PISA 2018 using paper-based instruments. These countries needed to have a minimum of 35 assessed students in 150 schools for a total of 5,250 assessed students in the PISA sample.
2 The United States and the United Kingdom were given permission to move the testing dates to October through November in an effort to improve response rates. The range of eligible birth dates was adjusted so that the mean age remained the same (i.e., 15 years and 3 months to 16 years and 2 months at the beginning of the testing period). In 2003, the United States conducted PISA in the spring and fall and found no significant difference in student performance between the two time points. The United States has collected data in the fall in every PISA cycle since 2003.
The PISA 2018 school sample was drawn for the United States by the PISA consortium. The U.S. PISA sample was stratified into eight explicit strata based on region of the country (Northeast, Central, West, Southeast)3 and control of school (public or private). Within each stratum, the frame was sorted for sampling by five categorical stratification variables: grade range of the school (five categories); type of location relative to populous areas (city, suburb, town, rural);4 combined percentage of Black, Hispanic, Asian, Native Hawaiian/Pacific Islander, and American Indian/Alaska Native students (above or below 15 percent); gender (mostly female (percent female ≥ 95 percent), mostly male (percent female < 5 percent), and other); and state.
The United States took part in the core PISA assessment for reading, math, and science literacy as well as the optional domain of financial literacy. To obtain an adequate sample of students in the United States that also took into consideration historical rates of nonresponse, 52 students aged 15 were randomly sampled within each school. If fewer than 52 age-eligible students were enrolled in a school, all 15-year-old students in the school were selected. Thus, in each school, each age-eligible student had an equal probability of being selected. In order to be eligible for PISA, students had to be born between July 1, 2002, and June 30, 2003.
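The within-school selection rule described above can be sketched as follows; the function name and roster format are illustrative assumptions.

```python
import random
from datetime import date

TARGET = 52  # U.S. within-school student sample size

def sample_students(roster, seed=None):
    """roster: list of (student_id, birth_date) pairs for one school.

    Keeps students born July 1, 2002 through June 30, 2003 (the PISA
    2018 U.S. eligibility window), then draws an equal-probability
    sample of up to 52; if 52 or fewer are eligible, all are taken.
    """
    eligible = [sid for sid, bd in roster
                if date(2002, 7, 1) <= bd <= date(2003, 6, 30)]
    if len(eligible) <= TARGET:
        return eligible                      # take all 15-year-olds
    return random.Random(seed).sample(eligible, TARGET)
```

A simple random sample of a fixed size within each school gives every age-eligible student in that school the same selection probability, as the design requires.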
In the United States, of the 52 students who were randomly sampled within each school, 41 students took the mathematics, science, and reading literacy assessments and 11 students took the optional financial literacy assessment. The group of students who took the financial literacy assessment is referred to as the “financial literacy sample.” Note that this was different from the approach used in the 2015 cycle, when financial literacy was administered to a subset of the students in the main PISA sample. As in past rounds of PISA, the United States planned to assess schools within the maximum testing period of eight consecutive weeks, from October to November 2018. However, because additional participating schools were needed, and with approval from the PISA international contractors, the United States extended data collection into December 2018.
The U.S. PISA 2018 national school sample consisted of 257 schools. This number represents an increase from the international minimum requirement of 150 and was implemented to offset anticipated school nonresponse and reduce design effects. Schools were selected with probability proportionate to the school’s estimated enrollment of 15-year-olds. The data for public schools were from the 2015–16 Common Core of Data (CCD) and the data for private schools were from the 2015–16 Private School Universe Survey (PSS). Any school containing at least one of grades 7 through 12 was included in the school sampling frame. Participating schools provided a list of 15-year-old students (typically in August or September 2018) from which the student sample was drawn using sampling software provided by the international contractor.
In the United States, 4,811 15-year-old students took part in the core PISA 2018 assessment and 1,520 15-year-old students took part in the financial literacy assessment.
In addition to the international response rate standards described in the prior section, the U.S. sample had to meet the statistical standards of the National Center for Education Statistics (NCES) of the U.S. Department of Education. For an assessment like PISA, NCES requires that a nonresponse bias analysis be conducted when the response rate for schools falls below 85 percent or the response rate for students falls below 85 percent.
In order to keep PISA as inclusive as possible and to keep the exclusion rate down, the United States used the UH (‘Une Heure’) instrument designed for students with special education needs. The UH instrument was available to special education needs students within mainstream schools and contained about half as many items as the regular test instrument. These testing items were deemed more suitable for students with special education needs. A UH student questionnaire was also administered, which only contained trend items from the regular student questionnaire. The timing structure of both the UH test instrument and UH student questionnaire allowed more time per question than the regular instruments and UH sessions were generally held in small groups.
3 The Northeast region consists of Connecticut, Delaware, the District of Columbia, Maine, Maryland, Massachusetts, New Hampshire, New Jersey, New York, Pennsylvania, Rhode Island, and Vermont. The Central region consists of Illinois, Indiana, Iowa, Kansas, Michigan, Minnesota, Missouri, Nebraska, North Dakota, Ohio, Wisconsin, and South Dakota. The West region consists of Alaska, Arizona, California, Colorado, Hawaii, Idaho, Montana, Nevada, New Mexico, Oklahoma, Oregon, Texas, Utah, Washington, and Wyoming. The Southeast region consists of Alabama, Arkansas, Florida, Georgia, Kentucky, Louisiana, Mississippi, North Carolina, South Carolina, Tennessee, Virginia, and West Virginia.
4 These types are defined as follows: (1) “city” is a territory inside an urbanized area with a core population of 50,000 or more and inside a principal city; (2) “suburb” is a territory inside an urbanized area with a core population of 50,000 or more and outside a principal city; (3) “town” is a territory inside an urban cluster that is greater than 10 miles and fewer than or equal to 35 miles from an urbanized area ; and (4) “rural” is Census-defined rural territory that is fewer than or equal to 5 miles from an urbanized area, as well as fewer than or equal to 2.5 miles from an urban cluster.
The 2018 assessment instruments were developed by international experts and PISA consortium test developers and included items submitted by participating education systems. In 2018, the major focus of PISA was on reading literacy, with mathematics and science literacy treated as minor domains. Financial literacy was an optional domain administered by 21 education systems including the United States.
All mathematics and science items in the 2018 assessment instrument were trend items from previous assessments. Reading literacy and financial literacy included both trend items and new items developed for 2018.5 Items were reviewed by representatives of each country and the PISA subject-matter expert groups for possible bias and relevance to PISA’s goals. To further examine potential biases and design issues in the PISA assessment, all participating education systems field-tested the assessment items in spring 2017. After the field trial, items that did not meet the established measurement criteria or were otherwise found to include intrinsic biases were dropped for the main assessment.
For the 2018 cycle, the number of assessment items by subject is as shown in table A-1:
|Table A-1. Number of new and trend items in PISA 2018, by domain|
|NOTE: The number of new and trend items shown in this table reflects the design for the computer-based PISA assessment only.|
|SOURCE: Organization for Economic Cooperation and Development (OECD), Program for International Student Assessment (PISA), 2018.|
PISA Test Design. To provide the most comprehensive measure of reading literacy, PISA would need to present each student with the complete set of test items; doing so would eliminate any gaps or biases in the assessment, but it would produce a test taking more than six hours to complete.
To make it feasible to measure student proficiency in all domains, the test material in all PISA cycles up to and including PISA 2018 was divided into several 30-minute clusters. There were six 30-minute trend clusters of test material each for mathematics and science. A multi-stage adaptive design was adopted for the reading assessment in PISA 2018 (more details on the PISA multi-stage adaptive test in reading literacy are provided below). Material equivalent to fifteen 30-minute clusters (5 trend clusters and approximately 10 new clusters), organized into units rather than clusters, was used for the adaptive reading design.
These clusters were linked across domains and organized into test forms, which were then randomly allocated to students. Students received two 30-minute clusters of test material in the major domain along with two clusters of test material in one or two of the other domains. Each student saw only a small subset of the test material and was thus assessed on only a selection of the skills and competencies that comprise each domain. Nonetheless, students in an education system, when taken as a group, were examined on the complete set of skills.
For countries like the United States that took part in the core CBA assessment and the optional financial literacy assessment but did not opt to take part in the optional global competency domain, a total of 36 CBA test forms were assembled. In total, these 36 forms included the following clusters of test items: six 30-minute clusters from each of the trend domains of mathematics and science literacy, two 30-minute clusters of financial literacy items, and fifteen 30-minute reading units (equivalent to 5 trend clusters and approximately 10 new clusters). Because reading literacy was the major domain for 2018, reading tasks were included in all test forms and paired with one or two of the other minor domains (i.e., mathematics, science, or financial literacy6), and each of the different combinations of domains was balanced in terms of position. Each student took two hours of testing: all students took 60 minutes of reading tasks plus another 60 minutes of tasks (two clusters) from one or more of the other domains (i.e., science, mathematics, or financial literacy).
The students who were in the separate financial literacy sample included two sets of students:
Note that students in the second group appear in both the main core sample and the financial literacy sample (though they receive different weights in the financial literacy database than in the main core database).
The PISA test design reflects random assignment of a form within each school following a specific pre-assigned probability distribution. In the United States, 92 percent of students received forms numbered 1–24 of the 36 forms, while 8 percent of students received forms numbered 25–36. These percentages are based on random assignment of test forms to students across schools; every student had a nonzero probability of receiving any of the forms. The combinations of test material within forms numbered 1–24 were i) reading and science literacy and ii) reading and mathematical literacy. These forms were sampled at a higher rate and provided the necessary covariance information between reading literacy and each of the two minor domains. Forms numbered 25–36 provided tri-variate information about the three domains, each including two reading clusters, one mathematics cluster, and one science cluster. These forms were sampled at a lower rate, so that only 8 percent of students received one of them; those students took one hour of reading literacy plus one 30-minute cluster of items from each of the other two minor domains.
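The 92/8 form-allocation split can be simulated as follows; a uniform choice within each form group is an assumption made for illustration, since the source describes only the split between groups.

```python
import random

def assign_form(rng=random):
    """Assign a CBA form number following the U.S. PISA 2018 split:
    92 percent of students receive forms 1-24, 8 percent receive
    forms 25-36. Uniform choice within each group is an assumption
    made for illustration.
    """
    if rng.random() < 0.92:
        return rng.randint(1, 24)   # reading + one minor domain
    return rng.randint(25, 36)      # reading + mathematics + science
```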
PISA Multi-stage Adaptive test in Reading Literacy. Despite the randomization procedure used in the design of the PISA test, one source of inaccuracy remains. Most students in OECD countries score near the middle of the scale, at around 500 points. Most of the test material is also targeted to middle-performing students, which allows for more refined differentiation of student ability at this level. However, this means that there is a relative lack of test material at the higher and lower ends of student ability, and that the scores of both high- and low-performing students are determined with less accuracy than the scores of middle-performing students.
In order to increase the accuracy of such measurements, PISA 2018 introduced adaptive testing in its reading assessment. Instead of using fixed, predetermined test booklets as was done through PISA 2015, the reading assessment given to each student was dynamically determined, based on how the student performed in prior stages.
There were three stages to the PISA 2018 reading assessment: Core, Stage 1 and Stage 2. Students first saw a short Core stage, which consisted of between 7 and 10 items. The vast majority of these items (at least 80 percent and always at least 7 items) were automatically scored. Students’ performance in this stage was provisionally classified as low, medium, or high, depending on the number of correct answers to these automatically scored items.
The various Core Blocks of material delivered to students did not differ in any meaningful way in their difficulty. Stage 1 and 2, however, both existed in two different forms: comparatively easy and comparatively difficult. Students who displayed medium performance in the Core stage were equally likely to be assigned an easy or a difficult Stage 1. Students who displayed low performance in the Core stage had a 90 percent chance of being assigned to an easy Stage 1 and a 10 percent chance of being assigned to a difficult Stage 1. Students who displayed high performance in the Core stage had a 90 percent chance of being assigned to a difficult Stage 1 and a 10 percent chance of being assigned to an easy Stage 1.
Students were assigned to easy and difficult Stage 2 blocks of material in much the same way. In order to classify student performance as precisely as possible, however, responses to automatically scored items from both the Core stage and Stage 1 were used.
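The Core-to-Stage-1 routing described above can be sketched as follows. The 50/50 and 90/10 routing probabilities follow the design; the cut points splitting low, medium, and high provisional performance are illustrative assumptions, as the source does not specify them.

```python
import random

def stage1_block(n_correct, n_items, rng=random):
    """Route a student to an easy or difficult Stage 1 reading block
    based on provisional Core-stage performance (number of correct
    answers to the automatically scored items)."""
    frac = n_correct / n_items
    if frac < 1 / 3:          # provisionally low (illustrative cut)
        p_difficult = 0.10
    elif frac < 2 / 3:        # provisionally medium (illustrative cut)
        p_difficult = 0.50
    else:                     # provisionally high (illustrative cut)
        p_difficult = 0.90
    return "difficult" if rng.random() < p_difficult else "easy"
```

Stage 2 assignment worked the same way, except that the classification drew on automatically scored responses from both the Core stage and Stage 1.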
As with many of the new features in the reading framework, adaptive testing was made possible through the use of computers. One potential drawback of an adaptive design is that students are unable to return to a question after it has been answered or skipped. This was already the case in the PISA 2015 computer-based assessment. However, with adaptive testing, students’ responses in the Core stage and in Stage 1 affected not only their performance but also the questions that they saw later in the assessment. The PISA 2018 Technical Report presents further indicators of the impact of adaptive testing on students’ test-taking behavior.
Reading Fluency. In addition to the typical reading literacy items, the 2018 reading literacy instrument included a measure of reading fluency in the form of sentence processing. This measure required students to make a sensibility judgment about sentences of increasing complexity and was designed to provide additional information about the reading skills of students at the lower end of the proficiency range. Information from this task, combined with the typical reading literacy items, allows for a more thorough understanding of how students differ at various levels of the proficiency scale. In the main survey, there were 65 reading fluency sentences organized into 5 clusters of 11 sentences and 1 cluster of 10 sentences. Each student was assigned two fluency clusters, for a total of 21 or 22 sentences, right before the reading literacy clusters. These reading fluency tasks were administered within a 3-minute timed session; any sentences not completed within the session were skipped. Reading fluency items were considered in the computation of students’ overall score. However, these items were not included in the computation of subscale scores (neither the text-source subscale nor the reading-process subscale).
After the cognitive assessment, students also completed a 30-minute questionnaire designed to provide information about their backgrounds, attitudes, and experiences in school. Principals in schools where PISA was administered also completed a 45-minute questionnaire, administered online, designed to provide information on their school’s structure, resources, instruction, climate, and policies.
In addition, a sample of teachers within each school was selected to complete a 30-minute questionnaire, also administered online. The questionnaire was designed to provide information on teachers’ backgrounds, education and professional development, and teaching practices. Up to ten English/language arts teachers and fifteen non-English/language arts teachers eligible to teach the modal grade (10th grade in the United States) were sampled in each school. As in the test development for the main assessment, student, school, and teacher questionnaire items that did not meet the established measurement criteria or were otherwise found to include intrinsic biases in the field trial were dropped from the main assessment.
Translation and Adaptation
Source versions of all instruments (the assessment booklets, questionnaires, and operations manuals) were prepared in English and French and translated into the primary language or languages of instruction in each education system. The PISA consortium recommended a double translation design and provided precise translation guidelines that included a description of the features each item was measuring and statistical analysis from the field trial. This entailed having two independent translations, one from each of the source languages (English and French), and reconciliation by a third party. When double translation was not possible, single translation was accepted. In addition, the PISA consortium verified the instrument translation when more than 10 percent of an education system’s PISA population used a national language that was neither French nor English.
Instrument adaptation was necessary even in nations such as the United States that use English as the primary language of instruction. These adaptations were primarily for cultural purposes. For example, a word such as “lift” might be adapted to “elevator” for the United States. The PISA consortium verified and approved the national adaptation of all instruments, including that of the United States.
Test Administration and Quality Assurance
The PISA consortium emphasized the use of standardized procedures in all education systems. Each education system collected its own data, based on detailed manuals provided by the PISA consortium that explained the survey’s implementation, including precise instructions for the work of school coordinators and test administrators and scripts for test administrators to use in testing sessions. Test administration in the United States was conducted by professional staff trained in accordance with the international guidelines. Students could use calculators, and U.S. students were provided calculators.
In each education system, a PISA Quality Monitor (PQM) who was engaged independently by the PISA consortium observed test administrations in a subsample of participating schools. The schools in which the independent observations were conducted were selected jointly by the PISA consortium and the PQM. In the United States, there were five PQMs who observed 15 schools. The PQM’s primary responsibility was to document the extent to which testing procedures in schools were implemented in accordance with test administration procedures. The PQM’s observations in U.S. schools indicated that international procedures for data collection were applied consistently.
5 In the vast majority of participating countries, PISA 2018 was a computer-based assessment. However, nine countries – Argentina, Jordan, Lebanon, the Republic of Moldova, the Republic of North Macedonia, Romania, Saudi Arabia, Ukraine and Vietnam – assessed their students’ knowledge and skills in PISA 2018 using paper-based instruments. These paper-based tests were offered to countries who were not ready, or did not have the resources, to transition to a computer-based assessment. The paper-based tests comprise a subset of the tasks included in the computer-based version of the tests, all of which were developed in earlier cycles of PISA. No tasks that were newly developed for PISA 2015 or PISA 2018 were included in the paper-based instruments; consequently, the new aspects of the science and reading frameworks are not reflected in the paper-based tests.
6 The PISA financial literacy assessment was an optional component. In 2018, the United States administered financial literacy along with the three core PISA subjects of reading, mathematics, and science literacy. Financial literacy items were rotated among the forms taken by a subsample of U.S. students in the same session as the reading, mathematics, and science items.
The use of sampling weights is necessary for computing statistically sound, nationally representative estimates. Survey weights adjust for the probabilities of selection of individual schools and students, for school and student nonresponse, and for errors in estimating the size of the school or the number of 15-year-olds in the school at the time of sampling. Survey weighting for all education systems participating in PISA 2018 was coordinated by Westat, as part of the international PISA consortium.
The school base weight was defined as the reciprocal of the school's probability of selection multiplied by the number of eligible students in the school. (For replacement schools, the school base weight was set equal to the original school it replaced.) The student base weight was given as the reciprocal of the probability of selection for each selected student from within a school.
The product of these base weights was then adjusted for school and student nonresponse. The school nonresponse adjustment was done individually for each education system by cross-classifying the explicit and implicit stratification variables defined as part of the sample design.
The student nonresponse adjustment was done within cells based first on their school nonresponse cell and their explicit stratum; within that, grade and gender were used when possible. All PISA analyses were conducted using these adjusted sampling weights. For more information on the nonresponse adjustments, see OECD's PISA 2018 Technical Report.
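The weighting steps above can be summarized in a simplified sketch; the function names are hypothetical, and trimming and enrollment-error adjustments are omitted.

```python
def nonresponse_adjustment(weights_eligible, weights_respondents):
    """Within one weighting cell: ratio of the summed weights of all
    eligible units to the summed weights of responding units."""
    return sum(weights_eligible) / sum(weights_respondents)

def student_final_weight(p_school, p_student,
                         school_nr_adj, student_nr_adj):
    """Product of the school base weight (reciprocal of the school's
    selection probability), the student base weight (reciprocal of the
    within-school selection probability), and the school- and
    student-level nonresponse adjustment factors."""
    return (1.0 / p_school) * (1.0 / p_student) \
        * school_nr_adj * student_nr_adj
```

For example, a school selected with probability 0.1 and a student selected within it with probability 0.5 yield a base weight of 20 before nonresponse adjustment.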
Each test form had a different subset of items. Because each student completed only a subset of all possible items, classical test scores, such as the percentage correct, are not accurate measures of student performance. Instead, scaling techniques were used to establish a common scale for all students. For PISA 2018, item response theory (IRT) was used to estimate average scores for reading, science, and mathematics literacy for each education system, as well as for three reading process and three reading content subscales. For education systems participating in the financial literacy assessment, that assessment was scaled separately and assigned separate scores. IRT identifies patterns of response and uses statistical models to predict the probability of answering an item correctly as a function of the student’s proficiency in answering other questions. With this method, the performance of a sample of students in a subject area or subarea can be summarized on a simple scale or series of scales, even when students are administered different items.
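As an illustration of the kind of model IRT uses, the two-parameter logistic item response function below gives the probability of a correct answer as a function of student proficiency and item parameters. PISA 2018 operational scaling used related (partial-credit, multigroup) IRT models, so this is a generic sketch rather than the operational model.

```python
import math

def p_correct(theta, a, b):
    """Two-parameter logistic item response function: probability
    that a student with proficiency theta answers an item with
    discrimination a and difficulty b correctly."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))
```

When a student's proficiency equals the item's difficulty, the model gives a 50 percent chance of a correct answer; the probability rises with proficiency and falls with difficulty.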
Scores for students were estimated as plausible values because each student completed only a subset of items. Ten plausible values were estimated for each student for each scale. These values represent the distribution of potential scores for all students in the population with similar characteristics and identical patterns of item response. Statistics describing performance on the PISA reading, science, mathematics, and financial literacy scales are based on plausible values. In PISA, the reading, science, mathematics, and financial literacy scales range from 0 to 1,000.
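Working with plausible values means computing each statistic once per plausible value and then combining the results. A minimal sketch of the combination step (point estimate and the between-imputation variance component of the standard error, following Rubin's combination rules) is:

```python
import statistics

def combine_plausible_values(pv_estimates):
    """Combine a statistic computed separately on each of the ten
    plausible values: the final point estimate is the mean across
    plausible values, and the between-imputation variance feeds the
    imputation component of the standard error."""
    m = len(pv_estimates)
    point = statistics.mean(pv_estimates)
    between = statistics.variance(pv_estimates)  # between-imputation
    return point, (1 + 1 / m) * between
```

The full standard error also includes the sampling variance, which PISA estimates with replicate weights; that component is omitted here.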
In addition to using a range of scale scores as the basic form of measurement, PISA describes student proficiency in terms of levels of proficiency. Higher levels represent the knowledge, skills, and capabilities needed to perform tasks of increasing complexity. PISA results are reported in terms of percentages of the student population at each of the predefined levels.
To determine the performance levels and cut scores on the literacy scales, IRT techniques were used. With IRT techniques, it is possible to simultaneously estimate the ability of all students taking the PISA assessment, as well as the difficulty of all PISA items. Estimates of student ability and item difficulty can then be mapped on a single continuum. The relative ability of students taking a particular test can be estimated by considering the percentage of test items they get correct. The relative difficulty of items in a test can be estimated by considering the percentage of students getting each item correct. In PISA, all students within a level are expected to answer at least half of the items from that level correctly. Students at the bottom of a level are able to provide the correct answers to about 52 percent of all items from that level, have a 62 percent chance of success on the easiest items from that level, and have a 42 percent chance of success on the most difficult items from that level. Students in the middle of a level have a 62 percent chance of correctly answering items of average difficulty for that level (an overall response probability of 62 percent). Students at the top of a level are able to provide the correct answers to about 70 percent of all items from that level, have a 78 percent chance of success on the easiest items from that level, and have a 62 percent chance of success on the most difficult items from that level. Students just below the top of a level would score less than 50 percent on an assessment at the next higher level. Students at a particular level demonstrate not only the knowledge and skills associated with that level but also the proficiencies defined by lower levels. 
Patterns of responses for students in the proficiency levels labeled below level 1c for reading literacy, below level 1b for science literacy, and below level 1 for mathematics literacy and financial literacy suggest that these students are unable to answer at least half of the items from those levels correctly. For details about the approach to defining and describing the PISA proficiency levels and establishing the cut scores, see the OECD’s PISA 2018 Technical Report. Table A-2 shows the cut scores for each proficiency level for reading, science, and mathematics literacy.
|Table A-2. Cut scores for proficiency levels for reading, science, and mathematics literacy: 2018|
|Proficiency level||Reading||Science||Mathematics||Financial literacy|
|Level 1 (1c)||189.33 to less than 262.04||—||357.77 to less than 420.07||325.57 to less than 400.33|
|Level 1 (1b)||262.04 to less than 334.75||260.54 to less than 334.94||—||—|
|Level 1 (1a)||334.75 to less than 407.47||334.94 to less than 409.54||—||—|
|Level 2||407.47 to less than 480.18||409.54 to less than 484.14||420.07 to less than 482.38||400.33 to less than 475.10|
|Level 3||480.18 to less than 552.89||484.14 to less than 558.73||482.38 to less than 544.68||475.10 to less than 549.86|
|Level 4||552.89 to less than 625.61||558.73 to less than 633.33||544.68 to less than 606.99||549.86 to less than 624.63|
|Level 5||625.61 to less than 698.32||633.33 to less than 707.93||606.99 to less than 669.30||624.63 to less than 1000|
|Level 6||698.32 to less than 1000||707.93 to less than 1000||669.30 to less than 1000||—|
|— Not applicable.|
|NOTE: For reading literacy, proficiency level 1 is composed of three levels, 1a, 1b, and 1c. For science literacy, proficiency level 1 is composed of two levels, 1a and 1b. The score range for below level 1 refers to scores below level 1b. For mathematics and financial literacy, there is a single proficiency category at level 1.|
|SOURCE: Organization for Economic Cooperation and Development (OECD), Program for International Student Assessment (PISA), 2018.|
As with any study, there are limitations to PISA 2018 that should be taken into consideration. Estimates produced using data from PISA 2018 are subject to two types of error: nonsampling errors and sampling errors.
Nonsampling error is a term used to describe variations in the estimates that may be caused by population coverage limitations, nonresponse bias, and measurement error, as well as data collection, processing, and reporting procedures. For example, if the study were unsuccessful in getting permission from many rural schools in a certain region of the country, reports of means for rural schools in that region could be biased. Fortunately, such a coverage problem did not occur in PISA in the United States. The sources of nonsampling errors are typically problems such as unit and item nonresponse, differences in respondents’ interpretations of the meaning of survey questions, and mistakes in data preparation.
Sampling errors arise when a sample of the population, rather than the whole population, is used to estimate some statistic. Different samples from the same population would likely produce somewhat different estimates of the statistic in question. This means that there is a degree of uncertainty associated with statistics estimated from a sample. This uncertainty is referred to as sampling variance and is usually expressed as the standard error of a statistic estimated from sample data. The approach used for calculating standard errors in PISA is the Fay method of balanced repeated replication (BRR) (Judkins 1990). This method of producing standard errors uses information about the sample design to produce more accurate standard errors than would be produced using simple random sample assumptions.
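As a hedged sketch of Fay's method, the variance of an estimate is obtained by recomputing it with each set of replicate weights and averaging the squared deviations from the full-sample estimate, with a Fay adjustment. The replicate estimates below are illustrative values, not real PISA data; PISA uses 80 replicates with a Fay factor of k = 0.5.

```python
import math

def brr_standard_error(full_estimate, replicate_estimates, fay_k=0.5):
    """Fay's BRR standard error: sqrt( sum((t_r - t)^2) / (R * (1 - k)^2) ),
    where t is the full-sample estimate and t_r the replicate estimates."""
    r = len(replicate_estimates)
    ss = sum((t_r - full_estimate) ** 2 for t_r in replicate_estimates)
    return math.sqrt(ss / (r * (1.0 - fay_k) ** 2))

# Illustrative example with 4 replicates (PISA uses 80):
se = brr_standard_error(500.0, [498.0, 503.0, 499.0, 502.0])
```

Because the replicate weights encode the stratified, clustered sample design, the resulting standard errors reflect the design rather than simple random sample assumptions.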
Standard errors can be used as a measure of the precision expected from a particular sample. Standard errors for all statistics in this report are available in the downloadable Excel tables that accompany each online figure and table at http://nces.ed.gov/surveys/pisa/pisa2018.
Confidence intervals provide a way to make inferences about population statistics in a manner that reflects the sampling error associated with the statistic. Assuming a normal distribution and a 95 percent confidence interval, the population value of this statistic can be inferred to lie within the confidence interval in 95 out of 100 replications of the measurement on different samples drawn from the same population.
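Under the normal approximation described above, a 95 percent confidence interval is the estimate plus or minus 1.96 standard errors. A minimal sketch, using illustrative values rather than any particular reported statistic:

```python
# Sketch: 95 percent confidence interval under a normal approximation.
# The estimate and standard error are illustrative, not real PISA values.
estimate, se = 505.0, 3.57
z = 1.96  # two-sided 95 percent critical value of the standard normal
ci = (estimate - z * se, estimate + z * se)
```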
In this report, PISA 2018 results are provided for groups of students with different demographic characteristics. Definitions of student population groups are as follows:
Gender: Results are reported separately for male students and female students.
Race/ethnicity: In the United States, students’ race/ethnicity was obtained through student responses to a two-part question in the student questionnaire. Students were asked first whether they were Hispanic or Latino and then whether they were members of the following racial groups: White (non-Hispanic), Black (non-Hispanic), Asian (non-Hispanic), American Indian or Alaska Native (non-Hispanic), or Native Hawaiian/Other Pacific Islander (non-Hispanic). Multiple responses to the race classification were allowed. Results are shown separately for White (non-Hispanic), Black (non-Hispanic), Hispanic, Asian (non-Hispanic), and non-Hispanic students who selected more than one race (labeled as Two or more races). Students identifying themselves as Hispanic and one or more race were included in the Hispanic group, rather than in a racial group.
PISA index of economic, social, and cultural status (ESCS): PISA uses a composite measure that combines into a single score the financial, social, cultural, and human capital resources available to students. ESCS has been computed and used in analyses since the first cycle of PISA in 2000. Currently, the ESCS index is derived from three indices: highest parental occupation (HISEI), highest parental education (PARED), and an IRT scale based on student reports of home possessions, including books in the home (HOMEPOS).
Eligibility for Free or Reduced-Price Lunch (FRPL): The percentage of students receiving free or reduced-price lunch is often used as a proxy measure for the percentage of students living in poverty. While this percentage can provide some information about relative poverty, it is not the actual percentage of students in poverty enrolled in school. The National School Lunch Program provides meals to millions of children each school day. All lunches provided by the program are considered subsidized to some extent because meal-service programs at schools must operate as nonprofit programs. While all students at participating schools are eligible for regular-priced lunches through the National School Lunch Program, there are multiple ways in which a student can become eligible for a free or reduced-price lunch; traditionally, family income has been used to establish eligibility. Despite its limitations, free or reduced-price lunch data are often used by education researchers as a proxy for school poverty because this count is generally available at the school level, while the poverty rate typically is not. In the U.S. version of the PISA school questionnaire, principals were asked for the percentage of students in their school eligible for FRPL.
Confidentiality analyses for the United States were designed to provide reasonable assurance that public-use data files issued by the PISA consortium would not allow identification of individual U.S. schools or students when compared against other public-use data collections. Disclosure limitations included identifying and masking potential disclosure risk to PISA schools and including an additional measure of uncertainty to school and student identification through random swapping of data elements within the student and school file. Swapping was designed to not significantly affect estimates of means and variances for the whole sample or reported subgroups (Krenzke et al. 2006).
Comparisons made in the text of this report have been tested for statistical significance. For example, in the commonly made comparison of OECD averages to U.S. averages, tests of statistical significance were used to establish whether or not the observed differences from the U.S. average were statistically significant.
In almost all instances, the tests for significance used were standard t tests. These fell into three categories according to the nature of the comparison being made: comparisons of independent samples, comparisons of nonindependent samples, and comparisons of performance over time. In PISA, education system groups are independent. We judge that a difference is “significant” if the probability associated with the t test is less than .05. If a test is significant, the difference in the observed means in the selected sample likely represents a real difference in the population.7 No adjustments were made for multiple comparisons.
In simple comparisons of independent averages, such as the average score of education system 1 with that of education system 2, the following formula was used to compute the t statistic:

t = (est1 − est2) / √(se1² + se2²)

where est1 and est2 are the estimates being compared (e.g., averages of education system 1 and education system 2) and se1² and se2² are the corresponding squared standard errors of these averages. The PISA 2018 data are hierarchical and include school and student data from the participating schools. The standard errors for each education system take into account the clustered nature of the sampled data. These standard errors are not adjusted for correlations between groups since the groups are independent.
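A minimal sketch of this computation, using illustrative averages and standard errors rather than real PISA values:

```python
import math

# Independent-samples t statistic: t = (est1 - est2) / sqrt(se1^2 + se2^2).
def t_independent(est1, se1, est2, se2):
    return (est1 - est2) / math.sqrt(se1 ** 2 + se2 ** 2)

# Illustrative averages and standard errors for two education systems:
t = t_independent(505.0, 3.57, 487.0, 3.0)
```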
The second type of comparison occurs when evaluating differences between nonindependent groups within the education system. Because of the sampling design in which schools and students within schools are randomly sampled, the data within the education system from mutually exclusive sets of students (for example, males and females) are not independent. For example, to determine whether the performance of females differs from that of males would require estimating the correlation between females’ and males’ scores. A BRR procedure, mentioned above, was used to estimate the standard errors of differences between nonindependent samples within the United States. Use of the BRR procedure implicitly accounts for the correlation between groups when calculating the standard errors.
To test comparisons between nonindependent groups, the following t statistic formula was used:

t = (estgrp1 − estgrp2) / se(grp1−grp2)

where estgrp1 and estgrp2 are the nonindependent group estimates being compared and se(grp1−grp2) is the standard error of the difference, calculated using BRR to account for the correlation between the estimates for the two nonindependent groups.
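A sketch of the same computation, with the standard error of the difference obtained from illustrative replicate estimates of the group difference (real PISA BRR uses 80 replicate weights with a Fay factor of 0.5); because each replicate difference is computed from the same replicate weights for both groups, the correlation between the groups is implicitly accounted for:

```python
import math

def brr_se_of_difference(full_diff, replicate_diffs, fay_k=0.5):
    """Fay's BRR standard error applied to replicate estimates of a difference."""
    r = len(replicate_diffs)
    ss = sum((d - full_diff) ** 2 for d in replicate_diffs)
    return math.sqrt(ss / (r * (1.0 - fay_k) ** 2))

diff = 10.0                          # e.g., female mean minus male mean (illustrative)
rep_diffs = [9.0, 12.0, 8.5, 10.5]   # illustrative replicate differences
t = diff / brr_se_of_difference(diff, rep_diffs)
```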
A third type of comparison—the addition of a standard error term to the standard t test shown above for simple comparisons of independent averages—was also used when analyzing change in performance over time. The transformation that was performed to equate the 2018 data with previous data depends upon the change in difficulty of each of the individual link items and as a consequence the sample of link items that have been chosen will influence the choice of transformation. This means that if an alternative set of link items had been chosen the resulting transformation would be slightly different. The consequence is an uncertainty in the transformation due to the sampling of the link items, just as there is an uncertainty in values such as country means due to the use of a sample of students. This uncertainty that results from the link item sampling is referred to as “linking error,” and this error must be taken into account when making certain comparisons between previous rounds of PISA (2000, 2003, 2006, 2009, 2012, and 2015) and PISA 2018 results. Just as with the error that is introduced through the process of sampling students, the exact magnitude of this linking error cannot be determined. We can, however, estimate the likely range of magnitudes for this error and take this error into account when interpreting PISA results. As with sampling errors, the likely range of magnitude for the errors is represented as a standard error.
|Exhibit 1: Standard errors of linking performance between PISA 2018 and previous cycles|
|Assessment cycles||Mathematics||Reading||Science||Financial literacy|
|2000 vs. 2018||†||4.04||†||†|
|2003 vs. 2018||2.80||7.77||†||†|
|2006 vs. 2018||3.18||5.24||3.47||†|
|2009 vs. 2018||3.54||3.52||3.52||†|
|2012 vs. 2018||3.34||3.74||4.01||5.55|
|2015 vs. 2018||2.33||3.93||1.51||9.37|
|† Not applicable.|
|NOTE: Comparisons between PISA 2018 scores and those from previous assessments can be made only back to the cycle in which the subject first became a major domain, or to later assessment cycles. As a result, comparisons of reading can be made as far back as PISA 2000; mathematics comparisons can be made as far back as PISA 2003; science comparisons can be made as far back as PISA 2006; and financial literacy comparisons can be made as far back as PISA 2012.|
|SOURCE: Organization for Economic Cooperation and Development (OECD), Program for International Student Assessment (PISA), 2018.|
In PISA, in each of the three subject matter areas, a common transformation was estimated from the link items, and this transformation was applied to all participating education systems when comparing achievement scores over time. It follows that any uncertainty introduced through the linking is common to all students and all education systems. Thus, for example, suppose the unknown linking error (between PISA 2015 and PISA 2018) in reading literacy resulted in an over-estimation of student scores by about four points on the PISA 2015 scale. Then every student’s score would be over-estimated by four score points. This over-estimation would affect certain, but not all, summary statistics computed from the PISA 2018 data.
In general terms, the linking error need only be considered when comparisons are being made between PISA 2015 and PISA 2018 results, and then usually only when group means are being compared. The most obvious example of a situation where there is a need to use linking error is in the comparison of the mean performance for a single education system between PISA 2015 and PISA 2018. For example, consider a comparison between 2015 and 2018 of the performance of the United States in reading. The mean performance of the United States in 2015 was 497 with a standard error of 3.42, while in 2018 the mean was 505 with a standard error of 3.57. Using rounded mean values, the standardized difference in the U.S. means is -1.267, which is computed as follows:

t = (497 − 505) / √(3.42² + 3.57² + 3.93²) = −1.267

where 3.93 is the standard error of linking reading performance between PISA 2015 and PISA 2018 (exhibit 1), and the difference is not statistically significant.
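This computation can be checked directly; 3.93 is taken here as the 2015-vs-2018 reading linking error, which reproduces the reported standardized difference:

```python
import math

# 2015-vs-2018 U.S. reading comparison: the standard error of the difference
# combines both sampling errors with the linking error.
diff = 497.0 - 505.0
se_diff = math.sqrt(3.42 ** 2 + 3.57 ** 2 + 3.93 ** 2)
t = diff / se_diff  # about -1.267; |t| < 1.96, so not significant at .05
```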
7 A .05 probability implies that the t statistic is among the 5 percent most extreme values one would expect if there were no difference between the means. The decision rule is that when t statistics are this extreme, the samples represent populations that likely have different means.
This section describes the success of participating education systems in meeting the international technical standards on data collection. Information is provided for all participating education systems on their coverage of the target population, exclusion rates, and response rates.
Table A-3 provides information on weighted school participation rates before and after school replacement and the number of participating schools after replacement for each participating education system. Table A-4 provides information on coverage of the target population, overall exclusion rates, weighted student response rates after school replacement, and the number of participating students after replacement for each participating education system.
In the United States, 136 original schools and 26 replacement schools participated in the 2018 administration of PISA. This resulted in 162 participating schools and an overall weighted school response rate of 76 percent. In the United States, 4,811 15-year-old students took part in the core PISA 2018 assessment and 1,520 15-year-old students took part in the financial literacy assessment. The U.S. overall student exclusion rate was 3.8 percent.
See section on International Requirements, above, for PISA international sampling guidelines and requirements regarding accommodations, exclusions, and response rate requirements, as well as response rates of all participating education systems. More detailed information may be found in the PISA 2018 Technical Report.
|Table A-3. Weighted participation rates and number of participating schools, by education system: 2018|
|Education system||Weighted school participation before replacement||Weighted school participation after replacement||Number of participating schools after replacement|
|Bosnia and Herzegovina||99.9||100.0||213|
|Hong Kong (China)||69.4||78.6||136|
|United Arab Emirates||99.4||99.4||754|
|NOTE: In calculating school participation rates, each school received a weight equal to the product of its base weight (the reciprocal of its probability of selection) and the number of age-eligible students enrolled in the school, as indicated on the sampling frame. Weighted school participation before replacement refers to the sum of weights of the original sample schools with PISA-assessed students and a student response rate of at least 50 percent over the sum of weights of all original sample schools. Weighted school participation after replacement refers to the sum of weights of the original and replacement schools with PISA-assessed students and a student response rate of at least 50 percent over the sum of weights of responding original sample schools, responding replacement schools, and eligible refusing original sample schools. Italics indicate non-OECD countries and education systems. B-S-J-Z (China) refers to the four PISA participating China provinces: Beijing, Shanghai, Jiangsu, and Zhejiang. Although Vietnam participated in 2018, technical problems with its data prevent results from being discussed in this report.|
|Table A-4. Coverage of target population, student exclusion and weighted participation rates, and number of students, by education system: 2018|
|Education system||Total population of 15-year-olds (number)||Coverage of 15-year-old population||Coverage of national desired population||Overall student exclusion rate||Weighted student participation after replacement||Number of participating students|
|Bosnia and Herzegovina||35,056||82.3||98.9||1.1||95.6||6,480|
|Hong Kong (China)||51,935||98.4||98.7||1.3||85.3||5,706|
|United Arab Emirates||59,275||91.8||98.0||2.0||95.6||19,265|
|NOTE: In calculating student participation rates, each student received a weight (student base weight) equal to the product of the school base weight—for the school in which the student was enrolled—and the reciprocal of the student selection probability within the school. Coverage of 15-year-old population refers to the extent to which the weighted participants covered the target population of all enrolled students in grades 7 and above. Coverage of national desired population refers to the extent to which the weighted participants covered the national population of 15-year-olds under the non-excluded portion of the student sample. Overall student exclusion rate is the percentage of students excluded for intellectual or functional disabilities, or insufficient assessment language experience at either the school level or within schools. Weighted student participation after replacement refers to the sum of weights of students in original and replacement schools with PISA-assessed students and a student response rate of at least 50 percent over the sum of weights of students in responding original sample schools, responding replacement schools, and eligible refusing original sample schools. Italics indicate non-OECD countries and education systems. B-S-J-Z (China) refers to the four PISA participating China provinces: Beijing, Shanghai, Jiangsu, and Zhejiang. Although Vietnam participated in 2018, technical problems with its data prevent results from being discussed in this report.|
Since the U.S. PISA weighted school response rates are below 85 percent, NCES requires an investigation into the potential magnitude of nonresponse bias at the school level in the U.S. sample. The investigation into nonresponse bias at the school level for the U.S. PISA effort shows statistically significant relationships between response status and some of the available school characteristics that were examined in the analyses.
The general approach taken involves an analysis in three parts as described below.
The first analysis indicates the potential for nonresponse bias that was introduced through school nonresponse. The second analysis suggests the remaining potential for nonresponse bias after the mitigating effects of substitution have been accounted for. The third analysis indicates the potential for bias after accounting for the mitigating effects of both substitution and nonresponse weight adjustments. The second and third analyses, however, may paint an overly optimistic picture: substitution and nonresponse adjustments may correct somewhat for deficiencies in the characteristics examined, but there is no guarantee that they are equally effective for other characteristics and, in particular, for student achievement.
In addition to these tests, logistic regression models were used to provide a multivariate analysis that examined the conditional independence of these school characteristics as predictors of participation. The logistic regression compared frame characteristics for participating schools with non-participating schools, which is effectively the same as comparing the participating schools to the eligible sample as in the bivariate analysis.
Multivariate analysis can provide additional insights, over and above those gained through the bivariate analysis. It may be the case that only one or two variables are actually related to participation status. However, if these variables are also related to the other variables examined in the analyses, then other variables, which are not related to participation status, will appear as significant in simple bivariate tables. Multivariate analysis, in contrast, examines the conditional relationships with participation after controlling for the other predictor variables—thereby, testing the robustness of the relationships between school characteristics and participation.
Participating PISA schools and the total eligible PISA school sample were compared by as many school sampling frame characteristics as possible that might provide information about the presence of nonresponse bias. Comparing frame characteristics between participating schools and the total eligible school sample is not an ideal measure of nonresponse bias if the characteristics are unrelated or weakly related to more substantive items in the survey; however, often it is the only approach available since PISA data are not available for nonparticipating schools. While the school-level characteristics used in these analyses are limited to those available in the sampling frame, each of the variables had a demonstrated relationship to achievement in previous PISA cycles.
A summary of the findings is provided below. Additional details on the nonresponse bias analysis can be found in NCES’ Technical Report and User Guide for the 2018 Program for International Student Assessment (PISA) (Kastberg et al. forthcoming).
For original sample schools (not including substitute schools), nine variables were found to be statistically significantly related to participation in the bivariate analysis: school control, census region, poverty level, total and age-eligible enrollments, percentages of White (non-Hispanic), Black (non-Hispanic), and Hispanic students, and eligibility for free or reduced-price lunch. Additionally, the absolute value of the relative bias for small and large schools, American Indian or Alaska Native, and Hawaiian/Pacific Islander was greater than 10 percent, which indicates potential bias even though no statistically significant relationship was detected. Although each of these findings indicates some potential for nonresponse bias, when all of the factors were considered simultaneously in a regression analysis, the first model (with seven race/ethnicity variables) showed that the Northeast region, high poverty, and Two or more races were significant predictors of school participation. The second model (with summed race/ethnicity percentage) showed that high poverty was a significant predictor of participation. The third model (with summed race/ethnicity percentage using public schools only) showed that high poverty was a significant predictor of school participation among public schools only.
For the final sample of schools (with substitute schools) with school nonresponse adjustments applied to the weights, no variables were found to be statistically significantly related to participation in the bivariate analysis. However, the absolute value of the relative bias for small sized schools and Hawaiian/Pacific Islander is greater than 10 percent. The multivariate regression analysis cannot be conducted after the school nonresponse adjustments are applied to the weights. The concept of nonresponse-adjusted weights does not apply to the nonresponding units, and, thus, we cannot conduct an analysis that compares respondents with nonrespondents using nonresponse-adjusted weights.
In sum, the investigation into nonresponse bias at the school level in the U.S. PISA 2018 data provides evidence that there is some potential for nonresponse bias in the PISA participating original sample based on the characteristics studied. It also suggests that, while there is some evidence that the use of substitute schools reduced the potential for bias, it has not reduced it substantially. However, after the application of school nonresponse adjustments, there is little evidence of resulting potential bias in the available frame variables and correlated variables in the final sample.
8 The size-adjusted weight modifies the PPS weight so that schools with a relatively small number of students (and large school base weights) will not influence the results more than schools with a relatively large number of students (and small school base weights).
Judkins, D.R. (1990). Fay’s method for variance estimation. Journal of Official Statistics 6 (3), 223–239.
Kastberg, D., Perkins, R., Cummings, L., Ferraro, D., and Goodnow, M. (Forthcoming). Technical Report and User Guide for the 2018 Program for International Student Assessment (PISA). Washington, DC: U.S. Department of Education, National Center for Education Statistics.
Krenzke, T., Roey, S., Dohrmann, S.M., Mohadjer, L., Haung, W-C., Kaufman, S., and Seastrom, M. (2006). Tactics for Reducing the Risk of Disclosure Using the NCES DataSwap Software. Proceedings of the American Statistical Association, Survey Research Methods Section. Philadelphia: American Statistical Association.
Organization for Economic Cooperation and Development (OECD). (2015). PISA 2018 Technical Standards. Paris: Author. Available online at http://www.oecd.org/pisa/pisaproducts/PISA-2018-Technical-Standards.pdf.
Organization for Economic Cooperation and Development (OECD). (Forthcoming). PISA 2018 Technical Report. Paris: Author.