Program for International Student Assessment (PISA)
4. SURVEY DESIGN
The survey design for PISA data collections is discussed in this section.
TARGET POPULATION
The desired PISA target population consisted of 15-year-old students in grades 7 through 12 attending public or private educational institutions located within the jurisdiction. Jurisdictions were to include 15-year-old students enrolled either full time or part time in an educational institution, in a vocational training or related type of educational program, or in a foreign school within the jurisdiction (as well as students from other jurisdictions attending any of the programs in the first three categories). Persons schooled in the home, in the workplace, or outside the jurisdiction were not tested; therefore, these students were not included in the international target population.
The operational definition of an age population depends directly on the testing dates. International standards required that students in the sample be 15 years and 3 months to 16 years and 2 months at the beginning of the testing period. The technical standard for the maximum length of the testing period was 42 consecutive days. Most education systems conducted testing from March through August 2018. The United States and the United Kingdom were given permission to move the testing dates to October through November in an effort to improve response rates. In the United States, students born between July 1, 2002, and June 30, 2003, were eligible to participate in PISA 2018.
The U.S. PISA 2018 national school sample consisted of 257 schools. This number represents an increase from the international minimum requirement of 150 and was implemented to offset anticipated school nonresponse and reduce design effects. Schools were selected with probability proportionate to the school's estimated enrollment of 15-year-olds. The data for public schools were from the 2015–16 Common Core of Data (CCD) and the data for private schools were from the 2015–16 Private School Universe Survey (PSS). Any school containing at least one of grades 7 through 12 was included in the school sampling frame. Participating schools provided a list of 15-year-old students (typically in August or September 2018) from which the sample was drawn using sampling software provided by the international contractor.
INTERNATIONAL SAMPLE DESIGN
The sample design for PISA 2018 was a stratified systematic sample, with sampling probabilities proportional to the estimated number of 15-year-old students in the school based on grade enrollments. Samples were drawn using a two-stage sampling process. The first stage was a sample of schools, and the second stage was a sample of students within schools. The PISA international contractors responsible for the design and implementation of PISA internationally (hereafter referred to as the PISA consortium) drew the sample of schools for each economy.
The international guidelines specified that within schools, a sample of 42 students was to be selected in an equal probability sample unless fewer than 42 students age 15 were available (in which case all 15-year-old students were selected). The target cluster size for countries/economies participating in the international option of financial literacy (FL) was increased to 52 students. A minimum of 6,300 students from a minimum of 150 schools was required in each country that planned to administer computer-based assessments. Education systems that opted to conduct paper-based assessments were required to assess a minimum of 5,250 students from a minimum of 150 schools. Following the PISA consortium guidelines, replacement schools were identified at the same time the PISA sample was selected by assigning the two schools neighboring the sampled school in the frame as replacements. For countries administering financial literacy, an additional sample of students was selected. If a jurisdiction had fewer than 5,250 eligible students, then the sample size was the national defined target population. The national defined target population included all eligible students in the schools that were listed in the school sampling frame.
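As an illustration of this two-stage design, the sketch below draws a probability-proportionate-to-size (PPS) systematic sample of schools and then an equal-probability sample of students within each sampled school. The frame layout, variable names, and inputs are illustrative assumptions; this is not the PISA consortium's operational sampling software.

```python
import random

def pps_systematic_school_sample(frame, n_schools):
    """Systematic PPS sample: a school's chance of selection is
    proportional to its estimated enrollment of 15-year-olds."""
    total = sum(s["est_15yo"] for s in frame)
    interval = total / n_schools
    start = random.uniform(0, interval)
    points = [start + k * interval for k in range(n_schools)]
    selected, cumulative, i = [], 0.0, 0
    for school in frame:
        cumulative += school["est_15yo"]
        while i < n_schools and points[i] <= cumulative:
            selected.append(school)
            i += 1
    return selected

def sample_students(roster, cluster_size=42):
    """Equal-probability sample within a school; take every age-eligible
    student when fewer than cluster_size are listed."""
    return list(roster) if len(roster) <= cluster_size else random.sample(roster, cluster_size)

# Hypothetical frame, already sorted by the implicit stratification variables.
frame = [{"school_id": i, "est_15yo": random.randint(20, 400)} for i in range(2000)]
sampled_schools = pps_systematic_school_sample(frame, n_schools=257)
students = sample_students(list(range(180)))  # hypothetical roster of 180 age-eligible students
```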
International within-school exclusion rules for students were specified as follows:
- Students with functional disabilities. These were students with a moderate to severe permanent physical disability such that they could not perform in the PISA testing environment.
- Students with intellectual disabilities. These were students with a mental or emotional disability who had been tested as cognitively delayed or who were considered in the professional opinion of qualified staff to be cognitively delayed such that they could not perform in the PISA testing situation.
- Students with insufficient language experience. These were students who met the three criteria of (1) not being a native speaker in the assessment language, (2) having limited proficiency in the assessment language, and (3) having received less than a year of instruction in the assessment language. In the United States, English was the exclusive language of the assessment.
A school attended only by students who would be excluded for functional, intellectual, or linguistic reasons was considered a school-level exclusion. International exclusion rules for schools allowed for schools in remote regions, very small schools, and special education schools to be excluded. School-level exclusions for inaccessibility, feasibility, or other reasons were required to cover fewer than 0.5 percent of the total number of students in the international PISA target population. International guidelines state that no more than 5 percent of a jurisdiction's desired national target population should be excluded from the sample.
Response Rate Targets
School response rates. The PISA international guidelines for the 2018 assessment required that jurisdictions achieve an 85 percent school response rate. However, while stating that each jurisdiction must make every effort to obtain cooperation from the sampled schools, the requirements also recognized that this is not always possible. Thus, it was allowable to use substitute, or replacement, schools as a means to avoid loss of sample size associated with school nonresponse. The international guidelines stated that at least 65 percent of participating schools must be from the original sample. Education systems were only allowed to use replacement schools (selected during the sampling process) to increase the response rate once the 65 percent benchmark had been reached.
Each sampled school was to be assigned two replacement schools in the sampling frame. If the original sampled school refused to participate, a replacement school was asked to participate. One sampled school could not substitute for another sampled school, and a given school could only be assigned to substitute for one sampled school. A requirement of these substitute schools was that they be in the same explicit stratum as the original sampled school. The international guidelines define the response rate as the number of participating schools (both original and replacement schools) divided by the total number of eligible original sampled schools.2
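A minimal sketch of the school response-rate calculation under the definition above, with the more conservative variant described in footnote 2 shown for comparison; the counts passed in are hypothetical, and the published PISA rates are additionally weighted by enrollment.

```python
def international_rate(orig_participating, repl_participating, orig_eligible):
    """International-guideline definition: participating original plus
    replacement schools over eligible original sampled schools."""
    return (orig_participating + repl_participating) / orig_eligible

def conservative_rate(orig_participating, repl_participating,
                      orig_eligible, repl_hard_refusals):
    """Footnote 2 variant: participating replacements and replacement
    hard refusals also enter the denominator."""
    return (orig_participating + repl_participating) / (
        orig_eligible + repl_participating + repl_hard_refusals)

# Hypothetical unweighted counts:
print(international_rate(130, 25, 200))      # 0.775
print(conservative_rate(130, 25, 200, 10))   # about 0.66
```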
Student response rates. The international technical standards required a minimum participation rate of 80 percent of sampled students from schools (sampled and replacement) within each jurisdiction. This target applied in aggregate, not to each individual school. Follow-up sessions were required in schools where too few students participated in the originally scheduled test sessions to ensure a high overall student response rate. Replacement students within a school were not allowed. A student was considered to be a participant if he or she participated in the first testing session or a follow-up or makeup testing session.
Within each school, a student response rate of 50 percent was required for a school to be regarded as participating: the overall student response rate was computed using only students from schools with at least a 50 percent response rate. Weighted student response rates were used to determine if this standard was met; each student's weight was the reciprocal of his or her probability for selection into the sample.
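The sketch below illustrates how a weighted student response rate might be computed under these rules, counting only schools whose own weighted response rate is at least 50 percent; the data structure and names are illustrative assumptions.

```python
def weighted_student_response_rate(schools):
    """Each school supplies base weights (reciprocals of selection
    probabilities) for its sampled and its assessed students; schools
    below the 50 percent threshold are excluded from both totals."""
    assessed_total = sampled_total = 0.0
    for school in schools:
        sampled_wt = sum(school["sampled_weights"])
        assessed_wt = sum(school["assessed_weights"])
        if sampled_wt > 0 and assessed_wt / sampled_wt >= 0.50:
            assessed_total += assessed_wt
            sampled_total += sampled_wt
    return assessed_total / sampled_total if sampled_total else 0.0
```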
Sample Design in the United States
The PISA 2018 school sample was drawn for the United States by the PISA consortium. The U.S. PISA sample was stratified into 8 explicit groups based on region of the country (Northeast, Central, West, Southeast) and control of school (public or private). Within each stratum, the frame was sorted for sampling by five categorical stratification variables: grade range of the school (five categories); type of location relative to populous areas (city, suburb, town, rural); combined percentage of Black, Hispanic, Asian, Native Hawaiian/Pacific Islander, and American Indian/Alaska Native students (above or below 15 percent); gender (mostly female (percent female ≥ 95 percent), mostly male (percent female < 5 percent), and other); and state.
The U.S. PISA 2018 national school sample consisted of 257 schools, which was higher than the international sampling minimum of 150 to offset anticipated school nonresponse and ineligibility. The U.S. national sample included both public and private schools. Of the 52 students who were randomly sampled within each school, 41 students took the mathematics, science and reading literacy assessments and 11 students took the optional financial literacy assessment. The group of students who took the financial literacy assessment were referred to as the "financial literacy sample". Note that this was different from the approach used in the 2015 cycle, when financial literacy was administered to a subset of the students in the main PISA sample.
A total of 162 schools participated in the administration of U.S. national PISA, including 136 participating schools sampled as part of the original sample and 26 schools sampled as replacements for nonparticipating "original" schools. The overall weighted school response rate after replacements was 76 percent. For the United States as a whole, the weighted student response rate was 85 percent and the student exclusion rate was 4 percent.
In addition to the international response rate standards described in the prior section, the U.S. sample had to meet the statistical standards of the National Center for Education Statistics (NCES) of the U.S. Department of Education. For an assessment like PISA, NCES requires that a nonresponse bias analysis be conducted when the response rate for schools falls below 85 percent or the response rate for students falls below 85 percent.
Assessment Design
Test scope and format. In PISA 2018, three subject domains were tested, with reading literacy as the major domain and science and mathematics as the minor domains. Financial literacy was an optional domain administered by 21 education systems, including the United States. An innovative (optional) domain in this cycle was global competence; the United States did not participate in this domain.
The development of the PISA 2018 assessment instruments was an interactive process among the PISA Consortium, various expert committees, and OECD members. All mathematics and science items in the 2018 assessment instrument were trend items from previous assessments. Reading literacy and financial literacy items included both trend items and new items developed for 2018. Representatives of each jurisdiction reviewed the items for possible bias and for relevance to PISA's goals. The intention was to reflect in the assessment the national, cultural, and linguistic variety of the OECD jurisdictions. Following a field trial that was conducted in most jurisdictions, test developers and expert groups considered a variety of aspects in selecting the items for the main study: (a) the results from the field trial, (b) the outcome of the item review from jurisdictions, and (c) queries received about the items.
PISA 2018 was a computer-based assessment in most jurisdictions, including the United States. Test formats included multiple-choice and open response. Approximately 60 to 65 percent of items were multiple-choice and 35 to 40 percent were open response across reading, mathematics and science. Open response items were graded by trained scorers.
Multiple-choice items were either (a) standard multiple choice, with a limited number (usually four) of responses from which students were required to select the best answer; or (b) complex multiple choice, which presented several statements, each of which required students to choose one of several possible responses (true/false, correct/incorrect, etc.). Closed- or short-response items included items that required students to construct their own responses from a limited range of acceptable answers or to provide a brief answer from a wider range of possible answers, such as mathematics items requiring a numeric answer, and items requiring a word or short phrase. Open constructed-response items required more extensive writing, or showing a calculation, and frequently included some explanation or justification. Pencils, erasers, rulers, and (in some cases) calculators were provided.
Test design. The PISA 2018 computer-based assessment was designed as a two-hour test. For the major subject of reading, material equivalent to 15 clusters of 30 minutes was developed. PISA 2018 adopted a multistage adaptive approach in the reading assessment, in which the reading material was organized into blocks instead of clusters. The reading assessment was composed of a core stage followed by stage 1 and stage 2. Students first saw a short Core stage, which consisted of between 7 and 10 items. The vast majority of these items (at least 80 percent and always at least 7 items) were automatically scored. Students' performance in this stage was provisionally classified as low, medium, or high, depending on the number of correct answers to these automatically scored items. The various Core blocks of material delivered to students did not differ in any meaningful way in their difficulty. Stages 1 and 2, however, each existed in two different forms: comparatively easy and comparatively difficult. Students who displayed medium performance in the Core stage were equally likely to be assigned an easy or a difficult Stage 1. Students who displayed low performance in the Core stage had a 90 percent chance of being assigned to an easy Stage 1 and a 10 percent chance of being assigned to a difficult Stage 1. Students who displayed high performance in the Core stage had a 90 percent chance of being assigned to a difficult Stage 1 and a 10 percent chance of being assigned to an easy Stage 1. Students were assigned to easy and difficult Stage 2 blocks of material in much the same way.
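A minimal simulation of the stage-assignment probabilities described above; the classification labels and function name are illustrative, not the operational routing rules.

```python
import random

def assign_stage_form(core_performance):
    """Return 'easy' or 'difficult' for Stage 1 (Stage 2 is assigned in
    much the same way) given the provisional Core-stage classification."""
    p_difficult = {"low": 0.1, "medium": 0.5, "high": 0.9}[core_performance]
    return "difficult" if random.random() < p_difficult else "easy"

# e.g., a student classified as 'high' in the Core stage has a 90 percent
# chance of receiving the comparatively difficult Stage 1 form.
print(assign_stage_form("high"))
```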
In PISA 2018, in addition to the typical reading literacy items, the reading literacy assessment included a measure of reading fluency in the form of sentence processing. This measure required students to make a sensibility judgment about sentences of increasing complexity. Each student was assigned two fluency clusters, for a total of 21 or 22 sentences, before the reading literacy clusters. The reading fluency task was administered within a 3-minute timed session. Any sentences not completed within the 3-minute session were skipped.
To measure trends in the subjects of mathematics and science, six clusters were included in each subject. In addition, four clusters of global competence items were developed. There was a total of 72 different test forms. Students spent one hour on the reading assessment plus one hour on one or two other subjects – mathematics, science or global competence. The financial literacy assessment lasted one hour (in addition to the regular PISA assessment) and comprised two clusters distributed to a subsample of students in combination with the reading and mathematics assessments.
For countries like the United States that took part in the core computer-based assessment (CBA) and the optional financial literacy assessment but did not opt to take part in the optional global competency domain, a total of 36 CBA testing forms were assembled for the assessment. 92 percent of students received forms numbered 1–24 of these 36 forms while 8 percent of students received forms numbered 25–36. These percentages are based on random assignment of test forms to students across schools.
Countries that used paper-based delivery for the main survey measured student performance with 30 pencil-and-paper forms containing trend items in the three core PISA subjects: reading, mathematics and science. Each form included one hour of reading items and items from at least one of the other two core domains. As a result, all students were administered two clusters of reading items, 46 percent of participating students were administered two clusters of mathematics items, 46 percent were administered two clusters of science items, and 8 percent were administered one cluster of mathematics and one cluster of science items, thus providing the covariance information about the three domains.
Data Collection and Processing
PISA 2018 was coordinated by the OECD and managed at the international level by the PISA Consortium. PISA is implemented in each education system by a National Project Manager (NPM). In the United States, the NPM works with a national data collection contractor to implement procedures prepared by the PISA Consortium and agreed to by the participating jurisdictions. In 2018, the U.S. national data collection contractor was Westat, with Pearson serving as a subcontractor.
The 2018 PISA multicycle study was again a collaboration among the governments of participating countries, the Organization for Economic Cooperation and Development (OECD), and a consortium of international organizations, referred to as the PISA Consortium. In 2018, this consortium consisted of the Educational Testing Service (ETS), the U.S. research company Westat, cApStAn Linguistic Quality Control, Pearson, the German Institute for International Education Research (DIPF), Statistics Canada, the University of Liège (aSPE) in Belgium, the University of Luxembourg, and the Australian Council for Educational Research (ACER) in Australia.
Reference dates. Each economy collected its own data, following international guidelines and specifications. The technical standards required that students in the sample be 15 years and 3 months to 16 years and 2 months at the beginning of the testing period. Most education systems conducted testing from March through August 2018. The United States and the United Kingdom were given permission to move the testing dates to September through December in an effort to improve response rates. The range of eligible birth dates was adjusted so that the mean age remained the same (i.e., 15 years and 3 months to 16 years and 2 months at the beginning of the testing period). In 2003, the United States conducted PISA in the spring and fall and found no significant difference in student performance between the two time points.
Incentive. School packages were mailed to principals in mid-September with phone contact from recruiters beginning a few days after the mailing. As part of the PISA 2012 school recruitment strategy, the materials included a description of school and student incentives. Schools and school coordinators were each paid $200, and students received $25 and 4 hours of community service for participating in the paper-based session and an additional $15 if they were selected and participated in the computer-based assessment.
Data collection. The PISA consortium emphasized the use of standardized procedures in all education systems. Each economy collected its own data, based on detailed manuals provided by the PISA consortium (Westat 2014) that explained the survey's implementation, including precise instructions for the work of school coordinators and test administrators and scripts for test administrators to use in testing sessions. Test administration in the United States was conducted by professional staff trained in accordance with the international guidelines. Students were allowed to use calculators, and U.S. students were provided calculators.
In each education system, a PISA Quality Monitor (PQM) who was engaged independently by the PISA consortium observed test administrations in a subsample of participating schools. The schools in which the independent observations were conducted were selected jointly by the PISA consortium and the PQM. In the United States, there were five PQMs who observed 15 schools from the national sample. The PQM's primary responsibility was to document the extent to which testing procedures in schools were implemented in accordance with test administration procedures. The PQM's observations in U.S. schools indicated that international procedures for data collection were applied consistently.
Scoring. A substantial portion of the PISA 2018 assessment was devoted to open constructed-response items. The process of scoring these items is an important step in ensuring the quality and comparability of the PISA data. Detailed guidelines were developed for the scoring guides themselves, training materials to recruit scorers, and workshop materials used for the training of national scorers. Prior to the national training, the PISA Consortium organized international training sessions to present the material and train scoring coordinators from the participating jurisdictions, who in turn trained the national scorers.
For each test item, the scoring guides described the intent of the question and how to code students' responses. This description included the credit labels (full credit, partial credit, or no credit) attached to the possible categories of response. Also included was a system of double-digit coding for some mathematics and science items, in which the first digit represented the score and the second digit represented the strategy or approach the student used to solve the problem. The second digit made it possible to generate national profiles of student strategies and misconceptions. In addition, the scoring guides included real examples of students' responses, accompanied by a rationale for their classification, for purposes of clarity and illustration.
To examine the consistency of this marking process in more detail within each jurisdiction (and to estimate the magnitude of the variance components associated with the use of scorers), the PISA Consortium generated an inter-rater reliability report on a subsample of assessment booklets. The results of the homogeneity analysis showed that the marking of items was largely satisfactory and that countries were, on average, generally consistent in their coding of the open-ended responses.
In PISA 2018, the process used for the main survey coding training was slightly different from that employed prior to the field trial, as it included full training for all main survey items, both new and trend items. The coder query service was again used in the main survey, as it had been in the field trial, to assist countries in clarifying any uncertainty about the coding process or students' responses. Queries were reviewed, and responses were provided, by domain-specific teams that included item developers and members of the response teams from previous cycles. Revisions were made to the coding guides for reading and global competence following the field trial and the field trial pilot, respectively. The coder queries helped test developers see response categories that were not anticipated during the development of the coding guides. Based on the queries received, test developers made some coding guides clearer and added sample responses to better illustrate different types of responses.
Data entry and verification. In PISA 2018, a National Project Manager (NPM) in each jurisdiction was responsible for administering the assessments and collecting data files in a common international format. Variables could be added or deleted as needed for different national options, and approved adaptations to response categories could also be accommodated. The Student Delivery System (SDS), a self-contained set of applications, was used to deliver the PISA 2018 computer-based assessments and the computer-based student background questionnaires. A master version was assembled first for countries to test within their national IT infrastructure, which allowed countries to become familiar with the operation of the SDS and to check the compatibility of the software with the computers being used to administer the assessment. After all components of the national materials were locked, including the questionnaires and cognitive instruments, each national SDS was assembled and released to countries for national testing. Countries were asked to check their SDS following a specified testing plan and to identify any residual content or layout issues. Where issues were identified, they were corrected and a second SDS was released. Once countries signed off on their national SDS, their instruments were released for the field trial and the main survey.
Harmonization or harmonizing variables is a process of mapping the national response categories of a particular variable into the international response categories so they can be compared and analyzed across countries. Not every nationally-adapted variable required harmonization, but for those that required harmonization, the Data Management team assisted the Background Questionnaire contractor with creating the harmonization mappings for each country with SAS code. This code was implemented into the data management cleaning and verification software in order to handle these harmonized variables during processing. ETS Data Management collaborated with the Background Questionnaire contractor to develop a series of validation checks that were performed on the data following harmonization.
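A minimal sketch of what one harmonization mapping might look like; the country, variable name, and category codes below are hypothetical (the operational mappings were implemented in SAS within the data management cleaning and verification software).

```python
# Hypothetical mapping of national response codes to international categories
# for one nationally adapted questionnaire variable.
HARMONIZATION_MAP = {
    ("USA", "NAT_PARENT_ED"): {      # hypothetical national variable
        "1": "ISCED level 2",        # national code -> international category
        "2": "ISCED level 3",
        "3": "ISCED level 5 or above",
        "9": None,                   # national 'not reported' -> missing
    },
}

def harmonize(country, variable, national_value):
    """Map a national response code to its international category;
    unmapped values are surfaced for the post-harmonization validation checks."""
    mapping = HARMONIZATION_MAP[(country, variable)]
    if national_value not in mapping:
        raise KeyError(f"Unmapped code {national_value!r} for {country}/{variable}")
    return mapping[national_value]
```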
Estimation Methods
Weighting. The use of sampling weights is necessary for computing statistically sound, nationally representative estimates. The survey weights adjust for the probabilities of selection of individual schools and students, for school and student nonresponse, and for errors in estimating the size of the school or the number of 15-year-olds in the school at the time of sampling. Survey weighting for all education systems participating in PISA 2018 was coordinated by Westat, as part of the international PISA consortium.
The school base weight was defined as the reciprocal of the school's probability of selection multiplied by the number of eligible students in the school. (For replacement schools, the school base weight was set equal to the original school it replaced.) The student base weight was given as the reciprocal of the probability of selection for each selected student from within a school.
The product of these base weights was then adjusted for school and student nonresponse. The school nonresponse adjustment was done individually for each education system by cross-classifying the explicit and implicit stratification variables defined as part of the sample design.
The student nonresponse adjustment was done within cells based first on their school nonresponse cell and their explicit stratum; within that, grade and gender were used when possible.
All PISA analyses were conducted using these adjusted sampling weights.
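A simplified sketch of how the adjusted student weight described above could be assembled; the probabilities and adjustment factors are illustrative inputs, not Westat's operational weighting system.

```python
def adjusted_student_weight(p_school, p_student_within,
                            school_nr_adjustment, student_nr_adjustment):
    """Base weights are reciprocals of the selection probabilities; their
    product is then scaled by the school and student nonresponse
    adjustments computed within the stratification-based cells."""
    school_base = 1.0 / p_school
    student_base = 1.0 / p_student_within
    return school_base * student_base * school_nr_adjustment * student_nr_adjustment

# Illustrative values: a school selected with probability 0.02 and a student
# selected with probability 42/180 within that school.
w = adjusted_student_weight(0.02, 42 / 180,
                            school_nr_adjustment=1.15,
                            student_nr_adjustment=1.05)
```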
Scaling. For PISA 2018, item response theory (IRT) was used to estimate average scores for reading, science, and mathematics literacy for each education system, as well as for three reading process and three reading content subscales. For education systems participating in the financial literacy assessment, that assessment was scaled separately and assigned a separate score. Scores for students were estimated as plausible values because each student completed only a subset of items. Ten plausible values were estimated for each student for each scale. These values represented the distribution of potential scores for all students in the population with similar characteristics and identical patterns of item response. Statistics describing performance on the PISA reading, science, mathematics, and financial literacy scales are based on plausible values. In PISA, the reading, science, mathematics, and financial literacy scales range from 0 to 1,000.
The PISA 2015 main study computer-based assessment included six clusters from each of the trend domains of science, reading, and mathematics literacy, six clusters of new science literacy test items, and three clusters of new collaborative problem-solving materials. The clusters were allocated in a rotated design to create six groups of test forms. Every student taking the assessment answered science items, and at least one but up to two of the other subjects of mathematics literacy, reading literacy, and/or collaborative problem solving. Students who were subsampled for the financial literacy assessment returned for a second session in which the focus was only on financial literacy and the accompanying student questionnaire.
The fact that each student completed only a subset of items means that classical test scores, such as the percent correct, are not accurate measures of student performance. Instead, scaling techniques were used to establish a common scale for all students.
In PISA 2009, item response theory (IRT) was used to estimate average scores in each jurisdiction for science, mathematics, and reading literacy, as well as for three reading literacy subscales: integrating and interpreting, accessing and retrieving, and reflecting and evaluating. Subscale scores were not available for mathematics literacy or science literacy for 2009 because not all students answered science and/or mathematics items.
IRT identifies patterns of response and uses statistical models to predict the probability of a student answering an item correctly as a function of his or her proficiency in answering other questions. PISA 2009 used a mixed coefficients multinomial logit IRT model. This model is similar in principle to the more familiar two-parameter logistic IRT model. With the multinomial logit IRT model, the performance of a sample of students in a subject area or subarea can be summarized on a simple scale or series of scales, even when students are administered different items.
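For reference, the two-parameter logistic model mentioned above expresses the probability of a correct response as a function of student proficiency and item parameters; the notation below is a standard textbook form rather than the exact parameterization used in the PISA 2018 operational scaling.

```latex
% Two-parameter logistic (2PL) item response function:
% \theta_j = proficiency of student j, b_i = difficulty of item i,
% a_i = discrimination of item i.
P(X_{ij} = 1 \mid \theta_j) =
  \frac{\exp\left[a_i(\theta_j - b_i)\right]}
       {1 + \exp\left[a_i(\theta_j - b_i)\right]}
```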
For PISA 2012, IRT was used to estimate average scores for mathematics, science, and reading literacy for each economy, as well as for three mathematics process and four mathematics content scales. For education systems participating in the financial literacy assessment and the computer-based assessment, these assessments were scaled separately and assigned separate scores.
For PISA 2015, IRT was used to estimate average scores for science, reading, and mathematics literacy for each economy, as well as for three science process and three science content subscales. For education systems participating in the financial literacy assessment and the collaborative problem-solving assessment, these assessments were scaled separately and assigned separate scores.
Plausible values. Scores for students are estimated as plausible values because each student completed only a subset of items. These values represent the distribution of potential scores for all students in the population with similar characteristics and identical patterns of item response. It is important to recognize that plausible values are not test scores and should not be treated as such. Plausible values are randomly drawn from the distribution of scores that could be reasonably assigned to each individual. As such, the plausible values contain random error variance components and are not optimal as scores for individuals. Ten plausible values were estimated for each student for each scale in PISA 2015 and PISA 2018. Thus, statistics describing performance on the PISA science, reading, and mathematics literacy scales are based on plausible values.
If an analysis is to be undertaken with one of these cognitive scales, then (ideally) the analysis should be undertaken ten times, once with each of the ten relevant plausible value variables. The results of these ten analyses are averaged; then, significance tests that adjust for variation between the ten sets of results are computed.
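A minimal sketch of that combining procedure, assuming the per-plausible-value point estimates and their sampling variances have already been computed (for example, with replicate weights); the function and variable names are illustrative.

```python
import statistics

def combine_plausible_values(estimates, sampling_variances):
    """Average the point estimates across the plausible values, then add the
    between-plausible-value (imputation) variance to the average sampling
    variance to obtain a standard error reflecting both sources of error."""
    m = len(estimates)
    point = statistics.mean(estimates)
    within = statistics.mean(sampling_variances)
    between = statistics.variance(estimates)        # m - 1 in the denominator
    total_variance = within + (1 + 1 / m) * between
    return point, total_variance ** 0.5             # estimate, standard error

# Usage: est, se = combine_plausible_values(pv_means, pv_sampling_vars),
# where pv_means holds the ten mean-score estimates, one per plausible value.
```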
Imputation. Missing background data from student and school questionnaires are not imputed for PISA 2009 reports. PISA 2015 also did not impute missing information for questionnaire variables.
In general, item response rates for variables discussed in NCES PISA reports exceed the NCES standard of 85 percent.
Measuring trends. Although science was assessed in 2000 and 2003, because the science framework was revised for 2006, it is possible to look at changes in science only from 2006 forward. Similarly, although reading was assessed in 2000, 2003, and 2006, and mathematics was assessed in 2000, because the reading framework was revised for PISA 2009 and the mathematics framework was revised for PISA 2003, it is possible to look at changes in reading only from 2009 forward and in mathematics only from 2003 forward. Although the PISA 2012 framework was updated, it is still possible to measure trends over time, as the underlying construct is intact. For specific trends in performance results, please see the NCES PISA website (https://nces.ed.gov/surveys/pisa/pisa2018/index.asp#/).
The PISA 2000, 2003, 2006, 2009, 2012, 2015, and 2018 assessments of reading, mathematics, and science are linked assessments. That is, the sets of items used to assess each domain in each year include a subset of common items; these common items are referred to as link items. In PISA 2000 and PISA 2003, there were 28 reading items, 20 mathematics items, and 25 science items that were used in both assessments. The same 28 reading items were retained in 2006 to link the PISA 2006 data to PISA 2003. The PISA 2009 assessment included 26 of these 28 reading items; a further 11 reading items from PISA 2000, not used since that administration, were also included in PISA 2009. The PISA 2012 assessment included 37 of these link items from 2009 as well as an additional 7 items included in 2009 to establish the reading trend scale. In mathematics, 48 items from PISA 2003 were used in PISA 2006; PISA 2009 included 35 of the 48 mathematics items that were used in PISA 2006, and of these, 34 were used in PISA 2012. For the science assessment, 14 items were common to PISA 2000 and PISA 2006, and 22 items were common to PISA 2003 and PISA 2006. The science assessment for PISA 2012 consisted of 53 items that were used in PISA 2009 and 2006. All mathematics and reading items in the PISA 2015 assessment instrument were trend items from previous assessments; science items included both trend items and new items developed for 2015. In PISA 2018, all mathematics and science items were trend items, and the reading assessment included 238 new items and 72 trend items.
To establish common reporting metrics for PISA, the difficulty of the link items, measured on different occasions, is compared. Using procedures that are detailed in the PISA 2018 Technical Report, the comparison of item difficulty on different occasions is used to determine a score transformation that allows the reporting of the data for a particular subject on a common scale. The change in the difficulty of the individual link items is used in determining the transformation; as a consequence, the sample of link items that has been chosen will influence the choice of transformation. This means that if an alternative set of link items had been chosen, the resulting transformation would be slightly different. The consequence is an uncertainty in the transformation due to the sampling of the link items, just as there is an uncertainty in values such as jurisdiction means due to the use of a sample of students.
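Schematically, such a transformation places scores from the new calibration onto the established reporting scale through a linear equation whose constants are chosen so that the link items show the same average difficulty on both occasions; the form below is an illustrative summary, not the exact procedure documented in the PISA 2018 Technical Report.

```latex
% Illustrative linear linking transformation: x_new is a score on the new
% calibration, and the constants A and B are determined from the change in
% difficulty of the link items between the two occasions.
x_{\text{linked}} = A \, x_{\text{new}} + B
```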
Future Plans
The next cycle of PISA data collection will take place in 2021.
2 The calculation of response rates described here is based on the formula stated in the international guidelines and is not consistent with NCES standards. A more conservative way to calculate response rates would be to include participating replacement schools in the denominator as well as in the numerator and to add replacement schools that were hard refusals to the denominator. |