The Trends in International Mathematics and Science Study (TIMSS) is an international comparative study of the performance and schooling contexts of fourth- and eighth-grade students in mathematics and science. The 2023 administration marks the eighth cycle of TIMSS, in which mathematics and science assessments and associated questionnaires were administered in 65 education systems at the fourth-grade level and 47 education systems at the eighth-grade level during the spring of 2023 (in the Northern Hemisphere) and the fall of 2022 (in the Southern Hemisphere).
TIMSS is coordinated by the International Association for the Evaluation of Educational Achievement (IEA), with governmental sponsors in each participating country or education system. In the United States, TIMSS is sponsored by the National Center for Education Statistics (NCES), in the Institute of Education Sciences of the U.S. Department of Education. NCES contracted RTI International to conduct the sampling and data collection activities in the United States.
These Methodology and Technical Notes provide an overview, with a focus on the U.S. implementation, of the technical aspects of TIMSS 2023, including sampling, data collection, scoring, data entry and cleaning, weighting, scaling, and statistical procedures.
More detailed information can be found in the TIMSS 2023 international Technical Report (Methods and Procedures) at https://timss2023.org/methods. For a complete explanation of any of these topics, see the TIMSS 2023 U.S. technical report (forthcoming).
To ensure comparability of the TIMSS 2023 data across countries, IEA established a set of detailed international requirements for the various aspects of data collection. The requirements regarding the target populations, sampling design, sample size, exclusions, and defining participation rates are described below.
To identify comparable populations of students to be sampled, IEA defined the international TIMSS 2023 desired target population (the International Target Population) as follows:

- Fourth grade: all students enrolled in the grade that represents 4 years of schooling, counting from the first year of International Standard Classification of Education (ISCED) Level 1,1 provided the mean age at the time of testing is at least 9.5 years; and
- Eighth grade: all students enrolled in the grade that represents 8 years of schooling, counting from the first year of ISCED Level 1, provided the mean age at the time of testing is at least 13.5 years.
Although participating education systems were expected to include all students in the International Target Population, sometimes it was not feasible to include all these students because of geographic or linguistic constraints specific to the country or territory. Thus, each participating education system had its own "national" desired target population (also referred to as the National Target Population), which was the International Target Population reduced by the exclusions of those sections of the population that were not possible to assess. Working from the National Target Population, each participating education system had to operationalize the definition of its population for sampling purposes (i.e., define their national defined target population [referred to as the National Defined Population]). Although each education system's National Defined Population ideally coincides with its National Target Population, there may be additional exclusions (e.g., of regions or school types) due to constraints of operationalizing the assessment (see section on Exclusions).
All mathematics and science teachers who taught the selected students in fourth and eighth grades were also selected to participate. Note that these teachers were not a representative sample of teachers within the country. Rather, they were the mathematics and science teachers who taught a representative sample of students in two grades within the country (fourth and eighth grades in the United States).
It was not feasible to assess every fourth- and eighth-grade student in each education system. Thus, a representative sample of fourth- and eighth-grade students was selected. The sample design employed by the TIMSS assessment is generally referred to as a two-stage stratified cluster sample. The sampling units at each stage were defined as follows: in the first stage, schools2 were sampled from stratified3 school frames with probability proportional to size (PPS); in the second stage, one or more intact classrooms in the target grade were sampled within each participating school.
TIMSS guidelines called for a minimum of 150 schools to be sampled per grade, with a minimum of 4,000 students assessed per grade. The basic sample design of one classroom per target grade per school was designed to yield a total sample of approximately 4,500 students per target grade per population. Education systems with small class sizes or fewer than 30 students per school were directed to consider sampling more schools, more classrooms per school, or both, to meet the minimum target of 4,000 tested students per grade.
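As an illustration of the school-stage selection, the sketch below implements systematic PPS sampling in Python. It is a minimal, hypothetical example (the frame, the enrollment measure of size, and the sample size are invented for illustration), not the operational TIMSS sampling software; among other things, very large schools can be hit more than once, which operational procedures handle separately.

```python
import random

def systematic_pps_sample(frame, n_schools, seed=12345):
    """Systematic PPS sampling from a frame sorted by stratification
    variables; each entry carries an 'enrollment' measure of size."""
    rng = random.Random(seed)
    total = sum(s["enrollment"] for s in frame)
    interval = total / n_schools            # sampling interval on the size scale
    next_hit = rng.uniform(0, interval)     # random start in [0, interval)
    sample, cumulative = [], 0.0
    for school in frame:
        cumulative += school["enrollment"]
        # a school is hit when a selection point falls inside its size segment;
        # its selection probability is approximately n_schools * size / total
        while next_hit < cumulative and len(sample) < n_schools:
            sample.append(school["id"])
            next_hit += interval
    return sample

# hypothetical frame, already sorted within an explicit stratum
frame = [{"id": k, "enrollment": e} for k, e in enumerate([120, 60, 300, 45, 200, 90])]
print(systematic_pps_sample(frame, 2))
```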
All schools and students excluded from the National Defined Population are referred to as the excluded population. Exclusions could occur at the school level, with entire schools being excluded, or within schools, with specific students or entire classrooms excluded. In 2023, some accommodations were made available for students with disabilities and for students who were unable to read or speak the language of the test. The IEA requirement with regard to exclusions is that they should not exceed 5 percent of the National Target Population.
School exclusions. Education systems could exclude schools that

- were geographically remote or inaccessible;
- had very few students;
- had a curriculum or school structure radically different from the mainstream educational system; or
- provided instruction solely to students in the within-school exclusion categories described below (e.g., schools serving only students with special needs).
Within-school exclusions. Education systems were instructed to adopt the following international within-school exclusion rules to define excluded students:

- Students with functional disabilities—students with physical disabilities such that they cannot perform in the TIMSS testing situation;
- Students with intellectual disabilities—students who are considered, in the professional opinion of school staff, unable to follow even the general instructions of the test; and
- Non-native language speakers—students who are unable to read or speak the language of the test and would be unable to overcome the language barrier in the testing situation (typically, students with less than 1 year of instruction in the language of the test).
In order to minimize the potential for response biases, IEA developed participation or response rate standards that apply to all participating education systems and govern both whether a participating education system's data are included in the TIMSS international database and the way in which national statistics are presented in the international reports. These standards were set using composites of response rates at the school, classroom, and student and teacher levels; moreover, response rates were calculated with and without the inclusion of substitute schools.
The response rate standards take the following forms, distinguished primarily by whether the school participation rate of 85 percent was met before or only after the inclusion of replacement schools:

- Category 1: The education system met the participation rate requirements without including replacement schools.
- Category 2: The education system met the participation rate requirements only after including replacement schools.
- Category 3: The education system failed to meet the participation rate requirements even with replacement schools, but its school participation rate was at least 50 percent before the use of replacement schools.
Classrooms with less than 50 percent student participation were considered nonrespondents and were dropped from reporting by the IEA.
Participants satisfying the Category 1 standard are included in the international tabular presentations without annotation. Those able to satisfy only the Category 2 or Category 3 standard are included as well but are annotated to indicate their response rate status.
In the United States and most other participating education systems, the target populations of students corresponded to the fourth and eighth grades. In the United States, no regions or school types were excluded. The samples for the fourth- and eighth-grade data collections were selected using a two-stage stratified cluster design, as explained above in International Requirements for Sampling, Data Collection and Response Rates. In the first stage, schools were sampled using stratified systematic PPS sampling. In the second stage, classrooms were selected from participating schools. All students from selected classrooms were selected for the student sample.
First stage. The NCES 2019–20 Common Core of Data (CCD) and preliminary 2019–20 Private School Universe Survey (PSS) were used to construct fourth- and eighth-grade school sampling frames. The U.S. sampling frames included all schools from the CCD and PSS that reported enrollment in the respective grade, fourth or eighth, in the 50 United States and the District of Columbia.
The schools in each sampling frame were stratified4 by cross-classifying the following characteristics:

- school control (public or private);
- region of the country (Northeast, Midwest, South, or West);5 and
- poverty status (high or low), for public schools.6
This yielded 12 school strata in each frame, consisting of eight public school strata and four private school strata. Schools within each sampling stratum were sorted by implicit stratification variables, including the school's urban-centric locale.7
Schools were selected from each frame using stratified systematic PPS sampling. A sample of 300 schools, each with two replacements, was selected from each frame to meet the target of 250 participating schools in each grade. The schools appearing just after and just before each sampled school in the corresponding sorted sampling frame were selected as the sampled school’s first and second replacement school, respectively. If a sampled school declined to participate, the first replacement school was contacted for recruitment. If the first replacement school declined to participate, was closed, or had no students enrolled in the desired grade, the second replacement school was contacted for recruitment. If a sampled school was found to be closed or had no students enrolled in the desired grade, no replacement schools were contacted for recruitment.
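Because the frame is sorted, the neighbors of a sampled school resemble it on the stratification variables, which is what makes them suitable replacements. A minimal sketch of the assignment rule (hypothetical structure, for illustration only; operational procedures include additional safeguards not shown here):

```python
def assign_replacements(sorted_frame, sampled_index):
    """First replacement: the school just after the sampled school in the
    sorted frame; second replacement: the school just before it."""
    after = sampled_index + 1
    before = sampled_index - 1
    first = sorted_frame[after] if after < len(sorted_frame) else None
    second = sorted_frame[before] if before >= 0 else None
    return first, second
```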
Second stage. Each participating school provided a list of all fourth-grade classes or eighth-grade mathematics classes that accounted for each student in the grade exactly once. At both grade levels, any class with fewer than 15 students was combined with another class before classroom sampling to form a "pseudoclass,"8 so that each unit in the school's classroom sampling frame had at least 20 students. From this list, an equal-probability sample of two classrooms or pseudoclasses was randomly selected from each school. In schools with only one or two classrooms, the classroom(s) was selected with certainty. All students in each selected class were selected for the student sample.
The overall sample design for the United States resulted in an approximately self-weighting sample9 of students, with each fourth- or eighth-grade student having a roughly equal probability of selection. Note that, in large schools, a smaller proportion of the classes (and therefore of the students) was selected, but this lower rate of selecting students in large schools was offset by a larger probability of selection of large schools, as schools are selected with PPS.
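This offset can be seen with a stylized calculation. Suppose school i has enrollment M_i out of a national total M, n schools are drawn with PPS, and c of the school's C_i classrooms (of roughly equal size m̄, so that M_i ≈ C_i m̄) are sampled:

```latex
\[
P(\text{student selected})
  = \underbrace{\frac{n\,M_i}{M}}_{\text{school stage (PPS)}}
    \times
    \underbrace{\frac{c}{C_i}}_{\text{classroom stage}}
  \approx \frac{n\,M_i\,c}{M\,C_i}
  = \frac{n\,c\,\bar{m}}{M}
\]
```

The result does not depend on the particular school: the larger school-stage selection probability is cancelled by the smaller within-school selection rate.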
1 The ISCED was developed by the United Nations Educational, Scientific, and Cultural Organization (UNESCO) to facilitate the comparability of educational levels across countries. ISCED Level 1 begins with the first year of formal, academic learning (see UNESCO ISCED 2011 for more information). In the United States, ISCED Level 1 begins at grade 1.
2 Some sampled schools may be considered ineligible for reasons noted in the Exclusions section below.
3 Explicit strata are mutually exclusive subgroups of schools in the school sampling frame. School samples are selected independently from each explicit stratum. When school samples are selected using a systematic sampling approach, the sorting of schools in explicit strata ensures that school samples are proportionally distributed across the characteristics used to implement the sort. In this situation, the school sample is said to be implicitly stratified by the characteristics used in the sort.
4 The primary purpose of stratification is to improve the precision of the survey estimates. If explicit stratification of the population is used, the units of interest (schools, for example) are sorted into mutually exclusive subgroups—strata. Units in the same stratum are as homogeneous as possible, and units in different strata are as heterogeneous as possible with respect to the characteristics of interest to the survey. Separate samples are then selected from each stratum. In the case of implicit stratification, the units of interest are simply sorted with respect to one or more variables known to have a high correlation with the variable of interest. In this way, implicit stratification ensures that the sample of units selected is spread across the categories of the stratification variables when systematic sampling is used.
5 The Census definitions of region were used. The Northeast region consists of Connecticut, Maine, Massachusetts, New Hampshire, New Jersey, New York, Pennsylvania, Rhode Island, and Vermont. The Midwest region consists of Illinois, Indiana, Iowa, Kansas, Michigan, Minnesota, Missouri, Nebraska, North Dakota, Ohio, South Dakota, and Wisconsin. The South region consists of Alabama, Arkansas, Delaware, District of Columbia, Florida, Georgia, Kentucky, Louisiana, Maryland, Mississippi, North Carolina, Oklahoma, South Carolina, Tennessee, Texas, Virginia, and West Virginia. The West region consists of Alaska, Arizona, California, Colorado, Hawaii, Idaho, Montana, Nevada, New Mexico, Oregon, Utah, Washington, and Wyoming.
6 High-poverty schools are those with 76 percent or more of students eligible for free or reduced-price lunch (FRPL); low-poverty schools are those with less than 76 percent eligible. No FRPL program data were available for private schools; thus, all private schools are considered low-poverty schools. Note that TIMSS 2019 used a threshold of 50 percent of students eligible for FRPL as the cutoff for high-poverty schools. School sample sizes were set approximately proportional to the numbers of students enrolled in schools by region and, for public schools, poverty status. Because fewer students are enrolled in schools at or above the 76 percent threshold than at or above the 50 percent threshold, the 2023 sample included fewer high-poverty schools than it would have under the 50 percent threshold.
7 NCES definitions of locale were used. The four urban-centric locale types are (1) city, which consists of a large, midsize, or small territory inside an urbanized area and inside a principal city; (2) suburban, which consists of a large, midsize, or small territory outside a principal city and inside an urbanized area; (3) town, which consists of a fringe, distant, or remote territory inside an urban cluster; and (4) rural, which consists of a fringe, distant, or remote Census-defined rural territory.
8 Since classrooms are sampled with equal probability within schools, small classrooms would have the same probability of selection as large classrooms. Selecting classrooms under these conditions would likely reduce the student sample size and create some instability in the sampling weights. To avoid these problems, pseudoclassrooms were created for the purposes of classroom sampling, in which small classrooms were joined to reach a larger student count. These pseudoclassrooms were treated as single classes in the class sampling process. Following class sampling, the pseudoclassroom combinations were dissolved and the small classes involved retained their own identities. In this way, data on students, teachers, and classroom practices were linked in small classes in the same way as in larger classes.
9 Self-weighting means that the sampling weights for all sampled students are identical.
TIMSS is an international collaborative effort involving representatives from every country participating in the studies. For TIMSS 2023, the test development effort began with a review and revision of the frameworks used to guide the construction of the assessment. The frameworks were updated to reflect changes in the curriculum and instruction of participating countries and education systems. U.S. and international experts in mathematics and science curriculum, education, and measurement and representatives from national educational centers around the world contributed to the final content of the frameworks. Maintaining the ability to measure change over time was an important factor in revising the frameworks.10
TIMSS 2023 included student assessments in science and mathematics and questionnaires completed by students, principals, and teachers. TIMSS 2023 marks the first fully digital assessment cycle. In the United States, students participated on Chromebook computers attached to a local server. Questionnaires for principals and teachers were self-administered, primarily using an online survey system.
The TIMSS student session at fourth grade and eighth grade included digital assessments in mathematics and science. Assessments were developed using the assessment frameworks, which were updated from the TIMSS 2019 frameworks using an iterative review process. The TIMSS & PIRLS International Study Center worked with the Science and Mathematics Item Review Committee and National Research Coordinators (NRCs) to update the frameworks based on changes in curriculum of participating countries and education systems. Framework revisions also maintained trend analyses by carrying forward some items from previous cycles.
To reduce burden on students while covering a broad set of topic areas, mathematics and science items were assigned to blocks, so that no student encountered all items. At both grade levels, items were assembled into 28 blocks, and these blocks were assigned across 14 booklets using a rotating block design. Each booklet contained two parts, one for each subject, and each part contained two blocks of items.
Students were randomly assigned to one of the 14 booklets and completed both mathematics and science items. Fourth-grade students had 72 minutes to complete the assessment, and eighth-grade students had 90 minutes for the assessment.
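To make the rotation described above concrete, the sketch below builds a booklet design with the stated shape (28 blocks, 14 booklets, two blocks per subject per booklet). The pairing rule used here (block i with block i+1, wrapping around) is a hypothetical illustration of a rotating design, not the actual TIMSS block-to-booklet map.

```python
# 14 mathematics blocks and 14 science blocks: 28 blocks in total
math_blocks = [f"M{i + 1:02d}" for i in range(14)]
science_blocks = [f"S{i + 1:02d}" for i in range(14)]

booklets = []
for b in range(14):
    booklets.append({
        "booklet": b + 1,
        # two blocks per subject part; block i reappears in booklet i+1,
        # so every block is shared by exactly two booklets, which links
        # booklets together for joint calibration
        "math_part": [math_blocks[b], math_blocks[(b + 1) % 14]],
        "science_part": [science_blocks[b], science_blocks[(b + 1) % 14]],
    })

# every block appears in exactly two booklets
assert all(sum(blk in bk["math_part"] for bk in booklets) == 2 for blk in math_blocks)
```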
Students were given a tutorial at the beginning of the session with directions on navigational tools, examples of the types of questions included (e.g., multiple choice, drop-down), and available tools (e.g., calculator).
TIMSS 2023 implemented targeted testing, matching assessment difficulty to student populations through a country-level adaptation of the booklet rotation, in order to improve efficiency and support student engagement. The TIMSS 2023 group adaptive design had three levels of item block difficulty—difficult, medium, and easy—that were combined into two levels of booklet difficulty. Each country administered the entire assessment, but the balance of more difficult and less difficult booklets varied with the mathematics and science achievement level of the students in the country: countries with relatively high achievement received a greater proportion of more difficult booklets, and countries with relatively low achievement received a greater proportion of less difficult booklets. Achievement by country was estimated from performance in prior TIMSS assessments or, for countries participating for the first time, from the field test. For students, the group adaptive design reduced test fatigue and increased engagement and motivation by providing a more manageable testing experience; at the country level, it provided a more reliable estimation of student achievement. Overall, the new design maximized the information obtained from the assessment while limiting changes to the TIMSS assessment design, balancing efficiency and comprehensiveness and providing educators and policymakers with a more accurate understanding of student learning outcomes. More about the group adaptive design in TIMSS 2023 can be found in Chapter 4 of the TIMSS 2023 Assessment Frameworks (https://timssandpirls.bc.edu/timss2023/frameworks/chapter-4.html).
More detail on the distribution of new and trend items for TIMSS 2023 is provided in table 1.
Table 1. Number and percentage of TIMSS 2023 mathematics and science assessment items, by grade and content domain

| Grade and content domain | All items, number | All items, percent | Trend items, number | Trend items, percent | New items, number | New items, percent |
|---|---|---|---|---|---|---|
| Grade 4 |  |  |  |  |  |  |
| Mathematics | 186 | 100 | 107 | 100 | 79 | 100 |
| Number | 96 | 51 | 55 | 51 | 41 | 52 |
| Measurement and Geometry | 50 | 27 | 29 | 27 | 21 | 27 |
| Data | 40 | 22 | 23 | 22 | 17 | 22 |
| Science | 176 | 100 | 97 | 100 | 79 | 100 |
| Life Science | 79 | 45 | 39 | 39 | 40 | 52 |
| Physical Science | 62 | 36 | 39 | 42 | 23 | 27 |
| Earth Science | 35 | 19 | 19 | 18 | 16 | 20 |
| Grade 8 |  |  |  |  |  |  |
| Mathematics | 204 | 100 | 119 | 100 | 85 | 100 |
| Number | 63 | 31 | 33 | 28 | 30 | 35 |
| Algebra | 58 | 28 | 41 | 33 | 17 | 20 |
| Geometry and Measurement | 42 | 22 | 24 | 22 | 18 | 21 |
| Data and Probability | 41 | 20 | 21 | 17 | 20 | 24 |
| Science | 220 | 100 | 127 | 100 | 93 | 100 |
| Biology | 79 | 37 | 40 | 33 | 39 | 42 |
| Chemistry | 43 | 20 | 29 | 23 | 14 | 15 |
| Physics | 53 | 24 | 36 | 27 | 17 | 19 |
| Earth Science | 45 | 20 | 22 | 17 | 23 | 25 |
NOTE: The percentages in this table represent the number of items and not the number of score points. Some constructed-response items are worth more than one score point. Details may not sum to 100 percent because of rounding.
SOURCE: International Association for the Evaluation of Educational Achievement (IEA), Trends in International Mathematics and Science Study (TIMSS), 2023.
Further details on the TIMSS 2023 assessment design can be found in the TIMSS 2023 Assessment Frameworks at https://timssandpirls.bc.edu/timss2023/frameworks/index.html (Mullis, Martin, and von Davier 2021). More details on the TIMSS 2023 item development process can be found in chapter 1 of the TIMSS 2023 international Technical Report (Methods and Procedures) (von Davier, Fishbein, and Kennedy 2024) at https://timss2023.org/methods.
TIMSS 2023 included questionnaires for principals, teachers, and students. TIMSS includes a home questionnaire for parents, but this component was not administered in the United States. All questionnaires were developed from the context questionnaire frameworks11 in an international collaborative process. Staff at the TIMSS & PIRLS International Study Center collaborated with the TIMSS 2023 Questionnaire Item Review Committee (QIRC) and NRCs. The QIRC comprises policy analysis experts from different countries. The questionnaires were field tested, and some items were revised before data collection.
Students completed a 30-minute computer-based questionnaire after completing the assessment. The questionnaire asked students about their school and home lives, including habits and homework, attitudes and beliefs about learning, and their lives both in and outside of school.
To provide contextual data on students, school principals and teachers completed questionnaires. Principals provided input on policy and budget responsibilities, curriculum and instruction issues, and student behavior, as well as descriptions of the organization of schools and courses. Teachers reported on their attitudes and beliefs about teaching and learning, teaching assignments, class size and organization, instructional practices, and participation in professional development activities. For eighth-grade students, a questionnaire was completed by both their mathematics and science teachers. All adult questionnaires were self-administered online.
In addition, TIMSS NRCs provided information on the national contexts for learning through the curriculum questionnaire and their country’s chapter in the TIMSS 2023 Encyclopedia (forthcoming).
The international versions of the TIMSS 2023 student, teacher, school, and home questionnaires are available at https://timssandpirls.bc.edu/timss2023/questionnaires.
The United States administers the student, teacher, school, and curriculum questionnaires but does not administer the home questionnaire. Several questions were adapted to be appropriate in the U.S. educational and cultural context, and several U.S.-specific questions, such as race/ethnicity on the student questionnaires, were added to the international versions of the questionnaires. The U.S. versions of these questionnaires are available at https://nces.ed.gov/timss/questionnaire.asp.
The source versions of assessments and questionnaires were provided in English by the TIMSS & PIRLS International Study Center and translated into the primary languages of instruction in each education system. For the U.S. instruments, some items were adapted for cultural context and U.S.-specific items were added to the questionnaires (e.g., questions about race/ethnicity). All adaptations and national items were reviewed and approved for use by IEA through a multistep process. Items were reviewed via national adaptations forms, and once approved, adaptations were applied to each item and then tested online. The goal of the translation and adaptation process is to ensure that the meaning and difficulty of items remain unchanged.
Further details on the translation and adaptation process can be found in chapter 5 of the TIMSS 2023 international Technical Report (Methods and Procedures) (von Davier, Fishbein, and Kennedy 2024) at https://timss2023.org/methods.
10 Frameworks for the mathematics and science assessments are available in Mullis, Martin, and von Davier 2021 (https://timssandpirls.bc.edu/timss2023/frameworks/).
11 Frameworks for the context questionnaires are available at https://timssandpirls.bc.edu/timss2023/frameworks/. A special framework supplemented the TIMSS 2023 Context Questionnaire Framework with information specific to environmental topics; it is available at https://timssandpirls.bc.edu/timss2023/frameworks/pdf/T23-Environmental-Attitudes-and-Behaviors-Framework.pdf (Reynolds and Komakhidze 2022).
TIMSS 2023 emphasized the use of standardized procedures for all participants. Each participating education system collected its own data based on comprehensive manuals and training materials provided by the international project team to explain the study's implementation, including precise instructions for the work of school coordinators and scripts for test administrators to use in testing sessions.
The recruitment of schools required contacting schools in the sample to solicit their participation in TIMSS 2023. In most states, National Assessment of Educational Progress (NAEP) State Coordinators (NSCs) first obtained approval from the chief state school officer before contacting districts of sampled schools to obtain permission to contact schools. After contacting the district, the NSCs made the initial contact with school principals and let them know that the data collection contractor would contact them to coordinate the study at their school. Following similar procedures as those used by NSCs, the data collection contractors’ recruiters contacted private schools directly about the study as well as schools in states for which NSCs were unable to assist. If a school declined to participate, a substitute school designated by the prescribed international sampling procedure was recruited using the same protocols for district and school notification and recruitment.
Each participating school was asked to nominate a school coordinator (SC) as the main point of contact for the study. The SC worked with the data collection contractor’s recruiters to arrange logistics and liaise with staff, students, and parents as necessary. This included collecting class lists to randomly select classrooms within schools for participation, followed by the collection of student and teacher information to inform participation eligibility and identify accommodation needs.
On the advice of the school, parental permission for students to participate was sought using one of three approaches: a simple notification; a notification with a refusal form; or a notification with a permission form for parents to sign. In each approach, parents were informed that their child could opt out of participating.
Schools, school coordinators, teachers, and students were provided with small gifts of appreciation for their willingness to participate. Generally, schools were offered $200, SCs received $100, teachers received $25, and students were given a pair of neon sunglasses. Certificates of community service were provided to students, and certificates of appreciation were provided to schools. Each participating school was also invited to have up to three staff attend a professional development webinar hosted by the data collection contractor. School staff could choose from one of three webinar topics: (1) engaging students in rigorous and relevant learning experiences; (2) building literacy through STEM; or (3) building resilient learning communities. Staff who attended the webinar received certificates of participation.
Test administration in the United States was conducted by staff hired by the data collection contractor on behalf of NCES and trained in TIMSS data collection procedures according to the TIMSS international guidelines. These test administrators (TAs) began working with each school 2 to 6 weeks before the scheduled test session, confirming logistics for the session, checking on permission form status, and scheduling a presession visit. At the presession visit, TAs inspected the session location; confirmed permission status, exclusions, and accommodations with the SC; and finalized details for the session day.
TIMSS student sessions were completed on Chromebooks with attached keyboards, shipped to the school in advance of the session. The TAs brought all necessary materials for the session, including additional electronic equipment such as the TA laptop and an access point. Rather than using school Wi-Fi, the TAs used the access point and TA laptop to establish a local area network and run the session. Each student Chromebook connected to the TA laptop via the local network, and any data entered on a student Chromebook were stored on the TA laptop. Collected student data were later uploaded via the Internet. Once the Chromebooks were set up and successfully connected to the access point, the TAs opened the TIMSS Player, which loaded the assessment, and separately opened a URL to the student questionnaire on each Chromebook. The TA then assigned Chromebooks to students using the TIMSS login cards. TAs either logged in each student with his/her student ID and password for both the TIMSS Player and the student questionnaire or provided students with their login cards so they could log themselves in. TIMSS login cards were placed on or near each Chromebook to help each student find their assigned workstation and to provide the login information if the student needed to log in again. Scratch paper and a pencil were also provided to students.
After completing each student session according to the procedures outlined in the TIMSS 2023 TA Manual, the TA completed the student tracking form and test administration form for that session and entered them into the data collection contractor’s Field Reporting System (FRS) within 24 hours of the session. TAs also entered notes in the FRS regarding response rates and any instrument issues or other extenuating circumstances that affected the session. TAs reported on the session to their field supervisor within 24 hours. Information from these conversations was used to alert the larger study team to recurring issues and to create additional training points for TAs as needed. If the response rate was less than 90 percent, TAs followed up with the SC and attempted to complete a makeup session to increase the school’s response rate.
Test accommodations were provided by TAs or the school in the TIMSS session. The goal was to provide the test accommodations that students typically receive for their state assessment, local testing, or instructional practice in mathematics and science in accordance with international protocols and the student’s Individualized Education Plan or Section 504 plan. TAs reviewed each student selected for TIMSS 2023 with the school’s SC, and decisions were made for each student individually by the person most knowledgeable about how the student was tested on standardized assessments.
Students were excluded from the test session only if they had physical disabilities that prohibited them from completing the assessment, intellectual disabilities that exempted them from standardized testing, or less than 1 year of instruction in English.
All accommodations or exclusions were recorded by the TAs on classroom forms.
To ensure the quality of the TIMSS 2023 data, a national quality control program (conducted by the data collection contractor on behalf of NCES) and an international quality control program (conducted by IEA) were undertaken to document data collection activities. National Quality Control Monitors (NQCMs) and International Quality Control Monitors (IQCMs) made visits on the day of assessment to about 50 participating schools. A diverse set of schools was selected for observation, including representation from across the country and in both grades.
NRCs were asked to nominate one or more persons unconnected with their national center to serve as IQCMs for their country. The IEA and International Study Center trained all IQCMs on the required procedures for administering TIMSS, the responsibilities of the national centers in conducting the study, and their own roles and responsibilities.
All observers, for both the national and the international quality control monitoring, were instructed not to interfere in the assessment in any way. During each observation, the observer monitored all aspects of the in-school session, including the equipment setup process, how students arrived at and left the session, and how the TA responded to questions from the students. NQCM observers used a version of the NQCM Classroom Observation Record adapted for the United States. After the session, the observer conducted a short interview with the SC using the NQCM or IQCM materials. IQCMs reported back to IEA regarding each country's level of compliance with international standardized procedures.
Chapter 9 of the TIMSS 2023 international Technical Report (Methods and Procedures) describes the success of participating education systems in meeting the international technical standards on data collection (see von Davier, Fishbein, and Kennedy 2024 at https://timss2023.org/methods). Information is provided for the fourth and eighth grades of all participating education systems on their coverage of the target population, school participation rates, student participation rates, and total number of schools and students.
The U.S. TIMSS 2023 fourth-grade weighted school participation rate before adding replacement schools was 63 percent, and the weighted school participation rate after replacement was 82 percent. The fourth-grade classroom participation rate was 99 percent, and the student participation rate was 93 percent. The fourth-grade combined weighted school, classroom, and student participation rate with replacement schools was 76 percent, which met the Category 2 international participation requirement (see the participation rate categories described above). In the international report, results for the U.S. in the fourth-grade tables are annotated with a dagger, which means that participation rate guidelines were met after including replacement schools.
For the eighth grade, the U.S. weighted school participation rate before adding replacement schools was 55 percent and the weighted school participation rate after replacement was 72 percent. The eighth-grade classroom participation rate was 97 percent, and the student participation rate was 90 percent. The eighth-grade combined weighted school, classroom, and student participation rate with replacement schools was 63 percent, which met the Category 3 international participation requirement. The U.S. is annotated with a triple bar in the eighth-grade international report tables, indicating that while overall guidelines for sample participation rates were not met, the U.S. had a school participation rate of at least 50 percent before the use of replacement schools. Thus, U.S. results are included in the international report tables and in the TIMSS 2023 international average.
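The combined rates above are, per IEA convention, the products of the component rates. Recomputing them from the rounded figures reported here illustrates the arithmetic; small discrepancies arise because the official rates are computed from unrounded, weighted components.

```python
# grade 4: school (after replacement) x classroom x student participation
print(0.82 * 0.99 * 0.93)  # ~0.755, reported as 76 percent
# grade 8
print(0.72 * 0.97 * 0.90)  # ~0.629, reported as 63 percent
```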
Participation rate tables including all countries can be found in Appendix B of the IEA’s TIMSS 2023 International Results in Mathematics and Science (von Davier et al. 2024), available at https://doi.org/10.6017/lse.tpisc.timss.rs6460.
The TIMSS 2023 assessment items included both multiple-choice and constructed-response items. A scoring guide was created for each constructed-response item included in the TIMSS assessments. The scoring guides were carefully written and reviewed by NRCs of all participating education systems and other experts as part of the field test of items and were revised accordingly.
Each participating education system was responsible for the scoring of data for that participant, following established guidelines. The U.S. NRC is Lydia Malley of NCES. The NRC and additional staff from each education system attended scoring training sessions held by the TIMSS International Study Center. The training sessions focused on the scoring rubrics employed in TIMSS 2023. Participants in these training sessions were provided with extensive practice in scoring example items over several days. The information on the scoring reliability of TIMSS constructed-response items within each country (within-country reliability scoring), over time (trend reliability scoring), and across countries (cross-country reliability scoring) is presented in chapter 10 of the TIMSS 2023 international Technical Report (Methods and Procedures) (von Davier, Fishbein, and Kennedy 2024) at https://timss2023.org/methods.
The NRC from each participating education system was responsible for that education system's data files. Because the TIMSS 2023 data collection was entirely electronic, no manual data entry was required. The data were collected and uploaded to a secure server at the data collection contractor’s location, where they were stored as data files in a common international format. The data collection contractor used the IEA-supplied data management software given to all participating education systems to conduct a set of data consistency checks and make corrections as needed.
The data were then sent to IEA’s data processing center in Hamburg (referred to as IEA Hamburg) for further review and cleaning. The main purpose of this cleaning was to ensure that all information in the database conformed to the internationally defined data structure. It also ensured that the national adaptations to questionnaires were reflected appropriately in codebooks and documentation and that all variables selected for international comparisons were comparable across education systems.
IEA Hamburg was responsible for checking the data files from each education system, applying standard cleaning rules to verify the accuracy and consistency of the data, and documenting electronically any deviations from the international file structure. Queries arising during this process were addressed to NRCs. In the United States, the NRC, along with the data collection contractor, reviewed the cleaning reports and data almanacs and provided IEA Hamburg with assistance on data cleaning.
With the assessment data, IEA Hamburg subsequently compiled background univariate statistics and preliminary test scores based on classical item analysis and item response theory (IRT). All education systems were provided their univariate and reliability statistics along with data almanacs containing international univariate and item statistics. This sharing allowed countries to review the statistics and data almanacs to ensure the data validity. Once any problems arising from this examination were resolved, sampling weights were produced and IRT-scaled student proficiency scores in mathematics and science were added to the file.
Detailed information on the entire data entry and cleaning process can be found in chapter 8 of the TIMSS 2023 international Technical Report (Methods and Procedures) (von Davier, Fishbein, and Kennedy 2024) at https://timss2023.org/methods.
Before the data were analyzed, responses from the groups of students assessed were assigned sampling weights (as described in the next section) to ensure that their representation in the TIMSS 2023 results matched their actual percentage of the school population in the grade assessed. With these sampling weights in place, the analyses of TIMSS 2023 data proceeded in two phases: scaling and estimation. During the scaling phase, IRT procedures were used to estimate the measurement characteristics of each assessment question. During the estimation phase, the results of the scaling were used to produce estimates of student achievement. Subsequent conditioning procedures used the background variables collected by TIMSS to limit bias in the achievement results.
Responses from the groups of students were assigned sampling weights to adjust for over- or underrepresentation of a particular group during sampling. The use of sampling weights is necessary to compute sound, nationally representative estimates. The weight assigned to a student's responses is the inverse of the probability that the student is selected for the sample. When responses are weighted, none are discarded, and each contributes to the results for the total number of students represented by the individual student assessed. Weighting also adjusts for various situations (such as school and student nonresponse) because data cannot be assumed to be randomly missing. The international weighting procedures do not include a poststratification adjustment. Weights for responding schools, for a particular grade, were adjusted so that the responding schools were representative of the types of schools that were sampled for that grade. Weights for responding students, for a particular grade, were adjusted so that responding students within a school were representative of all students enrolled in the same grade in that school. All TIMSS analyses, from TIMSS 1995 through TIMSS 2023, are conducted using sampling weights. A detailed description of this process is provided in chapter 3 of the TIMSS 2023 international Technical Report (Methods and Procedures) (von Davier, Fishbein, and Kennedy 2024) at https://timss2023.org/methods.
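A minimal sketch of the two weighting ideas described above—base weights as inverse selection probabilities, and a simple nonresponse adjustment that redistributes weight from nonrespondents to respondents—is shown below. It is illustrative only; the operational TIMSS weighting involves separate school-, classroom-, and student-level components and adjustment cells.

```python
def base_weight(p_school, p_class, p_student=1.0):
    """Design (base) weight: the inverse of the overall selection probability."""
    return 1.0 / (p_school * p_class * p_student)

def nonresponse_adjust(weights, responded):
    """Redistribute the weight carried by nonrespondents to respondents
    within an adjustment cell, preserving the cell's total weight."""
    total = sum(weights)
    responding_total = sum(w for w, r in zip(weights, responded) if r)
    factor = total / responding_total
    return [w * factor for w, r in zip(weights, responded) if r]

# hypothetical: a student reached via a 1-in-50 school draw and a 1-in-2 classroom draw
print(base_weight(p_school=1 / 50, p_class=1 / 2))  # 100.0 students represented
```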
In TIMSS, the propensity of students to answer questions correctly was estimated with a two-parameter IRT model for dichotomously scored constructed-response items, a three-parameter IRT model for multiple-choice items, and a generalized partial credit model for polytomous constructed-response items.
The scale scores assigned to each student were estimated using a procedure described in the Plausible Values section, with input from the IRT results.
With IRT, the difficulty of each item, or item category, is deduced using information about how likely it is for students to get some items correct (or to get a higher rating on a constructed-response item) versus other items. Once the parameters of each item are determined, the ability of each student can be estimated even when different students have been administered different items. This calibration of item parameters allowed for comparability of all students who each completed a subset of the assessment booklets. At this point in the estimation process, achievement scores are expressed in a standardized logit scale that ranges from -4 to +4. In order to make the scores more meaningful and to facilitate their interpretation, the scores for the first year (1995) were transformed to a scale with a mean of 500 and a standard deviation of 100. Subsequent waves of assessment are linked to this metric (as described below).
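The sketch below illustrates the two ingredients described above: an item response function linking ability to the probability of a correct response, and the linear transformation to the reporting metric. The three-parameter form shown is one of the model types used in scaling; the parameter values are hypothetical.

```python
import math

def p_correct_3pl(theta, a, b, c):
    """Probability that a student with ability theta (logit scale) answers an
    item correctly, given discrimination a, difficulty b, and guessing c."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def to_reporting_metric(theta, mean_base, sd_base):
    """Linear transformation of logit-scale ability, chosen so that the base
    (1995) calibration has mean 500 and standard deviation 100."""
    return 500.0 + 100.0 * (theta - mean_base) / sd_base

print(round(p_correct_3pl(theta=0.5, a=1.2, b=0.0, c=0.2), 2))  # 0.72
```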
To make scores from the second (1999) wave of TIMSS data comparable to those from the first (1995) wave, two steps were necessary. First, the 1995 and 1999 data for countries and education systems that participated in both years were scaled together to estimate item parameters using common items administered in both the 1995 and 1999 assessments. Ability estimates for all students (those assessed in 1995 and those assessed in 1999) were then estimated based on the jointly calibrated item parameters. To put these jointly calibrated 1995 and 1999 scores on the 1995 metric, a linear transformation was applied so that the jointly calibrated 1995 scores had the same mean and standard deviation as the original 1995 scores. Such a transformation also preserves any differences in average scores between the 1995 and 1999 waves of assessment.
In order for scores resulting from subsequent waves of assessment (2003, 2007, 2011, 2015, 2019, and 2023) to be made comparable to 1995 scores (and to each other), the two steps above were applied sequentially for each pair of adjacent waves of data: two adjacent years of data were jointly scaled, then resulting ability estimates were linearly transformed so that the mean and standard deviation of the prior year was preserved. As a result, the transformed 2023 scores were comparable to all previous waves of the assessment and longitudinal comparisons between all waves of data are meaningful.
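A minimal sketch of the linking step, under the simplifying assumption that the transformation is fully determined by matching the mean and standard deviation of the prior wave's jointly calibrated scores to their originally reported values:

```python
import statistics

def link_to_trend_metric(joint_prior, joint_current, reported_mean, reported_sd):
    """Find the linear transform that returns the jointly calibrated prior-wave
    scores to their originally reported mean and SD, then apply the same
    transform to the current wave so both sit on the trend metric."""
    m = statistics.mean(joint_prior)
    s = statistics.stdev(joint_prior)
    slope = reported_sd / s
    intercept = reported_mean - slope * m
    return [slope * x + intercept for x in joint_current]
```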
To facilitate the joint calibration of scores from adjacent years of assessment, common test items are included in successive administrations. This also enables the comparison of item parameters (difficulty and discrimination) across administrations. If item parameters change dramatically across administrations, they are dropped from the current assessment so that scales can be more accurately linked across years. In this way, even if the average ability levels of students in countries and education systems participating in TIMSS change over time, the scales still can be linked across administrations.
To keep student burden to a minimum, TIMSS purposefully administered a limited number of assessment items to each student—too few to produce accurate individual content-related scale scores for each student. The number of assessment items administered to each student, however, was sufficient to produce accurate group content-related scale scores for subgroups of the population. These scores were transformed during the scaling process into plausible values to characterize students’ possible performance in the assessment, given their background characteristics. Plausible values were imputed values and not test scores for individuals in the usual sense. If used individually, they provide potentially inaccurate estimates of the proficiencies of individual students. However, when grouped as intended, plausible values provide objective estimates of population characteristics (e.g., means and variances for groups).
Plausible values represented what the performance of an individual on the entire assessment might have been, had it been observed. They were estimated as random draws (usually five) from an empirically derived distribution of score values based on the student's observed responses to assessment items and on background variables. Each random draw from the distribution was considered a representative value from the distribution of potential scale scores for all students in the sample who had similar background characteristics and similar patterns of item responses. Differences between plausible values drawn for a single individual quantified the degree of error (the width of the spread) in the underlying distribution of possible scale scores that could have caused the observed performances.
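When grouped as intended, each plausible value yields its own estimate of a population statistic, and the final estimate averages over them. A minimal sketch, assuming the usual five plausible values per student and a weighted group mean as the target statistic:

```python
def pv_group_mean(pv, weights):
    """pv[i][j]: plausible value j for student i; weights[i]: sampling weight.
    Estimate the weighted mean once per plausible value, then average the
    per-PV estimates to obtain the final group estimate."""
    n_pv = len(pv[0])
    total_w = sum(weights)
    per_pv = [
        sum(w * row[j] for w, row in zip(weights, pv)) / total_w
        for j in range(n_pv)
    ]
    return sum(per_pv) / n_pv
```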
An accessible treatment of the derivation and use of plausible values can be found in Beaton and González (1995). More detailed information can be found in the TIMSS 2023 international Technical Report (Methods and Procedures) (von Davier, Fishbein, and Kennedy 2024) at https://timss2023.org/methods.
International benchmarks for achievement were developed to provide a concrete interpretation of what the scores on the TIMSS mathematics and science achievement scales mean (for example, what it means to have a scale score of 555 or 480).
To describe student performance at various points along the TIMSS mathematics and science achievement scales, TIMSS used scale anchoring. Scale anchoring involves selecting benchmarks (scale points) on the TIMSS achievement scales to be described in terms of student performance. The TIMSS scale has a range of 0 to 1,000 with student performance typically ranging between 300 and 700. TIMSS has identified four points along the achievement scales to use as international benchmarks of achievement—Advanced International Benchmark (625), High International Benchmark (550), Intermediate International Benchmark (475), and Low International Benchmark (400). Starting with TIMSS 2003, at each TIMSS cycle, a scale anchoring analysis was conducted to describe student competencies at these TIMSS international benchmarks. The analysis began with identifying items that students scoring at the anchor points (the international benchmarks) could answer correctly. The content of these items described what students at each benchmark level of achievement knew and could do. To interpret the content of anchored items, these items were grouped by content area within benchmarks and reviewed by mathematics and science experts. These experts focused on the content of each item and described the kind of knowledge, skill, or reasoning demonstrated by students answering the item correctly. The experts then provided a summary description of performance at each anchor point leading to a content-referenced interpretation of the achievement results.
IEA’s TIMSS 2023 International Results in Mathematics and Science (von Davier et al. 2024), available at https://doi.org/10.6017/lse.tpisc.timss.rs6460, summarizes what students who reached each of the TIMSS international benchmarks in 2023 could do in exhibit 1.1.3 for fourth-grade mathematics, exhibit 2.1.3 for fourth-grade science, exhibit 3.1.3 for eighth-grade mathematics, and exhibit 4.1.3 for eighth-grade science. A detailed description of the scale anchoring procedures is provided in chapter 15 of the TIMSS 2023 international Technical Report (Methods and Procedures) (von Davier, Fishbein, and Kennedy 2024) at https://timss2023.org/methods.
As with any study, there are limitations to TIMSS data that researchers should consider. Estimates produced using data from TIMSS are subject to two types of error—nonsampling and sampling errors. Nonsampling errors can be due to errors made in collecting and processing data. Sampling errors can occur because the data were collected from a sample rather than a complete census of the population.
Nonsampling error is a term used to describe variations in the estimates that may be caused by population coverage limitations, nonresponse bias, measurement error, and data collection, processing, and reporting procedures. Typical sources of nonsampling error include unit and item nonresponse, differences in respondents' interpretations of the meaning of the survey questions, response differences related to the particular time the survey was conducted, and mistakes in data preparation.
Missing data for background questions, administrative data, and assessment items were identified by separate missing data codes: omitted, not administered, and not applicable. The assessment items also included a missing code for not reached. An item was considered omitted if the respondent was expected to answer the item but no response was given (e.g., no box was checked in the item that asked, “Are you a girl or a boy?”). Items with invalid responses (e.g., multiple responses to a question calling for a single response) were also coded as omitted. The not administered code was used to identify items not administered to the student, teacher, or principal (e.g., those items excluded from the student's test booklet because of the booklet design, which rotates assessment blocks across booklets). An item was coded as not applicable when it was not logical for the respondent to answer the question (e.g., when the opportunity to make the response was dependent on a filter question). Finally, items that were not reached were identified by a string of consecutive items without responses continuing through to the end of the assessment.
The three key reporting variables identified in the TIMSS data for the United States—sex, race/ethnicity, and percentage of students in the school eligible for FRPL—all have low rates of missing responses. The response rates for these variables exceed the NCES standard of 85 percent and so can be reported without notation. For a complete explanation of any of these topics, see the TIMSS 2023 U.S. technical report (forthcoming).
Sampling errors arise when a sample of the population, rather than the whole population, is used to estimate some statistic. Different samples from the same population would likely produce somewhat different estimates of the statistic in question. This means that there is a degree of uncertainty associated with statistics estimated from a sample. This uncertainty is referred to as sampling variance and is usually expressed as the standard error of a statistic estimated from sample data. The approach used for calculating standard errors in TIMSS was jackknife repeated replication (JRR). Standard errors can be used as a measure of the precision expected from a particular sample. For estimates that are not based on plausible values, the estimates of standard errors were based entirely on sampling variance.
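A minimal sketch of the JRR computation: the statistic is re-estimated once per replicate (each replicate perturbs the weights of one pair of sampled units), and the sampling variance is the sum of squared deviations of the replicate estimates from the full-sample estimate. Replicate construction is omitted here; chapter 14 of the international technical report gives the exact scheme.

```python
import math

def jrr_standard_error(full_estimate, replicate_estimates):
    """JRR sampling variance: the sum over replicates of the squared deviation
    of each replicate estimate from the full-sample estimate."""
    variance = sum((r - full_estimate) ** 2 for r in replicate_estimates)
    return math.sqrt(variance)
```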
For statistics reporting student achievement, which are based on plausible values, standard errors were calculated based on two components. The first reflected the uncertainty due to sampling variance, which is described above. The second is known as imputation variance and reflected uncertainty because students’ achievement estimates were inferred from their observed performance on a subset of achievement items and other achievement-related information. This variance component reflected the posterior variance of the achievement variables given all available information used in the achievement imputation model described in chapter 11 of the TIMSS 2023 international Technical Report (Methods and Procedures) (von Davier, Fishbein, and Kennedy 2024) at https://timss2023.org/methods.
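Combining the two components follows the usual rules for multiply imputed data; a minimal sketch with M plausible values (the exact TIMSS computation is given in the chapter cited above):

```python
def total_variance(sampling_variance, pv_estimates):
    """Total variance of a plausible-value statistic: sampling variance plus
    (1 + 1/M) times the variance among the M per-PV estimates."""
    m = len(pv_estimates)
    mean_est = sum(pv_estimates) / m
    imputation_variance = sum((e - mean_est) ** 2 for e in pv_estimates) / (m - 1)
    return sampling_variance + (1 + 1 / m) * imputation_variance
```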
Standard errors for all the reported estimates in the TIMSS 2023 U.S. results are included in downloadable Excel tables that are provided in the For More Information section at the bottom of each figure.
Although not presented in the report, confidence intervals provide another way to make inferences about population statistics in a manner that reflects the sampling error associated with the statistic. The intervals are calculated with a set confidence level, which defines how frequently the interval would contain the population value if all possible samples of the same design were drawn from the same population. All TIMSS significance tests presented in this report use a p value of .05, which corresponds to a 95 percent confidence level. The endpoints of a 95 percent confidence interval can be calculated from the sample mean and its standard error: the lower endpoint equals the mean minus 1.96 times the standard error, and the upper endpoint equals the mean plus 1.96 times the standard error.
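For example, with a hypothetical average score of 512 and a standard error of 3.4:

```python
mean, se = 512.0, 3.4                         # hypothetical estimate and standard error
ci_95 = (mean - 1.96 * se, mean + 1.96 * se)
print(ci_95)                                  # (505.336, 518.664)
```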
Details of estimating standard errors in the TIMSS 2023 results are provided in chapter 14 of the TIMSS 2023 international Technical Report (Methods and Procedures) (von Davier, Fishbein, and Kennedy 2024) at https://timss2023.org/methods.
All TIMSS 2023 participants were assured that their data would be kept private, and data security and confidentiality were maintained throughout all phases of the study, including data collection, data creation, data dissemination, data analysis, and reporting. All students, teachers, and schools participating in TIMSS 2023 did so with the assurance that their identities would not be disclosed. All employees handling the data signed affidavits of data confidentiality. The names of schools, students, and teachers were removed from the TIMSS questionnaires and assessment booklets by the field staff, who either physically cut out or blacked out the names on the corresponding TIMSS forms. Computer-generated school, student, and teacher IDs replaced school, student, and teacher names. Please refer to the TIMSS 2023 U.S. technical report (forthcoming) for more detailed information on confidentiality and disclosure procedures.
Potential disclosure can occur when the released data are compared against publicly available data collections that contain similar demographic information. Statistical disclosure control measures implemented on the TIMSS national data included identifying and masking potential disclosure risks for TIMSS schools and adding an additional measure of uncertainty of school, teacher, and student identification through random data swapping.12 All procedures were carefully conducted and reviewed by NCES to ensure the protection of respondent privacy while preserving the integrity of the data.
Because of increasing security concerns from the international community, an additional data confidentiality measure was implemented by IEA Hamburg for TIMSS 2023. School IDs were scrambled for each education system’s participating schools as a security measure for deidentification. This procedure puts an additional safeguard in place that makes it difficult to trace collected information back to the source. Because of the hierarchical ID naming convention employed by IEA Hamburg, the scrambling of the school IDs also affected the teacher and student IDs, thus making these IDs scrambled as well.
In accordance with NCES Standard 4-2, confidentiality analyses for the United States were implemented to provide reasonable assurance that public-use data files issued by IEA and NCES would minimize the risk of disclosure of individual U.S. schools, teachers, or students.
12 NCES Standards 4-2-1 through 4-2-12 (Seastrom 2014; https://nces.ed.gov/statprog/2012/) provide the guidelines and methodology required to ensure data confidentiality in data dissemination. Perturbation disclosure limitation techniques were applied to protect individually identifiable data. For public-use data files, NCES requires analysis and subsequent perturbations that minimize the possibility of a user matching outliers or unique cases on the file with external (or auxiliary) data sources. Because public-use files allow direct access to individual records, perturbation (such as random data swapping) and coarsening disclosure limitation techniques may both be required (Standard 4-2-8).
Comparisons made in the text of the TIMSS 2023 U.S. Results were tested for statistical significance. For example, in the commonly made comparison of other education systems' averages against the U.S. average, tests of statistical significance were used to establish whether the observed differences from the U.S. average were statistically significant. Estimating the standard errors required for these tests was complicated by the complex sample and assessment designs, each of which generated error variance; together they required a set of statistically complex procedures to estimate the correct standard errors. As a result, the estimated standard errors contained a sampling variance component estimated by the jackknife repeated replication (JRR) procedure, and any estimates involving test scores contained an additional imputation variance component arising from the scaling methodology.
In almost all instances, the tests for significance used were standard t tests. These fell into two categories according to the nature of the comparison being made: comparisons of independent samples and comparisons of nonindependent samples. Some background on the two types of comparisons follows with descriptions of the t tests used.
The variance of a difference is equal to the sum of the variances of the two initial variables minus two times the covariance between the two initial variables. A sampling distribution has the same characteristics as any distribution, except that its units consist of sample estimates rather than observations. Therefore,

$$\sigma^2_{(\hat{\theta}_1 - \hat{\theta}_2)} = \sigma^2_{\hat{\theta}_1} + \sigma^2_{\hat{\theta}_2} - 2\,\mathrm{cov}(\hat{\theta}_1, \hat{\theta}_2)$$

that is, the sampling variance of a difference is equal to the sum of the two initial sampling variances minus two times the covariance between the two sampling distributions of the estimates.
For example, to determine whether females' performance differs from males' performance, a null hypothesis must be tested, as for all statistical analyses. In this particular example, it consists of computing the difference between the males' performance mean and the females' performance mean (or the inverse). The null hypothesis is

$$H_0: \mu_{males} - \mu_{females} = 0$$

To test this null hypothesis, the standard error of this difference is computed and then compared to the observed difference. The respective standard errors of the mean estimates for males and females can be easily computed.
The expected value of the covariance is equal to 0 if the two sampled groups are independent. If the two groups are not independent, as is the case with males and females attending the same schools within an education system or when comparing an education system's mean with the international mean that includes that particular country, the expected value of the covariance might differ from 0.
In TIMSS 2023, participating education systems' samples were independent. Therefore, for any comparison between two education systems, the expected value of the covariance was equal to 0, and thus the standard error of the estimate was

$$se_{(\hat{\theta}_1 - \hat{\theta}_2)} = \sqrt{se^2_{\hat{\theta}_1} + se^2_{\hat{\theta}_2}}$$

with $\theta$ being the tested statistic.
Within a particular education system, any subsamples were considered independent only if the categorical variable used to define the subsamples was used as an explicit stratification variable. Therefore, as for any computation of a standard error in TIMSS 2023, replication methods using the supplied replicate weights were used to estimate the standard error of a difference. Use of the replicate weights implicitly incorporated the covariance between the two estimates into the estimate of the standard error of the difference.
Thus, in simple comparisons of independent averages, such as the U.S. average with other education systems' averages, the following formula was used to compute the t statistic:

$$t = \frac{est_1 - est_2}{\sqrt{se_1^2 + se_2^2}}$$

where $est_1$ and $est_2$ are the estimates being compared (e.g., the average of education system A and the U.S. average), and $se_1$ and $se_2$ are the corresponding standard errors of these averages.
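As a hedged illustration, this computation can be sketched in a few lines; the function name and values are hypothetical, not actual TIMSS 2023 results:

```python
import math

def t_independent(est1: float, se1: float, est2: float, se2: float) -> float:
    """t statistic for comparing two independent estimates
    (e.g., an education system's average vs. the U.S. average)."""
    return (est1 - est2) / math.sqrt(se1**2 + se2**2)

# Illustrative values only, not actual TIMSS 2023 results.
t = t_independent(est1=527.0, se1=4.2, est2=515.0, se2=2.9)
print(round(t, 2))      # 2.35
print(abs(t) > 1.96)    # True -> difference significant at p < .05
```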
When comparing differences of nonindependent groups (e.g., when comparing the average scores of males with those of females within the United States), the following formula was used to compute the t statistic:

$$t = \frac{est_{grp1} - est_{grp2}}{se(est_{grp1} - est_{grp2})}$$

where $est_{grp1}$ and $est_{grp2}$ are the nonindependent group estimates being compared, and $se(est_{grp1} - est_{grp2})$ is the standard error of the difference, calculated using a JRR procedure that accounts for any covariance between the estimates for the two nonindependent groups.
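The JRR standard error of the difference itself can be sketched as follows. This is a simplified illustration under stated assumptions: all names and inputs are hypothetical, and the measurement variance from plausible values is omitted for brevity:

```python
import math

def jrr_se_of_difference(diff_full: float, diff_replicates: list[float]) -> float:
    """JRR standard error of a difference between two nonindependent group
    estimates (e.g., males vs. females within an education system).
    diff_full: difference computed with the full student weight.
    diff_replicates: the same difference recomputed under each replicate
    weight (150 in TIMSS 2023).
    The factor of 1/2 reflects TIMSS's two sampled PSUs per stratum; the
    measurement variance from plausible values is omitted here.
    """
    sampling_var = 0.5 * sum((d - diff_full) ** 2 for d in diff_replicates)
    return math.sqrt(sampling_var)

# The t statistic then follows the formula above:
# t = (est_grp1 - est_grp2) / jrr_se_of_difference(...)
```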
As required by the NCES statistical standards, all estimates presented in the figures and tables had their coefficient of variation calculated. The coefficient of variation is defined as the ratio of the standard error of an estimate to the estimate, expressed as a percentage. If the coefficient of variation exceeded 30 percent but was less than or equal to 50 percent, the estimate was flagged with an exclamation point (!), which indicates an unstable estimate that should be interpreted with caution. If the coefficient of variation exceeded 50 percent, the estimate was flagged with a pair of exclamation points (!!), which indicates an unstable estimate with a high coefficient of variation that should also be interpreted with caution.
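The flagging rule can be expressed compactly; the sketch below is illustrative only, and the function name is hypothetical:

```python
def cv_flag(estimate: float, standard_error: float) -> str:
    """Return the NCES reporting flag implied by the coefficient of variation."""
    cv = abs(standard_error / estimate) * 100  # CV as a percentage
    if cv > 50:
        return "!!"   # highly unstable estimate; interpret with caution
    if cv > 30:
        return "!"    # unstable estimate; interpret with caution
    return ""         # no flag needed

print(cv_flag(10.0, 3.5))   # "!"  (CV = 35 percent)
print(cv_flag(10.0, 6.0))   # "!!" (CV = 60 percent)
print(cv_flag(10.0, 1.0))   # ""   (CV = 10 percent)
```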
In some cases, differences between scores, or gaps, were compared to determine whether the difference between two gaps was statistically significant. If the difference between two estimates was not statistically significant based on a p value of .05, the difference was suppressed and flagged with a diamond (◊), which indicates that the difference should not be considered statistically significant.
If a cell within a data table represented a subgroup of fewer than 62 students, there were not enough observations to report a reliable estimate for that cell. A double dagger (‡) was instead listed within the cell to indicate that the value did not meet reporting standards.
Within an education system, the mean 90th–10th percentile gap was calculated as follows:

$$\overline{gap} = \frac{1}{n}\sum_{i=1}^{n} gap_{i,0}$$

where $i$ equals the plausible value, $n$ equals the maximum number of plausible values, and $gap_{i,0}$ equals the 90th–10th percentile gap calculated for plausible value $i$ and the student base weight. For TIMSS 2023, the value of $n$ is equal to 5.
The standard error associated with the mean 90th–10th percentile gap was calculated as follows:

$$se(\overline{gap}) = \sqrt{Var_{samp} + \frac{n+1}{n}\,Var_{meas}}$$

where $Var_{samp}$ equals the sampling variance, $Var_{meas}$ equals the measurement variance, and $n$ equals the maximum number of plausible values. For TIMSS 2023, the value of $n$ is equal to 5.
The formula for calculating the sampling variance is as follows:

$$Var_{samp} = \frac{1}{n}\sum_{i=1}^{n}\left[\frac{1}{2}\sum_{j=1}^{m}\left(gap_{i,j} - gap_{i,0}\right)^{2}\right]$$

where $i$ equals the plausible value, $n$ equals the maximum number of plausible values, $j$ equals the replicate weight, $m$ equals the maximum number of replicate weights, $gap_{i,j}$ equals the 90th–10th percentile gap calculated for plausible value $i$ and student replicate weight $j$, and $gap_{i,0}$ equals the 90th–10th percentile gap calculated for plausible value $i$ and the student base weight. For TIMSS 2023, the value of $n$ was equal to 5, and the value of $m$ was equal to 150. The factor of $\frac{1}{2}$ is included because TIMSS uses two primary sampling units per stratum. The formula for calculating the measurement variance is as follows:
$$Var_{meas} = \frac{1}{n-1}\sum_{i=1}^{n}\left(gap_{i,0} - \overline{gap}\right)^{2}$$

where $i$ equals the plausible value, $n$ equals the maximum number of plausible values, $gap_{i,0}$ equals the 90th–10th percentile gap calculated for plausible value $i$ and the student base weight, and $\overline{gap}$ is the overall mean 90th–10th percentile gap. For TIMSS 2023, the value of $n$ is equal to 5.
For each assessment (Grade 4 Mathematics, Grade 4 Science, Grade 8 Mathematics, and Grade 8 Science), a general outline of the process used to calculate the mean and standard error of the 90th–10th percentile gap follows. An education system's 10th and 90th percentile scores and their differences were calculated for each combination of assessment plausible value (5 in total) and student weight (151 in total, including the student base weight and the 150 replicate weights). For each plausible value, the 90th–10th percentile difference calculated with the student base weight was compared to the 90th–10th percentile difference calculated with each replicate weight, and the differences between the two values were squared and summed.
The five percentile differences calculated using the student base weight for each of the assessment plausible values were averaged to get the overall mean percentile difference. An overall measure of measurement (imputation) variance was calculated by summing the squared differences between the overall mean percentile difference and the five percentile differences used in its calculation and dividing that sum by 4. The reported standard error was then calculated by taking the square root of the sum of the sampling variance and the product of 1.2 times the measurement variance. The value 1.2 is the ratio of the number of plausible values plus 1 to the number of plausible values ((5 + 1)/5 = 6/5 = 1.2).
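Putting the pieces together, a simplified sketch of the full computation follows. All names are hypothetical, and the weighted-percentile convention used here (the first value whose cumulative weight share reaches the target percentile) is an assumption, one of several reasonable choices; operational analyses would typically use the IEA IDB Analyzer (IEA 2022):

```python
import numpy as np

def gap_and_se(scores: np.ndarray, base_w: np.ndarray, rep_w: np.ndarray):
    """Mean 90th-10th percentile gap and its standard error from plausible
    values and replicate weights, following the outline above.
    scores: (n_pv, n_students) plausible values (n_pv = 5 in TIMSS 2023).
    base_w: (n_students,) student base weights.
    rep_w:  (n_rep, n_students) replicate weights (n_rep = 150 in TIMSS 2023).
    """
    def gap(pv: np.ndarray, w: np.ndarray) -> float:
        # Weighted 90th and 10th percentiles via the weighted ECDF
        # (one simple convention; other definitions exist).
        order = np.argsort(pv)
        cum = np.cumsum(w[order]) / w.sum()
        p10 = pv[order][np.searchsorted(cum, 0.10)]
        p90 = pv[order][np.searchsorted(cum, 0.90)]
        return p90 - p10

    n_pv = scores.shape[0]
    gaps0 = np.array([gap(pv, base_w) for pv in scores])   # gap_{i,0}
    mean_gap = gaps0.mean()                                # overall mean gap

    # Sampling variance: JRR over replicate weights, averaged over plausible
    # values, with the factor of 1/2 for two PSUs per stratum.
    samp_var = np.mean([
        0.5 * sum((gap(pv, w) - g0) ** 2 for w in rep_w)
        for pv, g0 in zip(scores, gaps0)
    ])

    # Measurement (imputation) variance across plausible values.
    meas_var = np.sum((gaps0 - mean_gap) ** 2) / (n_pv - 1)

    # Total standard error; (n_pv + 1) / n_pv = 1.2 when n_pv = 5.
    se = np.sqrt(samp_var + (n_pv + 1) / n_pv * meas_var)
    return mean_gap, se
```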
In addition to the international response rate standards described in the prior section, the U.S. sample had to meet NCES statistical standards, which call for a nonresponse bias analysis to be conducted on any sample with a response rate below 85 percent. Because the U.S. TIMSS 2023 weighted school response rates before replacement were 62.6 percent and 55.4 percent for the fourth- and eighth-grade school samples, respectively, NCES required an investigation into the potential magnitude of nonresponse bias at the school level in the United States for both samples. Because the U.S. TIMSS 2023 weighted student response rates were 92.7 percent and 90.1 percent for the fourth- and eighth-grade student samples, respectively, a nonresponse bias analysis at the student level was not required.
This section provides a summary of the nonresponse bias analyses at the school level for the U.S. TIMSS 2023 samples at fourth and eighth grades. The analyses are fully explained in the TIMSS 2023 U.S. technical report (forthcoming).
Two nonresponse bias analyses were conducted, one for each grade. The general approach for each involved an analysis in three parts. The first part compared the distribution of the participating original school sample with that of the total eligible original school sample. The second part compared the distribution of the participating final school sample, which included participating substitutes used as replacements for nonresponding schools from the eligible original sample, with that of the total eligible final school sample. In these first two parts, schools were weighted by their size-adjusted13 school base weights. The third part repeated the second comparison, except that the participating final schools, when analyzed alone, were weighted using size-adjusted school base weights that were further adjusted for nonresponse.
The following categorical school frame characteristics were included in the analysis: school control (public and private), locale (city, suburban, town, and rural), census region (Northeast, Midwest, South, and West), and poverty level (high and low). Additionally, the following continuous school frame characteristics were included in the analysis: estimated number of fourth-grade or eighth-grade students enrolled, total number of students enrolled, percentage of students in seven race/ethnicity categories (American Indian or Alaska Native, non-Hispanic; Asian, non-Hispanic; Black, non-Hispanic; Hispanic; Native Hawaiian or Pacific Islander, non-Hispanic; White, non-Hispanic; and two or more races, non-Hispanic), and percentage of students eligible to participate in the FRPL program (available for public schools only).
Two types of analyses were conducted. First, a bivariate analysis compared frame characteristics for participating schools with those for the total eligible school sample. Second, logistic regression models provided a multivariate analysis that examined the conditional independence of these school characteristics as predictors of participation. There were three models in the multivariate analysis. In the first model, each of the race/ethnicity categories was included, with “White, non-Hispanic” as the omitted category. In the second model, the summed percentage across the race/ethnicity categories replaced the six separate race/ethnicity variables, again with “White, non-Hispanic” as the omitted category. In the third model, public schools were modeled separately, using the summed race/ethnicity percentage plus the percentage of students eligible for FRPL. The multivariate regression analysis could not be conducted after the school nonresponse adjustments were applied to the weights: nonresponse-adjusted weights are not defined for nonresponding units, so respondents could not be compared with nonrespondents using those weights.
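A hedged sketch of such a multivariate model follows, assuming a hypothetical school-frame file and column names; the actual analysis used the frame characteristics listed above, and design-based standard errors would require replication methods rather than the model-based ones shown here:

```python
import pandas as pd
import statsmodels.api as sm

# Hypothetical file: one row per eligible sampled school, with frame
# characteristics, a participation indicator, and size-adjusted base weights.
frame = pd.read_csv("school_frame.csv")

y = frame["participated"]                       # 1 = participated, 0 = did not
X = sm.add_constant(frame[[
    "private", "northeast", "midwest", "west",  # dummies; South omitted
    "total_enrollment", "pct_black", "pct_asian"  # example frame characteristics
]])

# Weighted logistic regression of participation on school characteristics.
model = sm.GLM(y, X, family=sm.families.Binomial(),
               freq_weights=frame["size_adj_base_weight"])
result = model.fit()
print(result.summary())  # coefficients flag characteristics predicting participation
```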
For TIMSS Grade 4 original sample schools, seven characteristics were found to be statistically significantly related to participation in the bivariate analysis: school control; census region; percentage of American Indian or Alaska Native, non-Hispanic students; percentage of Asian, non-Hispanic students; percentage of Black, non-Hispanic students; percentage of Native Hawaiian or Pacific Islander, non-Hispanic students; and percentage of students eligible for FRPL. When all the characteristics were considered simultaneously in the first regression analysis, the following were statistically significant predictors of participation: a school being located in the Northeast; total school enrollment; percentage of American Indian or Alaska Native, non-Hispanic students; percentage of Asian, non-Hispanic students; percentage of Black, non-Hispanic students; and percentage of Native Hawaiian or Pacific Islander, non-Hispanic students. The specific findings were as follows:
The second model showed that being located in the Northeast or South were statistically significant predictors of school participation:
The third model showed that being located in the Northeast or South were statistically significant predictors of school participation among public schools only:
These results suggest that there was some potential for nonresponse bias in the TIMSS Grade 4 participating school sample based on the characteristics studied before replacement schools were included and nonresponse adjustments were applied.
For TIMSS Grade 4 final sample schools (with substitutes), seven characteristics were found to be statistically significantly related to participation in the bivariate analysis: school control; locale; census region; percentage of Asian, non-Hispanic students; percentage of Black, non-Hispanic students; percentage of Native Hawaiian or Pacific Islander, non-Hispanic students; and percentage of students eligible for FRPL. When all the available factors were considered simultaneously in the first regression analysis, the following were statistically significant predictors of participation: a school being a private school; a school being located in the Northeast, the Midwest, a city, or a suburb; percentage of Asian, non-Hispanic students; and percentage of Native Hawaiian or Pacific Islander, non-Hispanic students. The specific findings were as follows:
The second model showed that being a private school or located in the Northeast, a city, or a suburb were statistically significant predictors of school participation:
The third model showed that being located in the Northeast, a city, or a suburb were statistically significant predictors of school participation among public schools only:
For TIMSS Grade 4 final sample schools with nonresponse adjustments applied to size-adjusted school base weights, no characteristics were found to be statistically significantly related to participation status in the bivariate analysis.
In sum, the fourth-grade school-level bias analysis indicates that bias estimated in the original sample, using available school characteristics, was greatly reduced using replacement schools and nonresponse weight adjustment. The extent to which the results of these bias analyses can be applied to analyses of survey or assessment items depends on the extent to which the items are correlated with the school characteristics used in these analyses.
For TIMSS Grade 8 original sample schools, six characteristics were found to be statistically significantly related to participation in the bivariate analysis: school control; locale; census region; percentage of Asian, non-Hispanic students; percentage of Native Hawaiian or Pacific Islander, non-Hispanic students; and percentage of students eligible for FRPL. When all the available factors were considered simultaneously in the first regression analysis, the following were statistically significant predictors of participation: a school being a private school; a school being located in the Northeast; total school enrollment; and percentage of Native Hawaiian or Pacific Islander, non-Hispanic students. The specific findings were as follows:
The second model showed that being a private school, being located in the Northeast or South, and total school enrollment were statistically significant predictors of school participation:
The third model showed that being located in the Northeast and total school enrollment were statistically significant predictors of school participation among public schools only:
These results suggest that there was some potential for nonresponse bias in the TIMSS Grade 8 participating original sample based on the characteristics studied before replacement schools were included and nonresponse adjustments were applied.
For TIMSS Grade 8 final sample schools (with substitutes), seven characteristics were found to be statistically significantly related to participation in the bivariate analysis: school control; locale; census region; percentage of Asian, non-Hispanic students; percentage of Black, non-Hispanic students; percentage of Native Hawaiian or Pacific Islander, non-Hispanic students; and percentage of students eligible for FRPL. When all the available factors were considered simultaneously in the first regression analysis, the following were statistically significant predictors of participation: a school being a private school; being located in the Northeast; poverty level; total school enrollment; percentage of Black, non-Hispanic students; percentage of Native Hawaiian or Pacific Islander, non-Hispanic students; and percentage of two or more races, non-Hispanic students. The specific findings were as follows:
The second model showed being a private school, being located in the Northeast, and total school enrollment were statistically significant predictors of school participation:
The third model showed that being located in the Northeast and total school enrollment were statistically significant predictors of school participation among public schools only.
For TIMSS Grade 8 final sample schools with nonresponse adjustments applied to size-adjusted school base weights, two characteristics were found to be statistically significantly related to participation in the bivariate analysis: census region and percentage of Native Hawaiian or Pacific Islander, non-Hispanic students. However, the magnitudes of the bias estimates and effect sizes were small.
In sum, the eighth-grade school-level bias analysis indicated that most bias estimated in the original sample using available school characteristics was greatly reduced using replacement schools and nonresponse weight adjustment. After nonresponse weight adjustment, participation status was statistically associated with census region and percentage of Native Hawaiian or Pacific Islander, non-Hispanic students; however, the effect sizes for these characteristics were small. The analyses for schools showed that, after application of a nonresponse weight adjustment, the responding schools were slightly more likely to be from the Midwest (22.1 percent for responding schools vs. 21.3 percent for eligible sampled schools) and Northeast (16.0 percent for responding schools vs. 15.7 percent for eligible sampled schools) census regions and slightly less likely to be from the South (37.1 percent for responding schools vs. 37.8 percent for eligible sampled schools) and West (24.8 percent for responding schools vs. 25.3 percent for eligible sampled schools) census regions. The percentage of Native Hawaiian or Pacific Islander, non-Hispanic students at participating schools was slightly lower than among the sampled schools (0.3 percent vs. 0.5 percent).
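For reference, the bias estimates behind such comparisons are simple differences between the weighted distribution of respondents and that of the eligible sample; relative bias, a standard companion measure, expresses that difference as a percentage of the eligible-sample value. A minimal sketch, using the Midwest figures quoted above:

```python
# School-level nonresponse bias for a category: the difference between the
# weighted percentage among responding schools and among all eligible
# sampled schools. Values taken from the Grade 8 Midwest example above.
pct_respondents = 22.1   # weighted percent of responding schools in the Midwest
pct_eligible = 21.3      # weighted percent of eligible sampled schools in the Midwest

bias = pct_respondents - pct_eligible        # 0.8 percentage points
relative_bias = bias / pct_eligible * 100    # about 3.8 percent

print(f"bias = {bias:.1f} points; relative bias = {relative_bias:.1f}%")
```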
The extent to which the results of these bias analyses can be applied to analyses of survey or assessment items depends on the extent to which the items are correlated with the school characteristics used in these analyses.
13 A size-adjusted base weight equals the base weight times student enrollment in the relevant grade (fourth or eighth).
Beaton, A.E., & Gonzalez, E. (1995). The NAEP Primer. Chestnut Hill, MA: Boston College.
International Association for the Evaluation of Educational Achievement (IEA). (2022). IEA IDB Analyzer (Version 5.0). Hamburg, Germany: IEA Hamburg. Retrieved July 23, 2024, from https://www.iea.nl/data.
Mullis, I.V.S., Martin, M.O., & von Davier, M. (Eds.). (2021). TIMSS 2023 Assessment Frameworks. Chestnut Hill, MA: TIMSS & PIRLS International Study Center, Boston College. Retrieved May 31, 2024, from https://timssandpirls.bc.edu/timss2023/frameworks/index.html.
Reynolds, K.A., & Komakhidze, M. (2022). TIMSS 2023 Environmental Attitudes and Behaviors Framework. Chestnut Hill, MA: TIMSS & PIRLS International Study Center, Boston College. Retrieved August 26, 2024, from https://timssandpirls.bc.edu/timss2023/frameworks/pdf/T23-Environmental-Attitudes-and-Behaviors-Framework.pdf.
Seastrom, M. (2014). NCES Statistical Standards (NCES 2014-097). Washington, DC: U.S. Department of Education, National Center for Education Statistics. Retrieved August 26, 2024, from https://nces.ed.gov/pubsearch/pubsinfo.asp?pubid=2014097.
UNESCO. (2012). International Standard Classification of Education (ISCED) 2011. Montreal, Quebec: UNESCO Institute for Statistics. Retrieved May 29, 2024, from http://uis.unesco.org/sites/default/files/documents/international-standard-classification-of-education-isced-2011-en.pdf.
von Davier, M., Fishbein, B., & Kennedy, A. (Eds.). (2024). TIMSS 2023 Technical Report (Methods and Procedures). Chestnut Hill, MA: TIMSS & PIRLS International Study Center, Boston College. Retrieved October 8, 2024, from https://timss2023.org/methods.
von Davier, M., Kennedy, A., Reynolds, K., Fishbein, B., Khorramdel, L., Aldrich, C., Bookbinder, A., Bezirhan, U., & Yin, L. (2024). TIMSS 2023 International Results in Mathematics and Science. Chestnut Hill, MA: TIMSS & PIRLS International Study Center, Boston College. Retrieved October 8, 2024, from https://doi.org/10.6017/lse.tpisc.timss.rs6460.