Are International Assessment Results Reliable, Valid, and Comparable?
Since the United States began participating in comparative international assessments in the 1960s, the number and scope of international assessments have grown. In addition, the quality of the data they collect has improved because of the international adoption of ever more rigorous technical standards and monitoring, along with growing expertise in the international community relating to assessment design (National Research Council 2002, p. 9). The international organizations that sponsor international student assessments—the OECD and the International Association for the Evaluation of Educational Achievement (IEA)—go to great lengths to ensure that their assessment results are reliable, valid, and comparable among participating countries.12
For each study, the sponsoring international organization verifies that all participating countries select a nationally representative sample of schools and, from those schools, randomly select either classrooms of a particular grade or students of the particular age or grade targeted by the assessment. To ensure comparability, target grades or ages are clearly defined. For example, in TIMSS, at the upper grade level, countries are required to sample students in the grade that corresponds to the end of 8 years of formal schooling, provided that the mean age of the students at the time of testing is at least 13.5 years. Moreover, comparisons by age are carefully chosen to ensure that students at the target age are enrolled in school at comparable rates across countries. For example, PISA elected to study 15-year-old students because 15 is the oldest age at which enrollment rates remain around 90 percent or higher in most developed countries, including the United States (OECD 2008, table C2.1). For students 16 and older, attendance is not universally compulsory.
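The two-stage selection described above can be sketched in code. This is a deliberately simplified illustration, not the operational procedure: real TIMSS and PISA sampling uses stratification and systematic probability-proportional-to-size (PPS) selection, and the school names and enrollment figures below are invented for the example.

```python
import random

def two_stage_sample(schools, n_schools, rng=random.Random(42)):
    """Stage 1: draw schools with probability proportional to enrollment.
    Stage 2: draw one intact classroom at random from each sampled school.
    schools: list of dicts like
      {"name": str, "enrollment": int, "classrooms": [classroom ids]}.
    """
    chosen, pool = [], list(schools)
    for _ in range(n_schools):
        # Simplified PPS draw without replacement: larger schools are
        # more likely to be selected.
        weights = [s["enrollment"] for s in pool]
        pick = rng.choices(pool, weights=weights, k=1)[0]
        chosen.append(pick)
        pool.remove(pick)
    return [(s["name"], rng.choice(s["classrooms"])) for s in chosen]

# Hypothetical sampling frame for illustration only.
schools = [
    {"name": "A", "enrollment": 900, "classrooms": ["A1", "A2", "A3"]},
    {"name": "B", "enrollment": 300, "classrooms": ["B1"]},
    {"name": "C", "enrollment": 600, "classrooms": ["C1", "C2"]},
]
print(two_stage_sample(schools, 2))
```

Because selection at stage 1 is proportional to enrollment, students in large and small schools end up with comparable overall selection probabilities once a fixed-size classroom is drawn at stage 2.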
Not all selected schools and students choose to participate in the assessment, and certain students, such as some with mental or physical disabilities, may not be able to take the assessment. Thus the sponsoring international organizations check each country's participation rates (for schools and students) and exclusion rates (at the school level and within schools) to ensure they meet established target rates in order for the country's results to be reported.13
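The adjudication logic amounts to comparing observed rates against target thresholds. The sketch below illustrates the idea; the 85 percent thresholds are illustrative defaults (echoing the U.S. nonresponse bias analysis trigger mentioned in the footnote), not the actual TIMSS or PISA adjudication rules, which are more detailed.

```python
def meets_reporting_standards(schools_sampled, schools_participating,
                              students_sampled, students_participating,
                              school_target=0.85, student_target=0.85):
    """Return True if both school- and student-level participation
    rates meet their target thresholds (illustrative values)."""
    school_rate = schools_participating / schools_sampled
    student_rate = students_participating / students_sampled
    return school_rate >= school_target and student_rate >= student_target

print(meets_reporting_standards(200, 180, 5000, 4600))  # 0.90 and 0.92 -> True
print(meets_reporting_standards(200, 150, 5000, 4600))  # 0.75 school rate -> False
```

In practice the sponsoring organizations also distinguish rates before and after substitute schools are included, which is why some countries' results carry the "after substitute schools" footnote described below.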
In addition to international requirements and verification to ensure valid samples, the sponsoring international organizations require compliance with standardized procedures for the preparation, administration, and scoring of assessments. Countries are required to send quality-control monitors to visit schools and scoring centers to report on compliance with the standardized procedures. Furthermore, independent international quality-control monitors visit a sample of schools in each country to ensure that the international standards are met.
Results for countries that fail to meet the required participation rates or other international requirements are footnoted with explanations of the specific failures (e.g., "only met guidelines for sample participation rates after substitute schools were included"), shown separately in the international reports (e.g., listed in a separate section at the bottom of a table), or omitted from the international reports and datasets (as happened to The Netherlands' PISA results in 2000, the United Kingdom's PISA results in 2003, and Morocco's TIMSS 2007 results at grade 8). For more details on international requirements, see appendix A.
Every participating country is involved in a thorough process of developing the assessment. The national representatives from each country review every test item to be included in the assessment to ensure that each item adheres to the internationally agreed-upon framework (the outline of the topics and skills that should be assessed in a particular subject area) and that each item is culturally appropriate for their country. Each country translates the assessment into its own language or languages, and external translation companies independently review each country's translations.
A "field test" (a small-scale, trial run of the assessment) is then conducted in the participating countries to see if any items were biased because of national, social, or cultural differences. Statistical analyses of the item data are also conducted to check for evidence of differences in student performance across countries that could indicate a linguistic or conceptual translation problem. Problematic items may be dropped from the final pool of items or scaled differently.
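One simple way to convey the kind of statistical screen described above is to compare each country's percent correct on an item with what would be expected given the country's overall performance, flagging large gaps. This is only a conceptual sketch with invented data; operational studies use formal differential item functioning (DIF) statistics, and the 10-point threshold here is arbitrary.

```python
from statistics import mean

def flag_items(pct_correct_by_country, threshold=0.10):
    """pct_correct_by_country: {country: [p_item1, p_item2, ...]}.
    Flags (country, item_index) pairs where an item is much harder or
    easier than expected given the country's overall level."""
    countries = list(pct_correct_by_country)
    n_items = len(next(iter(pct_correct_by_country.values())))
    # International average difficulty of each item.
    item_means = [mean(pct_correct_by_country[c][i] for c in countries)
                  for i in range(n_items)]
    overall = mean(item_means)
    flagged = []
    for c in countries:
        # How far this country sits above/below the overall mean.
        shift = mean(pct_correct_by_country[c]) - overall
        for i in range(n_items):
            expected = item_means[i] + shift
            if abs(pct_correct_by_country[c][i] - expected) > threshold:
                flagged.append((c, i))
    return flagged

# Invented field-test data: item 3 is unexpectedly hard in country Y,
# the kind of pattern that might signal a translation problem.
data = {
    "X": [0.80, 0.60, 0.70],
    "Y": [0.78, 0.62, 0.30],
}
print(flag_items(data))
```

An item flagged this way would be reviewed and, as the text notes, either dropped from the final pool or scaled differently.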
When this process is complete, the main assessment instruments are created. Each assessment "instrument" consists of the instructions, the same number of "blocks" of items (each block is a small set of selected items from the final pool of items), and a student background questionnaire. (Additional questionnaires are often prepared and administered to the students' teachers, parents, and/or school principal.) The instruments are then administered to the sampled students in each of the participating countries at comparable times.
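The block structure described above can be sketched as a simple rotation: every instrument (booklet) contains the same number of blocks, and rotating the blocks spreads the full item pool across booklets so that no single student must answer every item. This is a minimal illustration; actual designs (balanced incomplete block designs) also balance the position in which each block appears, and the block labels below are invented.

```python
def rotate_booklets(blocks, blocks_per_booklet):
    """Assign blocks to booklets by rotation: booklet k contains
    blocks k, k+1, ... (wrapping around), so every booklet has the
    same number of blocks and every block appears equally often."""
    n = len(blocks)
    return [[blocks[(start + j) % n] for j in range(blocks_per_booklet)]
            for start in range(n)]

blocks = ["M1", "M2", "M3", "M4"]
for i, booklet in enumerate(rotate_booklets(blocks, 2), start=1):
    print(f"Booklet {i}: {booklet}")
```

Because each sampled student receives only one booklet, the rotation lets the study cover a much larger item pool than any individual student could answer in a single testing session, while overlapping blocks link the booklets onto a common scale.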
For more details on the development and administration of the international assessments, see the Technical Reports produced for each assessment.
12 For complete details on the methods instituted to ensure data quality and comparability, see OECD 2008; Martin et al. 2007; and Olson, Martin, and Mullis 2008.
13 The United States also conducts its own nonresponse bias analysis if school participation rates are below 85 percent. For more details about nonresponse bias analysis, see appendix A.