Mark Schneider
Commissioner, National Center for Education Statistics

Benefits and Limitations of States Benchmarking to International Standards:
A Meeting to Assist States in Making Informed Decisions about Participating in International Assessments

May 30, 2008



This document provides a brief overview of the presentations and discussion at the National Center for Education Statistics (NCES) May 30, 2008, symposium on the benefits and limitations of states benchmarking to international standards. Several prominent experts in assessment and standards were asked to present and be available for questions during the discussion with national and state education policymakers (meeting participants). The symposium consisted of two sessions of formal presentations. NCES Commissioner Mark Schneider provided introductory remarks for both sessions. Each session concluded with a discussion among the audience and presenters. Tom Loveless (Brookings Institution) moderated the discussion of the first session. Institute of Education Sciences Director Grover J. "Russ" Whitehurst moderated the second session's discussion.

Synopses of each session's presentations, and the topics of discussion at the end of each session, are presented on the following pages of this document in the order in which they occurred. The formal presentations in each session were as follows:

Session 1: What Do International Assessments Measure?

Session 2: How Are the Data From International Assessments Used?

For further information about the symposium, contact Dan McGrath (Program Director, NCES International Activities Program) at

Session 1: What Do International Assessments Measure?

Introductory Remarks (Mark Schneider, NCES Commissioner)

To provide context for the session, Dr. Schneider reviewed past efforts to benchmark student performance, at the state level, to international standards. The TIMSS 1995 Benchmarking Study included five states and a consortium of school districts, and was funded by the states and districts involved. In 1999, the study was expanded to include 13 states and 14 districts or consortia of districts, with the additional costs underwritten by NCES and the National Science Foundation. One state participated in 2003 and two states participated in 2007; in all three cases, the states paid the costs of administration. Some states have expressed to NCES an interest in participating in PISA 2009, although, as of mid-June 2008, none had entered into the contracts needed to undertake the assessment.

Dr. Schneider underscored that what the assessments measure (the knowledge and skills they emphasize, how they relate to states' curricula, and the ages and grade levels assessed) has important policy implications for states. PISA and TIMSS are very different from each other in what they measure. In turn, the policy implications drawn from them may differ from each other, and from the policy implications that would be drawn from the states' own assessments. Therefore, it is important for states to ask themselves, "Does this assessment measure something about which I care?"

What TIMSS and PIRLS Measure (Ina Mullis, Co-Project Director)

Benchmarking to International Standards - TIMSS & PIRLS: A Bridge to School Improvement MS PowerPoint (2,487 KB)

Dr. Mullis described TIMSS and PIRLS.



What PISA Measures (Ray Adams, Project Director)

What do international assessments measure: PISA MS PowerPoint (1,116 KB)

Dr. Adams presented on PISA. Key points included:

Comparing TIMSS and PISA to NAEP (Eugene Owen, NCES)

Comparing International Assessments to NAEP MS PowerPoint (452 KB)

Dr. Owen presented results from recent studies undertaken by NCES to compare TIMSS and PISA to NAEP in terms of their measurement frameworks, their relative emphases across content areas and cognitive skills, and the likelihood of the content/skills they assess being included in curricula in U.S. schools. Key points included:

TIMSS-NAEP 2007 comparisons

PISA-NAEP science comparisons

Aligning State Policies to International Assessment Standards (Sandy Kress, Akin Gump)

Aligning State Policies to International Assessment Standards MS Word (37 KB)

Mr. Kress discussed the relationship between standards and assessments and the importance of thinking strategically about standards, policy and practice, and the use of assessments as benchmarks. Key points included:

Session 1 Discussion

Moderated by Tom Loveless (Brookings Institution), the discussion at the end of the first session included the following main topics and points:

1. What international organizations are responsible for PISA, TIMSS, and PIRLS, and how are decisions about the assessments made?

Questions that were asked about these organizations led to a discussion of the governance arrangements for each of the assessments.

2. Separation of data collection/reporting and policy interpretation

Audience members and presenters debated the importance of the separation of data collection and interpretation. A concern raised by Tom Loveless, the chair and discussion moderator for session 1, was the extent to which the international bodies (the IEA and, particularly, the OECD) mix the reporting and policy implications of the results of the international assessments. In the United States, Office of Management and Budget guidelines call for a separation in time and space of the reporting of statistical results and the policy interpretation of those results. Strict separation of data collection/reporting and interpretation for policy implications helps to maintain the credibility of the statistical results. To the extent that the bodies representing international assessments mix the release of results with policy prescriptions drawn from an interpretation of the results, they risk undermining public confidence in the integrity of the data. This is especially true if the prescriptions are not based on rigorous scientific evidence.

3. What is the difference between a curriculum-based assessment, like TIMSS, and an assessment that measures the "yield" of learning, like PISA?

Questions raised about the concept of a "yield," and what exactly PISA tests if it is not curriculum based, led experts to explain that PISA focuses on the application of learned knowledge and skills to situations that students are expected to encounter as young adults. These situations are not tied to specific curricular objectives.

4. Importance of aligning states' standards to those of the top-performing nations

Several people echoed Mr. Kress's statements about the importance of setting rigorous standards. There was considerable discussion about how to encourage the development and implementation of rigorous standards across the states.

5. How is curricular information collected in the United States, and how can we learn more about the curriculum of other countries?

Several people called for collecting information about the curricula of the top-performing countries, and there was discussion about how curricular information is collected in the United States and internationally. The problem with focusing only on data from the top-performing countries was also noted: low-performing countries may be employing the same practices as top-performing countries. TIMSS and PIRLS each ask countries to report curricular information, and the information is available through the International Study Center at Boston College. Reporting this information accurately for the United States is challenging. While NCES consults with national education groups (such as the Council of Chief State School Officers) when responding, curricular information is often reported as varying across states and districts.

6. What are the costs of administering international assessments in the states?

Because no states have participated in PISA in the past and relatively few have participated in TIMSS, it is difficult to estimate the costs. However, based on national costs and samples of 1,500 students per state, a broad estimate would be $1 million per state for PISA 2009. Questions were raised about why U.S. costs are apparently so high relative to costs in other countries, and why, in the United States, PISA has been much more expensive per student to implement than TIMSS. On the first question, the cost of doing business for survey data collection is high in the United States relative to other countries. On the second, the differences in costs between PISA and TIMSS within the United States are driven by the relatively higher recruitment, data collection, and scoring costs for PISA, as well as economies of scale resulting from the historically larger U.S. samples for TIMSS. NAEP is far less expensive per student than PISA or TIMSS, largely because of economies of scale.


Session 2: How Are the Data From International Assessments Used?

Introductory Remarks (Mark Schneider, NCES Commissioner)

Dr. Schneider described large-scale assessments such as TIMSS, PISA, and NAEP as instruments that perform a limited set of functions exceedingly well, but cautioned against putting them to uses for which they were not intended. Dr. Schneider raised three main points to be discussed in more detail by the presenters:

1) The international assessments do not provide usable scores at the student, classroom, school, or, in most cases, district level. Since they do not produce student scores and since they are cross-sectional, these assessments cannot measure individual student gains over time.

2) Cross-national measures of what matters in learning are still in their infancy. While we have made great progress in developing cross-national measures of knowledge and skills, key aspects of classrooms, schools, and education systems remain unmeasured or poorly measured.

3) International assessments are expensive. There may be alternate ways to generate state TIMSS and PISA scores that are less expensive and less burdensome. However, linking studies are based on statistical models that rely on a set of assumptions, and people have to be comfortable with the assumptions built into the model. Furthermore, results obtained from these models do not lend themselves to the kinds of analysis for which many researchers use TIMSS and PISA.
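As an illustration of the kind of assumption point 3 refers to, the sketch below shows one simple form of statistical linking: a linear link that matches means and standard deviations in order to project a state's NAEP mean onto an international scale. It is not presented here as the method discussed at the symposium. The function names and all numeric values are hypothetical and chosen only for illustration. The key assumption is that a score's distance from the national mean, in standard deviation units, carries over from one scale to the other.

```python
def linear_link(naep_mean, naep_sd, intl_mean, intl_sd):
    """Return (slope, intercept) of a line that maps the NAEP scale onto
    the international scale by matching national means and SDs."""
    slope = intl_sd / naep_sd
    intercept = intl_mean - slope * naep_mean
    return slope, intercept


def project(state_naep_mean, slope, intercept):
    """Project a state's NAEP mean onto the international scale."""
    return intercept + slope * state_naep_mean


# Hypothetical national results on each assessment (illustrative only,
# not actual NAEP, TIMSS, or PISA figures).
NAEP_MEAN, NAEP_SD = 280.0, 36.0
INTL_MEAN, INTL_SD = 505.0, 80.0

slope, intercept = linear_link(NAEP_MEAN, NAEP_SD, INTL_MEAN, INTL_SD)

# A hypothetical state scoring 10 NAEP points above the national mean.
state_estimate = project(290.0, slope, intercept)
print(f"Estimated international-scale mean: {state_estimate:.1f}")
```

If the two assessments measure different content or different populations, the assumption weakens, and the projected score becomes correspondingly less trustworthy.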

Analytic Possibilities and Data Limitations (Larry Hedges, Northwestern University)

Benchmarking with National and International Assessments MS PowerPoint (39 KB)

Dr. Hedges discussed the benefits and limitations of state benchmarking with PISA, TIMSS, and PIRLS. Key points included:



Other benchmarking possibilities

Alternatives to Empirical Benchmarking (Gary Phillips, American Institutes for Research)

Obtaining International Benchmarks for States Through Statistical Linking MS PowerPoint (83 KB)

Dr. Phillips discussed alternatives to benchmarking that is based on testing large numbers of students in each state. He emphasized methods that would allow states to use NAEP scores to estimate scores on international scales. Key points included:

Comments on the Hedges and Phillips presentations (Jack Buckley, Deputy Commissioner, NCES)

What Use Are International Assessments for States? MS PowerPoint (339 KB)

Dr. Buckley served as discussant for this session and provided comments on the Hedges and Phillips presentations. He also presented ongoing NCES work on the feasibility of using small area estimation methods for producing state-level estimates from national samples of international assessments. Key points included:

Session 2 Discussion

Moderated by Grover J. "Russ" Whitehurst (Institute of Education Sciences Director), the discussion at the end of the second session included the following main topics:

1. What are states interested in gaining from international assessments, especially with respect to benchmarking?

Concerns about economic competition fundamentally drive the states' interest in international benchmarking: governors and business leaders are worried about how well our students and workforce measure up to those of our international competitors.

A new international assessment sponsored by the OECD, the Program for the International Assessment of Adult Competencies (PIAAC), slated for data collection in 2011, should directly address concerns about the skills of adults. PIAAC will assess adults in literacy, numeracy, and problem solving in a technology-rich environment and provide benchmarks to most of the OECD countries.

2. Efforts needed to raise standards

Members of the audience asked whether state and national leaders would commit themselves to raising standards and agreeing on them. This would require better coordination on the part of the federal government and the states, as well as a discussion about what resources are needed to reach this goal. Many current high school exit exams are thought to test at a low level. There were calls for state leaders to design more appropriate tests and accept higher failure rates.

3. Seven questions for benchmarking internationally

There was considerable discussion around a set of seven questions that an audience member suggested we should be able to answer about the 10 top-performing countries (and perhaps the 10 bottom-performing countries):

  1. What do they want their kids to learn?
  2. What is their curriculum?
  3. How do they deliver it?
  4. How do they assess it?
  5. What is their cut score?
  6. How do they do against it?
  7. How well prepared are those who pass it (based on the standards in their country)?

The symposium concluded with Commissioner Schneider thanking the presenters and participants and inviting further discussion.