NAEP-TIMSS Linking Study

About the Study

The 2011 NAEP-TIMSS linking study allowed NCES to evaluate multiple linking methodologies. This report contains predicted scores that are based on the statistical moderation approach for the 43 states that did not participate in TIMSS at the state level. The following sections provide a brief description of the linking study samples and methodologies. Details on the design employed in the study and the analyses conducted to evaluate the various methodologies are available in the NAEP-TIMSS Linking Study: Technical Report (NCES 2014-461).

Assessment Samples

To evaluate various linking methodologies, multiple samples of students were assessed during the NAEP testing window (January to March) as well as the TIMSS spring testing window (April to June). The four samples were:

Students assessed in NAEP mathematics or science during the 2011 NAEP testing window (2011 NAEP national sample)
Students assessed during the 2011 NAEP testing window with NAEP-like braided booklets containing both NAEP and TIMSS test questions (braided booklet samples in 2011 NAEP testing window)
Students in the United States assessed in TIMSS mathematics and science during the 2011 TIMSS testing window (2011 TIMSS U.S. national sample)
Students in the United States assessed during the 2011 TIMSS testing window with TIMSS-like braided booklets containing both NAEP and TIMSS test questions (braided booklet sample in 2011 TIMSS administration window)

All NAEP and TIMSS 2011 mathematics and science questions at grade 8 were included in the NAEP-like and TIMSS-like braided booklets.

Samples Assessed During NAEP Testing

In 2011, eighth-grade public school students from all 50 states, the District of Columbia, and the Department of Defense schools were sampled and participated in the NAEP mathematics and science assessments. The NAEP national samples were then composed of all the state samples of public school students, as well as a national sample of private school students. A nationally representative sample of 175,200 eighth-graders from 7,610 schools participated in the NAEP mathematics assessment, and 122,000 eighth-graders from 7,290 schools participated in the NAEP science assessment.

Braided booklets, a set of special booklets containing one block of NAEP and one block of TIMSS test questions, were administered to randomly selected students in a national public school sample: about 5,700 students from 3,710 schools for mathematics and 6,000 students from 3,760 schools for science.

Samples Assessed During TIMSS Testing

A total of 10,500 eighth-graders selected from randomly sampled classrooms in 500 U.S. schools participated in the TIMSS assessment. The TIMSS U.S. sample did not have a state component similar to NAEP.

In addition to the TIMSS U.S. national sample, nine U.S. states (Alabama, California, Colorado, Connecticut, Florida, Indiana, Massachusetts, Minnesota, and North Carolina) participated in TIMSS at the state level. These states thus received actual TIMSS scores and could compare the mathematics and science achievement of their students directly against the TIMSS education systems. In the linking study, the nine states served as “validation states”: their actual TIMSS scores were used to check the accuracy of their predicted TIMSS scores. About 2,200 public school students from each of the nine validation states (approximately 19,600 students in total) were selected to participate in the TIMSS assessment.

Furthermore, another set of braided booklets was administered to a nationally representative sample of 10,400 U.S. students from 510 schools. These braided booklets contained either one block of NAEP mathematics with two blocks of TIMSS mathematics and one block of TIMSS science, or one block of NAEP science with two blocks of TIMSS science and one block of TIMSS mathematics.

Accommodations and Exclusions

NAEP allows accommodations (e.g., extra testing time or individual rather than group administration) so more students with disabilities (SD) and English language learners (ELL) can participate in the assessment. This additional participation helps ensure that NAEP results accurately reflect the educational performance of all students in the target population. Exclusions in NAEP could occur at the school level, with entire schools being excluded. For the U.S. states that participated in the 2011 eighth-grade NAEP assessments, the exclusion rates ranged from 1 to 10 percent in mathematics and from 1 to 3 percent in science. For the nine states that also participated in 2011 TIMSS, the exclusion rates for NAEP participation ranged from 1 to 4 percent in mathematics and from 1 to 3 percent in science. The NAEP sampling frame excluded ungraded schools, special-education-only schools, and hospital schools, as well as schools serving prisons and juvenile correctional institutions.

Exclusions in TIMSS

Unlike NAEP, TIMSS does not provide testing accommodations for SD and ELL students. The International Association for the Evaluation of Educational Achievement (IEA) does, however, require that the student exclusion rate not exceed 5 percent of the national desired target population (Foy, Joncas, and Zuhlke 2009).¹

Exclusions in TIMSS could occur at the school level, with entire schools being excluded, or within schools with specific students or entire classrooms excluded. Schools could be excluded that

are geographically inaccessible;
are of extremely small size;
offer a curriculum or school structure radically different from the mainstream educational system; or
provide instruction only to students in the excluded categories as defined under “within school exclusions,” such as schools for the blind.

Within the schools that are selected to participate, students may be excluded because of intellectual or functional disability, or the inability to read or speak the language(s) of the test (e.g., ELL students in the United States).

Seven percent of eighth-graders were excluded in the U.S. national sample of TIMSS 2011. Therefore, the U.S. results at grade 8 carry a coverage annotation for not meeting the IEA standard inclusion rate of 95 percent. Among the nine validation states, only three states—Alabama, Colorado, and Minnesota—met the IEA inclusion rate standard. See information on the TIMSS exclusion rates in the nine U.S. states and the education systems that participated in the 2011 TIMSS assessment. It should be noted that there is one exclusion rate for each validation state or education system in TIMSS because the same sampled students were assessed in both mathematics and science.

Linking Methodologies

The process by which NAEP results are reported on the TIMSS scale is referred to as statistical linking. The design of the 2011 study allowed for the use of three different linking methods (statistical moderation, statistical projection, and calibration) to predict TIMSS results for the U.S. states that participated in NAEP.

Statistical moderation aligns score distributions such that scores on one assessment are adjusted to match certain characteristics of the score distribution on the other assessment. In this study, moderation linking was accomplished by adjusting NAEP scores so that the adjusted score distribution for the public school students who participated in 2011 NAEP had the same mean and variance as the score distribution for public school students in the TIMSS U.S. national sample. This allowed NAEP results to be reported on the TIMSS scale.
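The mean-and-variance matching described above can be sketched as a linear transformation. This is a minimal illustration, not the study's implementation: the function names are hypothetical, the scores are made-up numbers, and sampling weights (which the operational analysis would use) are ignored.

```python
from statistics import fmean, pstdev

def moderation_params(naep_scores, timss_scores):
    """Return (intercept, slope) of a linear moderation function.

    The slope is the ratio of standard deviations and the intercept
    aligns the means, so that linked NAEP scores share the TIMSS
    mean and variance. Unweighted: an assumption for illustration.
    """
    slope = pstdev(timss_scores) / pstdev(naep_scores)
    intercept = fmean(timss_scores) - slope * fmean(naep_scores)
    return intercept, slope

def to_timss_scale(naep_score, intercept, slope):
    """Place one NAEP score on the TIMSS scale."""
    return intercept + slope * naep_score

# Made-up scores, for illustration only.
naep_sample = [250, 270, 290, 310, 330]
timss_sample = [430, 470, 510, 550, 590]
intercept, slope = moderation_params(naep_sample, timss_sample)
linked = [to_timss_scale(x, intercept, slope) for x in naep_sample]
```

Because only two means and two standard deviations are estimated, the function has just two parameters.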

Neither NAEP nor TIMSS provides student-level scores. Rather, both assessments provide five plausible values for individual students, each resulting in unbiased estimates of the mean and the standard deviation of the proficiency distribution overall and of the student groups. For this reason, moderation linking function parameters were estimated five times by pairing one set of estimates of the NAEP mean and standard deviation with one set of estimates of the TIMSS mean and standard deviation. The final values of the moderation linking function parameter estimates were the average of the five values. To predict the mean TIMSS scores and the percentages of students reaching each TIMSS benchmark (Advanced, High, Intermediate, and Low) for each state, the moderation linking function was applied to individual state NAEP score distributions. The moderation method did not assume that the two assessments measured exactly the same construct. However, the linking results were dependent upon having two samples, one from each assessment, to align the score distributions. Thus, the more NAEP and TIMSS vary in content, format, or context, the more likely the moderation-based linking results would differ markedly if statistical moderation were carried out with different samples of students.
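A rough sketch of the averaging and benchmark steps follows. The function names are hypothetical and the calculation is unweighted. The cut points (625, 550, 475, 400) are the published TIMSS international benchmarks; they are not stated in this section and are included here as an assumption about which cut points apply.

```python
from statistics import fmean

# TIMSS international benchmark cut points (not given in this section).
TIMSS_BENCHMARKS = {"Advanced": 625, "High": 550, "Intermediate": 475, "Low": 400}

def average_params(param_sets):
    """Average (intercept, slope) pairs, one pair per plausible value."""
    intercepts, slopes = zip(*param_sets)
    return fmean(intercepts), fmean(slopes)

def percent_reaching(naep_scores, intercept, slope, benchmarks=TIMSS_BENCHMARKS):
    """Link a state's NAEP scores and report the percentage of students
    at or above each benchmark (unweighted, for illustration)."""
    linked = [intercept + slope * x for x in naep_scores]
    return {
        name: 100.0 * sum(score >= cut for score in linked) / len(linked)
        for name, cut in benchmarks.items()
    }

# Hypothetical parameter estimates from five plausible-value pairings.
five_estimates = [(-70.0, 2.00), (-71.0, 2.01), (-69.5, 1.99), (-70.5, 2.00), (-69.0, 2.00)]
intercept, slope = average_params(five_estimates)
```

In an operational analysis the state percentages would themselves be computed for each plausible value with sampling weights; the sketch omits that for brevity.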

Statistical projection involves developing a function to project performance on one assessment based on the performance on the other assessment. In this study, the samples of students who were assessed with the braided booklets (i.e., the samples that responded to both NAEP and TIMSS questions) were used to determine the prediction function. Two separate prediction functions were developed for each subject—one using the braided booklet sample assessed during the NAEP testing window and one using the braided booklet sample assessed during the TIMSS testing window. The projection function from the NAEP window braided sample was used to compare results among the three linking methods examined in the study. Similar to the statistical moderation method, the statistical projection method did not assume that the two assessments to be linked measured exactly the same construct.
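The report does not give the functional form of the prediction function. As a hedged illustration only, the sketch below assumes a simple least-squares line fitted to paired scores from a braided booklet sample; the function names and data are hypothetical.

```python
from statistics import fmean

def fit_projection(naep, timss):
    """Least-squares line predicting TIMSS from NAEP for students who
    answered both sets of questions. Returns (intercept, slope)."""
    mx, my = fmean(naep), fmean(timss)
    sxx = sum((x - mx) ** 2 for x in naep)
    sxy = sum((x - mx) * (y - my) for x, y in zip(naep, timss))
    slope = sxy / sxx
    return my - slope * mx, slope

def project(naep_score, intercept, slope):
    """Predicted TIMSS score for a given NAEP score."""
    return intercept + slope * naep_score

# Made-up paired scores from a hypothetical braided sample.
paired_naep = [260, 280, 300, 320]
paired_timss = [450, 500, 520, 570]
intercept, slope = fit_projection(paired_naep, paired_timss)
```

Unlike moderation, whose slope is the ratio of the two standard deviations, a regression slope is that ratio multiplied by the correlation between the paired scores, which is why projection needs students who responded to both assessments.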

Calibration linking, as discussed in Kolen and Brennan (2004, page 430)², is a type of linking used when the two assessments are based on

the same framework but different test specifications and different statistical characteristics, or
different frameworks and different test specifications, but the frameworks are viewed as sharing common features and/or uses.

In this study, calibration was accomplished by applying the item-response theory method to calibrate NAEP items directly onto the TIMSS score scale that was established using students’ responses to TIMSS items. Data collected from the NAEP sample, TIMSS sample, and the two braided booklet samples were all used in the calibration linking. With NAEP items calibrated onto the TIMSS scale, it was possible to predict TIMSS scores for students who took only NAEP items.

The three linking methods discussed above were all applied to predict likely TIMSS scores for each of the states based on their NAEP results. For each linking method, the accuracy of the predicted TIMSS scores was evaluated by comparing predicted TIMSS results to the actual results for the nine 2011 validation states and results for national student groups (gender and race/ethnicity) as well. All three linking methods yielded comparable predicted state TIMSS results and the national TIMSS results by student groups. The difference between predicted and actual TIMSS results was not statistically significant for any of the national gender or racial/ethnic groups across all linking methods.

Once it was determined that all three methods of linking yielded essentially the same results, it was decided that one method should be chosen to provide estimates for this report. Statistical moderation was selected by NCES because it was the simplest method requiring the estimation of the fewest parameters (i.e., the means and standard deviations of the U.S. national public school samples for NAEP and TIMSS). The method was also applied to the extant national samples of NAEP and TIMSS and did not require the use of the braided booklet samples that were required for the calibration and projection methods of linking. This means NCES has the option of conducting future NAEP-TIMSS linking studies using statistical moderation without the time and expense of braided booklet samples.

However, for the validation states, some differences were observed between their linkage-based predicted TIMSS scores and their actual TIMSS scores. To reduce the observed differences, a two-stage adjustment procedure was applied in addition to the statistical moderation linking parameters.

The first stage of the procedure was intended to adjust the predicted TIMSS means for all states to account for differences in population coverage between the NAEP and TIMSS state samples that resulted from the two programs’ different exclusion and accommodations policies. Each state’s NAEP accommodation rate was used to adjust the predicted state TIMSS mean closer to what might have been observed if the NAEP target population was more similar to that of TIMSS. The adjustment function was a linear regression function derived from the nine validation states that participated in both NAEP and TIMSS at the state level. The same adjustment function was then applied to those states where the NAEP accommodation rate was available.

In the second stage, a function was derived to model the relationship between the actual TIMSS scores for the nine validation states and their predicted TIMSS scores after the adjustment for NAEP accommodation rates. This function was used as the second adjustment factor that was applied to all states’ predicted TIMSS means.
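The two stages can be sketched as two successive fitted lines. Several details here are assumptions, since the text does not specify them: stage one is taken to regress the gap between actual and predicted means on the NAEP accommodation rate, stage two is taken to be a linear function, and all names are hypothetical.

```python
from statistics import fmean

def fit_line(x, y):
    """Ordinary least-squares line; returns (intercept, slope)."""
    mx, my = fmean(x), fmean(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return my - slope * mx, slope

def fit_two_stage(predicted, actual, accom_rate):
    """Derive both adjustment functions from the validation states."""
    # Stage 1 (assumed form): gap between actual and predicted means
    # modeled as a linear function of the NAEP accommodation rate.
    gaps = [a - p for a, p in zip(actual, predicted)]
    stage1 = fit_line(accom_rate, gaps)
    adjusted = [p + stage1[0] + stage1[1] * r
                for p, r in zip(predicted, accom_rate)]
    # Stage 2 (assumed linear): actual means modeled from the
    # stage-1 adjusted predictions.
    stage2 = fit_line(adjusted, actual)
    return stage1, stage2

def apply_two_stage(predicted_mean, accom_rate, stage1, stage2):
    """Apply both adjustments to one state's predicted TIMSS mean."""
    adjusted = predicted_mean + stage1[0] + stage1[1] * accom_rate
    return stage2[0] + stage2[1] * adjusted
```

A state without TIMSS data would pass its moderation-based predicted mean and its NAEP accommodation rate to `apply_two_stage`, using the functions fitted on the nine validation states.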

The predicted state TIMSS results presented here are, therefore, estimated from the statistical moderation linking that incorporated the two-stage adjustment procedure.

Interpreting Statistical Significance

Comparisons between predicted state results from the 2011 NAEP-TIMSS linking study and education systems (that have actual TIMSS scores) consider both the size of the differences and the standard errors of the two statistics being compared. The size of the standard errors is influenced by many factors, such as the degree of uncertainty associated with statistics estimated from a sample, and the degree of uncertainty related to the linking function. There were other sources of error associated with the predicted TIMSS scores that were not taken into account. These include the uncertainty associated with the adjustment function derived in the first stage of the two-stage adjustment procedure to account for the differences in exclusion and accommodation rates between NAEP and TIMSS.

When an estimate has a large standard error, a numerical difference that seems large may not be statistically significant. Differences of the same magnitude may or may not be statistically significant depending upon the size of the standard errors of the estimates. Only statistically significant differences (at a level of .05) are discussed as higher or lower in this report. No statistical adjustments to account for multiple comparisons were used.
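The role of the standard errors can be illustrated with a simple two-sided test. This is a sketch under stated assumptions, not the report's documented procedure: the two estimates are treated as independent, a normal approximation is used, and the function name and numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def is_significantly_different(est_a, se_a, est_b, se_b, alpha=0.05):
    """Two-sided test of the difference between two independent estimates.

    Returns (significant, z), where z is the difference divided by
    the standard error of the difference.
    """
    z = (est_a - est_b) / sqrt(se_a ** 2 + se_b ** 2)
    critical = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = .05
    return abs(z) > critical, z

# The same 10-point difference with small and with large standard errors.
small_se = is_significantly_different(520, 3, 510, 4)
large_se = is_significantly_different(520, 6, 510, 8)
```

With standard errors of 3 and 4 the 10-point difference gives z = 2.0 and is significant at the .05 level; with standard errors of 6 and 8 the same difference gives z = 1.0 and is not.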

¹ Foy, P., Joncas, M., and Zuhlke, O. (2009). TIMSS 2011 School Sampling Manual. Unpublished manuscript, Chestnut Hill, MA: Boston College.

² Kolen, M. J., and Brennan, R. L. (2004). Test Equating, Scaling, and Linking. New York, NY: Springer.

Last updated 07 October 2014 (RF)