As described by Mislevy (1992) and Linn (1993), the central problems of "linking assessments" are determining the relationships between the evidence that two measures give about performance of interest and interpreting such evidence correctly. For the purposes of discussion, assume that there are two assessments, Assessment X and Assessment Y, and that the data produced by Assessment X can provide answers, properly qualified, to various questions involving student achievement. Further assume that there is a desire to "link Assessment Y to Assessment X." This means that one hopes to be able to answer these same questions, but using students' performance on Assessment Y. A specific example is linking the results of NAEP to the results of TIMSS to enable the prediction of state-level TIMSS means, based on state-level NAEP data.

How well a link will work, and the procedures necessary to accomplish it, depend on how similar the two assessments are in their goals, content coverage, and measurement properties. Mislevy and Linn defined four types of linking: equating, calibration, projection, and moderation. These are listed in decreasing order of both the strength of the assumptions required and the strength of the link produced: equating requires the strongest assumptions and yields the strongest link, while moderation requires the weakest assumptions and yields the weakest link.

3.1 Equating

The strongest link occurs if the two assessments are built to the same specifications. Requirements include complete matches in content coverage, difficulty, type of questions used, mode of administration, test length, and measurement accuracy at each score point. Under such carefully controlled circumstances, the assessment results are essentially interchangeable and, by matching up score distributions, it is possible to construct a one-to-one correspondence table of scores on X and scores on Y so that any question that could be addressed using scores from Assessment X can be addressed in exactly the same way with transformed scores from Assessment Y, and vice versa. When equating is possible, it is because of the way the assessments were constructed, not simply because of the way the linking data were collected or the linking function constructed.
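The one-to-one correspondence table described above is built by matching up the two score distributions point by point. A minimal sketch of such percentile matching follows; the function and variable names, and the data in the usage note, are illustrative assumptions, not taken from any actual assessment.

```python
from bisect import bisect_right

def equipercentile_table(scores_x, scores_y, y_points):
    """Build an X-scale equivalent for each Y score in y_points by
    matching percentile ranks in the two observed score distributions.
    Illustrative sketch only: with real assessments this step is
    justified by how the forms were constructed, not by the arithmetic."""
    xs, ys = sorted(scores_x), sorted(scores_y)
    table = {}
    for y in y_points:
        rank = bisect_right(ys, y) / len(ys)        # percentile rank of y in Y
        idx = min(int(rank * len(xs)), len(xs) - 1)  # score at the same rank in X
        table[y] = xs[idx]
    return table
```

For example, if Y scores run twice as high as X scores but rank students identically, the table simply halves each Y score onto the X scale, and the mapping is monotone by construction.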

3.2 Calibration

A somewhat weaker kind of linking is possible if Assessment Y has been constructed to the same framework as Assessment X, but with different precision or level of difficulty. In this case, equating is not possible, but the results of the two assessments can be adjusted so that the expected score of a given student is the same on both assessments. As a consequence of different measurement characteristics in the X and Y data, the procedures needed to permit Y data to answer certain questions that could be addressed from X data will depend on the specific questions. Thus, Y data might be used to answer X data questions, but generally not by means of a single linking function as would be sufficient for assessments built to support equating.

3.3 Projection

A yet weaker linking obtains if the two assessments use different types of tasks, different administration conditions, or otherwise do not measure the same trait. Projection uses statistical methodology (often regression) to derive predictions from Y data about characteristics of the X distribution, in terms of a probability distribution for expectations about the possible outcomes. As the similarity between the two assessments decreases, the value of the Y data to answer X data questions also decreases and projections become increasingly sensitive to other sources of information. For example, the relationship between X and Y might vary across subpopulations of students and might change over time because of changes in policy or instruction.
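One concrete form of projection is ordinary least-squares regression of X scores on Y scores, which yields both a point prediction and a residual spread, so the projection can be stated as a distribution of plausible X outcomes rather than a single value. The sketch below is a hedged illustration, assuming a sample of students with scores on both assessments; it is not the methodology of any particular linking study.

```python
from math import sqrt

def project(y_scores, x_scores, y_new):
    """Project a new Assessment Y score onto the Assessment X scale by
    least-squares regression of X on Y. Returns (prediction, residual SD)
    so the result can be read as a probability distribution, not a score.
    Assumes paired (y, x) observations; names are illustrative."""
    n = len(y_scores)
    my = sum(y_scores) / n
    mx = sum(x_scores) / n
    sxy = sum((y - my) * (x - mx) for y, x in zip(y_scores, x_scores))
    syy = sum((y - my) ** 2 for y in y_scores)
    slope = sxy / syy
    intercept = mx - slope * my
    resid = [x - (intercept + slope * y) for y, x in zip(y_scores, x_scores)]
    sigma = sqrt(sum(r * r for r in resid) / (n - 2))  # residual SD
    return intercept + slope * y_new, sigma
```

The residual SD is what shrinks as the two assessments become more similar; for weakly related assessments it dominates the prediction, which is the sensitivity the text describes.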

3.4 Moderation

The weakest linking occurs when the two assessments are not assumed to be measuring the same construct, but scores that are comparable in some sense are still desired. Often, the two assessments are administered to nonoverlapping sets of students. Statistical moderation matches X and Y score distributions by simply applying the formulas of equating, while recognizing that the assessments have not been constructed to support equating. The procedures of statistical moderation can produce markedly different links among tests if carried out with different samples of students.

3.5 Linear Moderation Procedures

Because they are the only data available, a link between TIMSS and NAEP will be based on the reported results from the 1995 administration of TIMSS in the United States and the results from the 1996 NAEP. Since TIMSS and NAEP differ to varying degrees in terms of the assessment specifications, the numbers and kinds of tasks presented to students, and administration conditions, and since the TIMSS data and the NAEP data come from distinct administrations conducted one year apart, it is clear that the type of linking that can be accomplished will fall into the realm of statistical moderation.

The link between the two assessments will be established by applying formal equating procedures to match up characteristics of the score distributions for the two assessments. The next section establishes that linear moderation procedures (see, e.g., Petersen, Kolen, and Hoover 1989, who call it linear equating) provide an acceptable link between the two assessments. Linear moderation adjusts the distributions of the two assessments so that they have the same mean and standard deviation. This was the procedure used by Beaton and Gonzalez (1993) to express the 1991 IAEP results for the U.S. sample of 13-year-olds on the scale of the 1990 NAEP for public school students in grade 8. It is also the procedure used by NAEP to link the results of the Trial State Assessment to those of the national assessment.
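The mean-and-standard-deviation matching that defines linear moderation can be sketched as follows. This is an illustrative implementation of the general idea only, not the procedure actually applied to the NAEP or TIMSS data.

```python
from statistics import mean, pstdev

def linear_moderation(y_scores, x_scores):
    """Return a function mapping Assessment Y scores onto the Assessment X
    scale so that the transformed Y distribution has the same mean and
    standard deviation as the X distribution. Illustrative sketch."""
    mx, sx = mean(x_scores), pstdev(x_scores)
    my, sy = mean(y_scores), pstdev(y_scores)
    slope = sx / sy
    return lambda y: mx + slope * (y - my)
```

By construction the transformed Y mean equals the X mean and the transformed Y standard deviation equals the X standard deviation; nothing in the procedure checks that the two instruments measure the same thing, which is exactly the caution raised in the next section.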

3.6 The Importance of Matching Content Coverage

Even though the link between NAEP and TIMSS falls into the weakest category, moderation, there is naturally an expectation that the two assessments measure more or less the same thing, so that the linked results can be assumed to supply useful information. While this expectation may well be warranted, the following example demonstrates the danger of conducting a linking study when no student has taken forms of both assessments.

The two panels in Figure 1 give marginal distributions for pairs of tests to be linked. It is stressed that these are completely fictional data generated to make a point. The solid line in Panel A gives the frequency distribution of scores on a hypothetical test (Test A1) while the dashed line in Panel A gives the frequency distribution for another hypothetical test (Test A2). Consider the possibility of building a useful link between Test A1 and Test A2. Panel B provides hypothetical marginal frequency distributions for two other tests (Test B1 and Test B2) that are also to be linked together.


Note that the two distributions in Panel A are quite similar to each other. With no other information, one would be inclined to expect that the similarity of the marginal distributions for Test A1 and Test A2 indicates a pair of tests that are strongly related to each other and that would provide a strong and reliable linking—a linking accomplished through, for example, matching percentiles or moments of the two distributions. The pair of distributions shown in Panel B are much less similar. On the face of it, one might be much less confident about the possibilities of linking together Test B1 and Test B2.

However, consider Figure 2. Panel A of the figure gives a scatter plot of the scores on Test A1 versus the scores on Test A2. Clearly there is little relation between the scores on one test and the scores on the other; in fact, the between-test correlation is 0.02. Tests A1 and A2 are examples of data that have roughly the same marginal distributions but are essentially unrelated to each other. An example might be two tests that have been scaled to have approximately normal marginal distributions but that measure roughly independent abilities (such as physical strength and mathematical achievement). On the other hand, Panel B shows a much tighter relationship (a correlation of 0.98) between scores on Test B1 and Test B2. These two tests are examples of instruments that largely measure the same underlying construct, such as two mathematics tests built to the same content specifications but having somewhat different marginal distributions.

Figure 2—Scatter plots of the hypothetical test score pairs. Panel A: Test A1 versus Test A2 (correlation = 0.02). Panel B: Test B1 versus Test B2 (correlation = 0.98).

In the current situation of linking NAEP and TIMSS, it is likely that the two assessments are, to a large extent, measuring comparable constructs. However, to the degree that the content match is imperfect, the linking could be problematic.

Several recent studies have documented the instability of distributional matching procedures and have attributed at least some of that instability to differences in content coverage. For example, Ercikan (1997) used equipercentile equating procedures to link state-level results from standardized tests (the California Achievement Tests, published by CTB MacMillan/McGraw-Hill) to the 1990 NAEP mathematics scale. Four states that participated in the 1990 Trial State Assessment of grade 8 mathematics were included in the study. Various links were established, including within-state linkings for each state and a linking using the combined data from all four states. Ideally, the results from all linkings should be identical, apart from sampling error. Instead, the results showed considerable divergence. In one case, two state-level linkings produced predicted NAEP scores, for the same standardized test score, that differed by 20 NAEP scale points, nearly two-thirds of a within-grade standard deviation on the NAEP scale. At least part of the problem lay in content coverage: the various forms of the standardized tests covered only around one-half of the NAEP objectives. As noted by Ercikan: "It is not surprising for CTB's tests to have a smaller set of objectives since NAEP is aimed at surveying a large set of skills and does not test every student on these skills, whereas these tests are used for student-level achievement testing."

Linn and Kiplinger (1994) also investigated the adequacy of distributional matching procedures for linking state-level test results to NAEP. Their study used four states that participated in the 1990 and 1992 grade 8 Trial State Assessments of mathematics. Equipercentile methods were used within each state to convert standardized test results to the NAEP scale using data from 1990. The resulting conversion tables were then used to convert the standardized test results from 1992 to estimated 1992 state results on NAEP, which were then compared to the actual 1992 NAEP results for each state. Additionally, separate equating functions were developed for male and female students in the two states where gender identification was available from the state test data. The gender-based equatings showed differences larger than expected on the basis of sampling error, some as large as one-third of a NAEP within-grade standard deviation of proficiencies. The differences between the estimated and actual 1992 results were small at the median for three of the four states but were larger, and more variable across states, at the lower and upper ends of the distribution. Results from content studies suggested that the content coverage of NAEP and the statewide tests differ, and that this discrepancy might produce some of the between-group and between-time instability of the equating functions. Accordingly, the equating studies were repeated, this time linking the statewide tests to the NAEP mathematics subscale, Numbers and Operations, felt to be the closest match to the content coverage of the statewide tests. However, the between-group and across-time equating functions showed similar instabilities, even with the tighter content match. Linn and Kiplinger concluded that the displayed instability of the linking functions suggested that such linkings are not sufficiently trustworthy to be used for more than rough approximations.

Recognizing the importance of overlap in content coverage to the quality of a link, the National Center for Education Statistics (NCES) commissioned a study on the similarity of coverage of the NAEP and TIMSS instruments. Appendix A contains a synopsis of the results of the report by McLaughlin, Dossey, and Stancavage (May 1997) on the content comparisons for mathematics and the report by McLaughlin, Raizen, and Stancavage (April 1997) on the content comparisons for science. Both reports conclude that the NAEP and TIMSS instruments covered the same subareas of mathematics or science and were "generally sufficiently similar to warrant linkage for global comparisons—but not necessarily for detailed comparisons of areas of student achievement or processes in classrooms."