As described by Mislevy (1992) and Linn (1993), the central problems of "linking assessments" are determining the relationships between
the evidence that two measures give about performance of interest
and interpreting such evidence correctly. For the purposes of
discussion, assume that there are two assessments, Assessment
X and Assessment Y, and that the data produced by Assessment X
can provide answers, properly qualified, to various questions
involving student achievement. Further assume that there is a
desire to "link Assessment Y to Assessment X." This means that
one hopes to be able to answer these same questions, but using
students' performance on Assessment Y. A specific example is linking
the results of NAEP to the results of TIMSS to enable the prediction
of state-level TIMSS means, based on state-level NAEP data.
How well linking will work and the procedures needed to accomplish a link depend on how similar the two assessments are in terms of their goals, content coverage, and measurement properties. Mislevy and Linn defined four types of linking: equating, calibration, projection, and moderation. These are listed in decreasing order of both the strength of the assumptions required and the strength of the link produced: equating requires the strongest assumptions and yields the strongest link, while moderation requires the weakest assumptions and yields the weakest link.
3.1 Equating
The strongest link occurs if the two assessments are built to
the same specifications. Requirements include complete matches
in content coverage, difficulty, type of questions used, mode
of administration, test length, and measurement accuracy at each
score point. Under such carefully controlled circumstances, the
assessment results are essentially interchangeable and, by matching
up score distributions, it is possible to construct a one-to-one
correspondence table of scores on X and scores on Y so that any
question that could be addressed using scores from Assessment
X can be addressed in exactly the same way with transformed scores
from Assessment Y, and vice versa. When equating is possible,
it is because of the way the assessments were constructed, not
simply because of the way the linking data were collected or the
linking function constructed.
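A minimal sketch of this distribution-matching idea, using invented score samples and a hypothetical conversion function y_to_x (neither drawn from the report), is the following:

    import numpy as np

    # Invented data: two forms built to the same specifications, so each
    # student's score differs only by comparable measurement error.
    rng = np.random.default_rng(0)
    ability = rng.normal(250, 35, size=5000)
    scores_x = ability + rng.normal(0, 10, size=5000)
    scores_y = ability + rng.normal(0, 10, size=5000)

    # Match up the score distributions: pair off percentiles of X and Y
    # to form a one-to-one correspondence table.
    pct = np.arange(1, 100)
    x_q = np.percentile(scores_x, pct)
    y_q = np.percentile(scores_y, pct)

    def y_to_x(y):
        """Express a Y score on the X scale via the correspondence table."""
        return np.interp(y, y_q, x_q)

    print(y_to_x(260.0))  # a Y score of 260 reported on the X scale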
3.2 Calibration
A somewhat weaker kind of linking is possible if Assessment Y
has been constructed to the same framework as Assessment X, but
with different precision or level of difficulty. In this case,
equating is not possible, but the results of the two assessments
can be adjusted so that the expected score of a given student
is the same on both assessments. As a consequence of different
measurement characteristics in the X and Y data, the procedures
needed to permit Y data to answer certain questions that could
be addressed from X data will depend on the specific questions.
Thus, Y data might be used to answer X data questions, but generally
not by means of a single linking function as would be sufficient
for assessments built to support equating.
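The consequence of unequal precision can be illustrated with invented numbers: in the sketch below the two assessments agree in expected score, yet they answer a question about the proportion of students above a high cut score differently. All values are assumptions made only for illustration.

    import numpy as np

    rng = np.random.default_rng(1)
    true_score = rng.normal(250, 35, size=20000)

    # X is a long, precise form; Y is a shorter form built to the same
    # framework, so expected scores agree but Y carries more error.
    x = true_score + rng.normal(0, 8, size=true_score.size)
    y = true_score + rng.normal(0, 25, size=true_score.size)

    cut = 300.0
    print("means:", round(x.mean(), 1), round(y.mean(), 1))           # agree
    print("percent above cut, X:", round(100 * (x > cut).mean(), 1))
    print("percent above cut, Y:", round(100 * (y > cut).mean(), 1))  # larger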
3.3 Projection
A yet weaker linking obtains if the two assessments use different
types of tasks, different administration conditions, or otherwise
do not measure the same trait. Projection uses statistical methodology
(often regression) to derive predictions from Y data about characteristics
of the X distribution, expressed as a probability distribution
over the possible outcomes. As the similarity
between the two assessments decreases, the value of the Y data
to answer X data questions also decreases and projections become
increasingly sensitive to other sources of information. For example,
the relationship between X and Y might vary across subpopulations
of students and might change over time because of changes in policy
or instruction.
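As a sketch of the projection idea, the following code regresses invented X scores on invented Y scores and reports a point prediction together with a rough predictive range rather than a single interchangeable score; the coefficients and spreads are assumptions, not estimates from any actual assessment data.

    import numpy as np

    rng = np.random.default_rng(2)

    # Invented paired data: Y is related to X but measures a somewhat
    # different mixture of skills.
    y = rng.normal(500, 100, size=4000)
    x = 150 + 0.4 * y + rng.normal(0, 30, size=4000)

    # Projection: regress X on Y and report a predictive distribution,
    # summarized here by a point prediction and its residual spread.
    slope, intercept = np.polyfit(y, x, 1)
    resid_sd = np.std(x - (intercept + slope * y))

    y_new = 550.0
    print("predicted X:", round(intercept + slope * y_new, 1),
          "+/-", round(2 * resid_sd, 1), "(rough 95% range)")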
3.4 Moderation
The weakest linking occurs when the two assessments are not assumed
to be measuring the same construct, but scores that are comparable
in some sense are still desired. Often, the two assessments are
administered to nonoverlapping sets of students. Statistical moderation
matches X and Y score distributions by simply applying the formulas
of equating, while recognizing that the assessments have not been
constructed to support equating. The procedures of statistical
moderation can produce markedly different links among tests if
carried out with different samples of students.
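The sample dependence of moderation can be illustrated with invented data. In the sketch below, the same moment-matching rule is applied in two hypothetical student groups whose X and Y scores relate differently, and the same Y score converts to different X values depending on which group supplied the linking data; the helper linking_function and all parameters are illustrative.

    import numpy as np

    rng = np.random.default_rng(3)

    def linking_function(y_sample, x_sample):
        """Moment matching: map Y onto the X scale so that the linked
        scores reproduce the X sample's mean and standard deviation."""
        a = x_sample.std() / y_sample.std()
        b = x_sample.mean() - a * y_sample.mean()
        return lambda y: a * y + b

    n = 3000
    # Group 1: one hypothetical relationship between the two assessments.
    ability1 = rng.normal(0.0, 1.0, size=n)
    x1 = 250 + 35 * ability1 + rng.normal(0, 10, size=n)
    y1 = 500 + 60 * ability1 + rng.normal(0, 40, size=n)

    # Group 2: a different ability distribution and a different X-Y relation.
    ability2 = rng.normal(0.5, 0.8, size=n)
    x2 = 250 + 35 * ability2 + rng.normal(0, 10, size=n)
    y2 = 450 + 100 * ability2 + rng.normal(0, 40, size=n)

    link1 = linking_function(y1, x1)
    link2 = linking_function(y2, x2)
    print(round(link1(560.0), 1), round(link2(560.0), 1))  # the links differ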
3.5 Linear Moderation Procedures
Because these are the only data available, the link between TIMSS
and NAEP will be based on the reported results from the 1995 administration
of TIMSS in the United States and the results from the 1996 NAEP.
Since TIMSS and NAEP differ to varying degrees in terms of the
assessment specifications, the numbers and kinds of tasks presented
to students, and administration conditions, and since the TIMSS
data and the NAEP data come from distinct administrations conducted
one year apart, it is clear that the type of linking that can
be accomplished will fall into the realm of statistical moderation.
The link between the two assessments will be established by applying
formal equating procedures to match up characteristics of the
score distributions for the two assessments. The next section
establishes that linear moderation procedures (see, e.g., Petersen,
Kolen, and Hoover 1989, who call it linear equating) provide an
acceptable link between the two assessments. Linear moderation
adjusts the distributions of the two assessments so that they
have the same mean and standard deviation. This was the procedure
used by Beaton and Gonzalez (1993) with the 1991 IAEP data for
the U.S. sample of 13-year-olds and the 1990 NAEP data for public
school students in grade 8 to express the 1991 IAEP results on
the 1990 NAEP scale. This is also the procedure used by NAEP to
link the results from the Trial State Assessment to those of the
national assessment.
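In symbols, linear moderation places a score y from Assessment Y on the X scale by the transformation x* = mean(X) + (SD(X)/SD(Y)) * (y - mean(Y)). The sketch below implements this adjustment; the function name linear_moderation and the moments shown are placeholders, not the published NAEP or TIMSS statistics.

    import numpy as np

    def linear_moderation(y_scores, x_mean, x_sd):
        """Adjust Y scores so their distribution has the same mean and
        standard deviation as the X distribution."""
        y_scores = np.asarray(y_scores, dtype=float)
        return x_mean + (x_sd / y_scores.std()) * (y_scores - y_scores.mean())

    # Placeholder moments only (not the reported NAEP or TIMSS values).
    rng = np.random.default_rng(4)
    timss_like = rng.normal(500, 90, size=2000)   # hypothetical Y sample
    naep_mean, naep_sd = 272.0, 36.0              # hypothetical X moments

    linked = linear_moderation(timss_like, naep_mean, naep_sd)
    print(round(linked.mean(), 1), round(linked.std(), 1))  # 272.0, 36.0 by construction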
3.6 The Importance of Matching Content Coverage
Even though the type of link between NAEP and TIMSS falls into
the weakest category, moderation, the expectation is that the two
assessments measure more or less the same thing, so that it makes
sense to assume the linked results supply useful information. While
one hopes this is true, the following example demonstrates the danger
of conducting a linking study when no student has taken forms of
both assessments.
The two panels in Figure 1 give marginal distributions for pairs of tests to be linked.
It is stressed that these are completely fictional data generated
to make a point. The solid line in Panel A gives the frequency
distribution of scores on a hypothetical test (Test A1) while
the dashed line in Panel A gives the frequency distribution for
another hypothetical test (Test A2). Consider the possibility
of building a useful link between Test A1 and Test A2. Panel B
provides hypothetical marginal frequency distributions for two
other tests (Test B1 and Test B2) that are also to be linked together.

Note that the two distributions in Panel A are quite similar to
each other. With no other information, one would be inclined to
expect that the similarity of the marginal distributions for Test
A1 and Test A2 indicates a pair of tests that are strongly related
to each other and that would provide a strong and reliable linking,
a linking accomplished through, for example, matching percentiles
or moments of the two distributions. The two distributions shown
in Panel B are much less similar to each other. On the face of it, one
might be much less confident about the possibilities of linking
together Test B1 and Test B2.
However, consider Figure 2. Panel A of the figure gives a scatter plot of the scores on Test
A1 versus the scores on Test A2. Clearly there is little relation
between the scores on one test and the scores on the other; in
fact, the between-test correlation is 0.02. Tests A1 and A2 are
examples of data that have roughly the same marginal distributions
but are essentially unrelated to each other. An example might
be two tests that have been scaled to have approximately normal
marginal distributions but that measure roughly independent abilities
(such as physical strength and mathematical achievement). On the
other hand, Panel B shows a much tighter relationship (correlation
of .98) between scores on Test B1 and Test B2. These two tests
are examples of instruments that largely measure the same underlying
construct. An example of this might be two mathematics tests built
to the same content specifications that have somewhat different
marginal distributions.
Figure 2. Scatter plots of the hypothetical test score pairs: Panel A, Test A1 versus Test A2 (correlation = 0.02); Panel B, Test B1 versus Test B2 (correlation = 0.98).
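The contrast shown in Figure 2 is easy to reproduce by simulation. The sketch below uses invented parameters: the first pair of tests is drawn from nearly identical but independent distributions, while the second pair shares a common underlying ability with different scalings.

    import numpy as np

    rng = np.random.default_rng(5)
    n = 5000

    # Panel A analogue: similar marginal distributions, unrelated constructs.
    test_a1 = rng.normal(50, 10, size=n)
    test_a2 = rng.normal(50, 10, size=n)   # drawn independently of test_a1

    # Panel B analogue: same underlying construct, different marginals.
    ability = rng.normal(0, 1, size=n)
    test_b1 = 50 + 10 * ability + rng.normal(0, 2, size=n)
    test_b2 = 65 + 15 * ability + rng.normal(0, 3, size=n)

    print("corr(A1, A2):", round(np.corrcoef(test_a1, test_a2)[0, 1], 2))  # near 0
    print("corr(B1, B2):", round(np.corrcoef(test_b1, test_b2)[0, 1], 2))  # near 1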
In the current situation of linking NAEP and TIMSS, it is likely
that the two assessments are, to a large extent, measuring comparable
constructs. However, if the content match is not perfect, problems
with the linking can arise.
Several recent studies have established the instability of distributional
matching procedures and have attributed at least some of the blame
to differences in content coverage. For example, Ercikan (1997)
used equipercentile equating procedures to link state level results
from standardized tests (California Achievement Tests) published
by CTB MacMillan/McGraw-Hill to the 1990 NAEP mathematics scale.
Four states that participated in the 1990 Trial State Assessment
of grade 8 mathematics were included in the study. Various links
were established, including within-state linkings for each state
and a linking using the combined data from all four states. Ideally,
the results from all linkings should be identical, apart from
sampling error. Instead, the results showed considerable divergence.
In one case, two state-level linkings produced predicted NAEP
scores, for the same standardized test score, which differed by
20 NAEP scale points, nearly two-thirds of a within-grade standard
deviation on the NAEP scale. At least part of the problem lay
in content coverage: the various forms of the standardized
tests covered around one-half of the NAEP objectives. As noted
by Ercikan: "It is not surprising for CTB's tests to have a smaller
set of objectives since NAEP is aimed at surveying a large set
of skills and does not test every student on these skills, whereas
these tests are used for student-level achievement testing."
Linn and Kiplinger (1994) also investigated the adequacy of distributional
matching procedures for linking state-level test results to NAEP.
Their study used four states that participated in the 1990 and
1992 grade 8 Trial State Assessments of mathematics. Equipercentile
methods were used within each state to convert standardized test
results to the NAEP scale using data from 1990. The resulting
conversion tables were then used to convert the standardized test
results from 1992 to estimated 1992 results for the state on NAEP.
The predicted results were then compared to the actual 1992 NAEP
results for the state. Additionally, separate equating functions
were developed for male and female students for the two states
where gender identification was available from the state test
data. The gender-based equatings showed differences larger than
would be expected from sampling error alone, as large as
one-third of a NAEP within-grade standard deviation of proficiencies.
The differences between the estimated and actual 1992 results
were small at the median for three of the four states but were
larger, and more variable across states, for the lower and upper
ends of the distribution. Results from content studies suggested
that the content coverage of NAEP and the statewide tests differed,
and that this discrepancy might account for some of the between-group
and across-time instability of the equating functions. Accordingly,
the equating studies were repeated, this time linking the statewide
tests to the NAEP Numbers and Operations subscale, the content area
felt to be the closest match to the coverage of the statewide
tests. However, the between-group and across-time equating functions
showed similar instabilities, even with the tighter content match.
Linn and Kiplinger concluded that the observed instability of
the linking functions suggested that such linkings are not sufficiently
trustworthy to be used for anything other than rough approximations.
Recognizing the importance of overlap of content coverage on the
quality of a link, the National Center for Education Statistics
(NCES) commissioned a study on the similarity of coverage of the
NAEP and TIMSS instruments. Appendix A contains a synopsis of
the results of the report by McLaughlin, Dossey, and Stancavage
(May 1997) on the content comparisons for mathematics and the
report by McLaughlin, Raizen, and Stancavage (April 1997) on the
content comparisons for science. Both reports concluded that the
NAEP and TIMSS instruments covered the same subareas of mathematics
or science and were "generally sufficiently similar to warrant
linkage for global comparisons, but not necessarily for detailed
comparisons of areas of student achievement or processes in classrooms."