## 5. VARIANCE OF THE LINKING FUNCTION

If the means and standard deviations used to construct the linking function in Equation (1) were known without error, the transformed value $\hat{y}$ could be used in the same manner as an observed value from the TIMSS assessment. That is, one could ignore the fact that $\hat{y}$ was based on a transformation. Thus, for example, if $x$ were the mean proficiency of some state on NAEP, the predicted mean proficiency for that state on TIMSS would be $\hat{y} = \hat{A} + \hat{B}x$, from Equation (1), and the variance of that predicted TIMSS mean proficiency would be simply

$$\mathrm{Var}(\hat{y}) = \hat{B}^2\,\mathrm{Var}(x). \tag{2}$$

However, the means and standard deviations used to construct Equation (1) are based on sample data and hence are subject to various sources of variability. This implies that the linking function is also subject to variability and that the variance of $\hat{y}$ given in Equation (2) is too small. There are (at least) four sources of variability that will affect the variance of the linking function. These include the following:

• Sampling. NAEP and TIMSS are presented to samples of students.
• Measurement error. Both assessments are subject to imprecision in the measurement of proficiencies for individual students.
• Model misspecification. The linking function might differ by demographic subgroup.
• Temporal shift. TIMSS was conducted in 1995 while NAEP was conducted in 1996.

Each of these components will be considered in turn. Prior to that, however, a general equation needs to be developed for the variance of $\hat{y}$ in terms of the observed data. This equation will serve as a basis for the application of the various components of variance listed earlier.

Equation (1) expressed the linked value $\hat{y}$ as a function of the statistics $\hat{A}$ and $\hat{B}$, determined from the U.S. samples, and of the term $x$, assumed to be a statistic determined from NAEP data from a sample different from the U.S. NAEP and TIMSS samples.

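Since $\hat{y}$ is a nonlinear function of the various means and standard deviations, a precise derivation of the variance is not practical. However, since both the NAEP and TIMSS samples are large, Taylor series linearization provides a convenient large-sample approximation to the variance:

$$\mathrm{Var}(\hat{y}) \approx \left(\frac{\partial \hat{y}}{\partial \hat{A}},\ \frac{\partial \hat{y}}{\partial \hat{B}},\ \frac{\partial \hat{y}}{\partial x}\right) \Sigma \left(\frac{\partial \hat{y}}{\partial \hat{A}},\ \frac{\partial \hat{y}}{\partial \hat{B}},\ \frac{\partial \hat{y}}{\partial x}\right)^{T},$$

where the partial derivatives are evaluated at $\hat{A}$, $\hat{B}$, and $x$, respectively, the superscript $T$ denotes matrix transpose, and $\Sigma$ is the matrix

$$\Sigma = \begin{pmatrix} \mathrm{Var}(\hat{A}) & \mathrm{Cov}(\hat{A},\hat{B}) & 0 \\ \mathrm{Cov}(\hat{A},\hat{B}) & \mathrm{Var}(\hat{B}) & 0 \\ 0 & 0 & \mathrm{Var}(x) \end{pmatrix},$$

where $\mathrm{Cov}(\hat{A},\hat{B})$ is the covariance of $\hat{A}$ and $\hat{B}$, and where the covariances between $x$ and $\hat{A}$ and between $x$ and $\hat{B}$ are both zero since $x$ is from a sample independent of those used to construct the estimates of $\hat{A}$ and $\hat{B}$.

Since

$$\frac{\partial \hat{y}}{\partial \hat{A}} = 1, \qquad \frac{\partial \hat{y}}{\partial \hat{B}} = x, \qquad \frac{\partial \hat{y}}{\partial x} = \hat{B},$$

one has

$$\mathrm{Var}(\hat{y}) \approx \hat{B}^2\,\mathrm{Var}(x) + \mathrm{Var}(\hat{A}) + 2x\,\mathrm{Cov}(\hat{A},\hat{B}) + x^2\,\mathrm{Var}(\hat{B}). \tag{3}$$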

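Estimates of $\mathrm{Var}(\hat{A})$, $\mathrm{Var}(\hat{B})$, and $\mathrm{Cov}(\hat{A},\hat{B})$ can be obtained by expressing $\hat{A}$ and $\hat{B}$ in terms of the NAEP and TIMSS means and standard deviations, $\hat\mu_N$, $\hat\sigma_N$, $\hat\mu_T$, and $\hat\sigma_T$, and applying the delta method to the result.

Let $X = (\hat\mu_N, \hat\sigma_N, \hat\mu_T, \hat\sigma_T)^T$ and $\Sigma_{XX} = \mathrm{Var}(X)$. Since the mean and standard deviation from the NAEP sample are independent of those from the TIMSS sample, and since a sample mean and a sample standard deviation are independent assuming normality, $\Sigma_{XX}$ can be conveniently and credibly taken as a diagonal matrix with diagonal elements $\mathrm{Var}(\hat\mu_N)$, $\mathrm{Var}(\hat\sigma_N)$, $\mathrm{Var}(\hat\mu_T)$, and $\mathrm{Var}(\hat\sigma_T)$. As

$$\hat{A} = \hat\mu_T - \frac{\hat\sigma_T}{\hat\sigma_N}\,\hat\mu_N$$

and

$$\hat{B} = \frac{\hat\sigma_T}{\hat\sigma_N},$$

some algebra produces

$$\mathrm{Var}(\hat{y}) \approx \hat{B}^2\,\mathrm{Var}(x) + \mathrm{Var}(\hat\mu_T) + \hat{B}^2\,\mathrm{Var}(\hat\mu_N) + \left(\frac{x - \hat\mu_N}{\hat\sigma_N}\right)^2 \left[\mathrm{Var}(\hat\sigma_T) + \hat{B}^2\,\mathrm{Var}(\hat\sigma_N)\right]. \tag{4}$$

Since $\mathrm{Var}(\hat{y})$ depends on $x$ and $\mathrm{Var}(x)$, it is convenient to reexpress $\mathrm{Var}(\hat{y})$ as

$$\mathrm{Var}(\hat{y}) \approx \hat{B}^2\,\mathrm{Var}(x) + K_0 + K_1 x + K_2 x^2, \tag{5}$$

where

$$K_2 = \frac{\mathrm{Var}(\hat\sigma_T) + \hat{B}^2\,\mathrm{Var}(\hat\sigma_N)}{\hat\sigma_N^2}, \qquad K_1 = -2\hat\mu_N K_2, \qquad K_0 = \mathrm{Var}(\hat\mu_T) + \hat{B}^2\,\mathrm{Var}(\hat\mu_N) + \hat\mu_N^2 K_2.$$

Comparing Equation (5) with Equation (3) shows that $K_0 = \mathrm{Var}(\hat{A})$, $K_1 = 2\,\mathrm{Cov}(\hat{A},\hat{B})$, and $K_2 = \mathrm{Var}(\hat{B})$.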

Equation (4) and the equivalent Equation (5) form the basis of the variance estimate of $\hat{y}$. In the subsequent discussion, estimates of the successive components of $\mathrm{Var}(\hat{y})$ due to sampling, measurement error, model misspecification, and temporal shift will be derived, accompanied by a comparison of how the standard error of a linked estimate changes.
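The equivalence of the two forms can be checked numerically. The Python sketch below assumes the mean-and-sigma form of the link, $\hat{y} = \hat\mu_T + (\hat\sigma_T/\hat\sigma_N)(x - \hat\mu_N)$, and a diagonal covariance matrix for the four statistics. The function names and all input values are hypothetical, chosen only to exercise the formulas; they are not figures from the report.

```python
def link_coefficients(mu_n, sd_n, mu_t, sd_t):
    """Intercept and slope of the mean-and-sigma link y = a + b * x."""
    b = sd_t / sd_n
    a = mu_t - b * mu_n
    return a, b


def delta_method_variance(x, var_x, mu_n, sd_n, mu_t, sd_t,
                          var_mu_n, var_sd_n, var_mu_t, var_sd_t):
    """Linearized variance of y, treating all five inputs as independent."""
    _, b = link_coefficients(mu_n, sd_n, mu_t, sd_t)
    z = (x - mu_n) / sd_n
    # Partial derivatives of y with respect to mu_n, sd_n, mu_t, sd_t and x.
    gradient = [-b, -b * z, 1.0, z, b]
    variances = [var_mu_n, var_sd_n, var_mu_t, var_sd_t, var_x]
    return sum(g * g * v for g, v in zip(gradient, variances))


def quadratic_components(mu_n, sd_n, mu_t, sd_t,
                         var_mu_n, var_sd_n, var_mu_t, var_sd_t):
    """Coefficients K0, K1, K2 of the same variance written as a quadratic in x."""
    _, b = link_coefficients(mu_n, sd_n, mu_t, sd_t)
    k2 = (var_sd_t + b * b * var_sd_n) / sd_n ** 2
    k1 = -2.0 * mu_n * k2
    k0 = var_mu_t + b * b * var_mu_n + mu_n ** 2 * k2
    return k0, k1, k2


# Hypothetical inputs, not values from the report.
stats = dict(mu_n=270.0, sd_n=36.0, mu_t=500.0, sd_t=90.0,
             var_mu_n=1.0, var_sd_n=0.5, var_mu_t=20.0, var_sd_t=4.0)
x, var_x = 300.0, 1.0

direct = delta_method_variance(x, var_x, **stats)
_, slope = link_coefficients(stats["mu_n"], stats["sd_n"],
                             stats["mu_t"], stats["sd_t"])
k0, k1, k2 = quadratic_components(**stats)
quadratic = slope ** 2 * var_x + k0 + k1 * x + k2 * x * x
print(direct, quadratic)
```

Under these assumptions the two routes agree up to floating-point error, and the variance is smallest when `x` equals `mu_n`, because the terms involving the standard deviations vanish there.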

As observed, the variance of $\hat{y}$ depends on the value of $x$ and the value of $\mathrm{Var}(x)$. For convenience, the comparisons of the components of $\mathrm{Var}(\hat{y})$ will be for a typical value of $\mathrm{Var}(x)$, equal to the variance of the mean for the U.S. NAEP population. Additionally, two values of $x$ will be used. The first, setting $x$ equal to the U.S. overall NAEP mean, provides the smallest possible variance. The second, setting $x$ equal to the 90th percentile of the NAEP proficiency distribution, provides an indication of how large $\mathrm{Var}(\hat{y})$ could get. The specific values to be used are shown in Table 2.

Table 2. Values of Var(x) and x used for comparing variances of the linked estimate for grade 8

| Subject | Var(x) of U.S. mean | x = U.S. mean | x = U.S. 90th percentile |
|---|---|---|---|
| Mathematics | 1.1256 | 272.0 | 317.5 |
| Science | 0.7826 | 150.0 | 191.7 |

### 5.1 Component of $\mathrm{Var}(\hat{y})$ Due to Sampling

Because both NAEP and TIMSS are based on samples, the estimates of the means and standard deviations that enter the linking function are subject to sampling variability. Estimates of sampling variability quantify the stability of the sample-based statistics by estimating how much each statistic would likely change had it been based on a different, but equivalent, sample of students selected in the same manner as the achieved sample.

Traditional analysis procedures often assume that the observed data come from a simple random sample. That is, it is assumed that the observed values from different respondents are independent of each other and that these values are identically distributed. Such assumptions do not hold for data from complex sampling designs such as those used by NAEP and TIMSS. In fact, the complex sample designs of NAEP and TIMSS lead to variance estimates that are larger than the simple random sampling values.

Both assessments use the jackknife procedure (see, e.g., Johnson and Rust 1992) to estimate the variance due to sampling. The aim of the jackknife is to simulate the repeated drawing of samples of individuals according to the specified sample design. Once the various replicate samples are available, it is straightforward to compute the statistic of interest, $t$, on each sample and, from these, obtain a variance estimate. Pairs of first-stage sampling units (FSSUs) are defined to model the sample design as one in which two first-stage units are drawn within each of a number of strata. The sampling variability of any statistic $t$ is estimated as the sum of the components of variability that may be attributed to each of the FSSU pairs. The variance attributed to a particular pair of FSSUs is measured by recomputing the statistic of interest, $t$, on an altered sample. The $i$th altered sample is created by randomly designating the two members of the $i$th FSSU pair as the first and second, respectively, eliminating the data from the first FSSU, and replacing the lost information with that from the second FSSU of the pair. The statistic of interest is then recomputed, producing the pseudoreplicate estimate $t_i$.

The component of sampling variability attributable to the $i$th pair of FSSUs is $(t_i - t)^2$. The estimated sampling variance of the statistic $t$ is the sum of these components across the $M$ FSSU pairs (see footnote 2):

$$\widehat{\mathrm{Var}}(t) = \sum_{i=1}^{M} (t_i - t)^2. \tag{6}$$
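The paired jackknife described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the NAEP or TIMSS production procedure: the data layout (a list of FSSU pairs, each FSSU a list of `(value, weight)` tuples), the function names, and the choice to replace the dropped unit by doubling the weights of the retained unit are all assumptions made for this example.

```python
import random


def weighted_mean(values, weights):
    """Statistic of interest for this sketch: a weighted mean."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def paired_jackknife_variance(pairs, statistic=weighted_mean, seed=0):
    """Sum of squared deviations of the pseudoreplicate estimates from t.

    pairs: list of (fssu_a, fssu_b); each FSSU is a list of (value, weight).
    """
    rng = random.Random(seed)
    full_sample = [obs for a, b in pairs for obs in a + b]
    t = statistic([v for v, _ in full_sample], [w for _, w in full_sample])

    variance = 0.0
    for i, (a, b) in enumerate(pairs):
        # Randomly designate which member of the pair is dropped.
        dropped, kept = (a, b) if rng.random() < 0.5 else (b, a)
        # Stand in for the dropped unit by doubling the weights of the kept one.
        altered = [(v, 2.0 * w) for v, w in kept]
        for j, (c, d) in enumerate(pairs):
            if j != i:
                altered.extend(c + d)
        t_i = statistic([v for v, _ in altered], [w for _, w in altered])
        variance += (t_i - t) ** 2
    return t, variance


# Tiny made-up data set: two strata, two FSSUs each, unit weights.
pairs = [
    ([(1.0, 1.0), (2.0, 1.0)], [(3.0, 1.0), (4.0, 1.0)]),
    ([(5.0, 1.0), (6.0, 1.0)], [(7.0, 1.0), (8.0, 1.0)]),
]
t, var_t = paired_jackknife_variance(pairs)
print(t, var_t)
```

For a weighted mean with equal-sized units, as in this toy data set, the squared deviation for a pair does not depend on which member is dropped, so the result is the same for any seed.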

To estimate the sampling variance of the linking function, the jackknife procedure is applied to estimate the sampling variance for each of $\hat\mu_N$, $\hat\sigma_N$, $\hat\mu_T$, and $\hat\sigma_T$ (see footnote 3). These variance estimates are then substituted into Equation (4). The results are shown in Table 3, which gives the sampling variance values of the components of $\mathrm{Var}(\hat{y})$ in Equation (5).

2 The variance of a statistic based on a stratified sample is the sum of the variances within each stratum, each multiplied by constants reflecting the degrees of freedom of the within-stratum variance and various weighting factors. There is no further division by degrees-of-freedom adjustments. In the case of NAEP and TIMSS, the paired FSSU estimates each have a single degree-of-freedom, and the jackknife estimates are derived so that the weighting factors are identical to 1. See Wolter (1985, Section 4.5) and Johnson (1989, pages 315-316, 321-322).

3 Following accepted practice, the jackknife variance estimates were based only on the first plausible value (see Mislevy, Johnson, and Muraki 1992).

Table 3. Components of $\mathrm{Var}(\hat{y})$ due to sampling for grade 8

| Subject | K0 (multiplies 1) | K1 (multiplies x) | K2 (multiplies x²) |
|---|---|---|---|
| Mathematics | 222.63 | -1.4284 | 2.6263E-3 |
| Science | 120.28 | -1.2096 | 4.0320E-3 |

Table 4 provides a comparison between the naive estimate of the variance of $\hat{y}$ from Equation (2) and the current estimate, which also accounts for the effect of sampling, for the values of Var(x) and x given in Table 2. The column headed "Percentage increase" gives the amount by which the addition of the sampling component increases the variance estimate.

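Table 4. Comparison of the naive estimate of $\mathrm{Var}(\hat{y})$ with the estimate including sampling error for grade 8

| Subject | x = U.S. mean: naive Var | x = U.S. mean: variance including sampling | x = U.S. mean: percentage increase | x = U.S. 90th percentile: naive Var | x = U.S. 90th percentile: variance including sampling | x = U.S. 90th percentile: percentage increase |
|---|---|---|---|---|---|---|
| Mathematics | 7.02 | 35.41 | 404% | 7.02 | 40.85 | 482% |
| Science | 7.46 | 37.01 | 396% | 7.46 | 44.03 | 490% |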

These results show that including the sampling variability as a component of the variance of the linked estimate substantially increases that variance estimate, here by a factor of roughly five to six. The increases shown here are in accord with similar findings presented by Johnson, Mislevy, and Zwick (1990), who report a study in which the traditional estimate of the standard error of a linked estimate of the mean underestimated, by a factor of 1.6, a standard error that properly took the sampling variance into account.
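The entries of Table 4 can be approximately reproduced from the published coefficients. The Python sketch below assumes that the naive variance is the squared slope times Var(x) and that the sampling component is the quadratic K0 + K1·x + K2·x² with the coefficients of Table 3. The slopes 2.498 (mathematics) and 3.087 (science) are taken from the "Total" rows of the subpopulation tables in Section 5.3; the variable and function names are illustrative.

```python
VAR_X = {"Mathematics": 1.1256, "Science": 0.7826}          # Table 2
X_VALUES = {                                                 # Table 2
    "Mathematics": {"mean": 272.0, "p90": 317.5},
    "Science": {"mean": 150.0, "p90": 191.7},
}
SLOPE = {"Mathematics": 2.498, "Science": 3.087}             # "Total" rows, Section 5.3
SAMPLING_K = {                                               # Table 3
    "Mathematics": (222.63, -1.4284, 2.6263e-3),
    "Science": (120.28, -1.2096, 4.0320e-3),
}


def naive_variance(subject):
    """Squared slope times Var(x); ignores uncertainty in the link itself."""
    return SLOPE[subject] ** 2 * VAR_X[subject]


def variance_with_sampling(subject, x):
    """Naive variance plus the sampling component K0 + K1*x + K2*x^2."""
    k0, k1, k2 = SAMPLING_K[subject]
    return naive_variance(subject) + k0 + k1 * x + k2 * x * x


for subject, points in X_VALUES.items():
    naive = naive_variance(subject)
    for label, x in points.items():
        total = variance_with_sampling(subject, x)
        increase = 100.0 * (total - naive) / naive
        print(f"{subject:12s} {label:4s} naive={naive:.2f} "
              f"total={total:.2f} increase={increase:.0f}%")
```

Because the published coefficients are rounded, the computed totals agree with Table 4 only to within a few hundredths.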

### 5.2 Component of $\mathrm{Var}(\hat{y})$ Due to Measurement Error

Both NAEP and TIMSS use IRT scaling models to summarize their data (see, e.g., Mislevy, Johnson, and Muraki 1992). IRT was developed in the context of measuring individual examinees' abilities. In that setting, each individual is administered enough items to permit a reasonably precise estimation of his or her ability, $\theta$. Because the uncertainty associated with each $\theta$ is negligible, the distribution of $\theta$, or the joint distribution of $\theta$ with other variables, can then be approximated using individuals' estimated abilities, $\hat\theta$, as if they were the true abilities. This approach breaks down in NAEP and TIMSS, where each respondent is administered relatively few items in a scaling area. The problem is that the uncertainty associated with individual $\theta$s is too large to ignore, and the features of the $\hat\theta$ distribution can be seriously biased as estimates of the $\theta$ distribution (see Mislevy, Beaton, Kaplan, and Sheehan 1992). "Plausible values" were developed as a way to estimate key population features consistently.

The essential idea of plausible value methodology is to represent what the true proficiency of an individual might have been, had it been observed, with a small number of random draws from an empirically derived distribution of proficiency values that is conditional on the observed values of the assessment items and on background variables for each sampled student. These background variables are called conditioning variables. The random draws from the distribution can be considered to be representative values from the distribution of potential proficiencies for all students in the population with similar characteristics and identical patterns of item responses. The several draws from the distribution are different from each other in a way that quantifies the degree of precision in the underlying distribution of possible proficiencies that could have generated the observed performances on the items.

Both NAEP and TIMSS provide five sets of plausible values. Following Rubin (1987), the plausible values are regarded as five completed data sets, where the $m$th data set consists of all information about each student along with the $m$th plausible value for that student. Calculating a statistic, $t$, based on the $m$th plausible value across all students provides an estimate, $t^{(m)}$, of $t$. A better estimate of $t$ is $t_M$, the mean of the $t^{(m)}$.

The variance of $t_M$ consists of two components. The first component is the variance due to sampling subjects. There are five potential estimates of this variance, one for each plausible value, the $m$th estimated as the jackknife variance of $t^{(m)}$ according to Equation (6). While the best estimate of the sampling variance of $t_M$ is the average of the five jackknife estimates, the heavy computational requirement of computing five jackknife variances means that the typical practice used by NAEP and TIMSS is simply to use the jackknife variance for the first plausible value. That practice will be followed in this report.

The second component of the variance of $t_M$ is that which is due to not observing $\theta$. This component is added to the sampling component in Equation (6) and is estimated by

$$\left(1 + \frac{1}{M}\right) \frac{1}{M-1} \sum_{m=1}^{M} \left(t^{(m)} - t_M\right)^2, \tag{7}$$

where $M = 5$ here denotes the number of plausible values.
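The combination of the two components can be sketched in Python. The sketch assumes that the measurement error component takes the usual multiple-imputation form following Rubin (1987), namely (1 + 1/M) times the sample variance of the M plausible-value estimates; the function name and the numbers are hypothetical.

```python
from statistics import mean, variance


def combine_plausible_values(estimates, sampling_variance):
    """Combine per-plausible-value estimates of a statistic.

    estimates: the statistic computed once per plausible value.
    sampling_variance: jackknife variance, here assumed to come from the
        first plausible value only, as in the text.
    Returns (combined estimate, total variance, imputation component).
    """
    m = len(estimates)
    t_combined = mean(estimates)
    between = variance(estimates)            # divides by m - 1
    imputation = (1.0 + 1.0 / m) * between   # measurement error component
    return t_combined, sampling_variance + imputation, imputation


# Hypothetical estimates from five plausible values, not data from the report.
estimates = [1.0, 2.0, 3.0, 4.0, 5.0]
t_combined, total_variance, imputation = combine_plausible_values(estimates, 10.0)
print(t_combined, total_variance, imputation)
```

The imputation component grows with the spread among the five estimates, so statistics that are poorly determined by the item responses carry a larger measurement error term.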

4 In its analysis, TIMSS essentially used a single conditioning variable, grade, within each country. NAEP used several hundred.

Table 5 gives the components of $\mathrm{Var}(\hat{y})$ in Equation (5) attributable to measurement error. These components are roughly four to five times smaller than the equivalent components for sampling error shown in Table 3.

Table 5. Components of $\mathrm{Var}(\hat{y})$ due to measurement error for grade 8

| Subject | K0 (multiplies 1) | K1 (multiplies x) | K2 (multiplies x²) |
|---|---|---|---|
| Mathematics | 45.11 | -0.3286 | 6.0419E-4 |
| Science | 23.04 | -0.2977 | 9.9238E-4 |

Table 6 provides a comparison between the estimate of the variance of $\hat{y}$ based on the naive estimate plus the term accounting for sampling error and the current estimate, which also accounts for the effect of measurement error. Included in the table is the percentage increase in the size of the naive variance that would have been obtained if the measurement error (but not the sampling error) were added to the variance. As in Table 4, the table uses the values of Var(x) and x from Table 2.

Table 6. Comparison of the estimate of $\mathrm{Var}(\hat{y})$ before and after including measurement error for grade 8

| Subject | x = U.S. mean: naive plus sampling | x = U.S. mean: plus measurement error | x = U.S. mean: percentage increase over naive due to measurement error | x = U.S. 90th percentile: naive plus sampling | x = U.S. 90th percentile: plus measurement error | x = U.S. 90th percentile: percentage increase over naive due to measurement error |
|---|---|---|---|---|---|---|
| Mathematics | 35.41 | 35.83 | 6% | 40.85 | 42.52 | 24% |
| Science | 37.01 | 37.77 | 9% | 44.03 | 46.46 | 32% |

It can be seen that, while the measurement error provides a noticeable increase in the size of the naive variance estimate, the bulk of the overall variance is determined by the sampling error component.
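The measurement error entries of Table 6 can be checked in the same way as the sampling entries. In the Python sketch below, the measurement error component is again taken to be K0 + K1·x + K2·x². The intercept terms K0 are entered as 45.11 (mathematics) and 23.04 (science); this is an assumption adopted here because these are the values for which the quadratic stays nonnegative and reproduces Table 6. All names are illustrative.

```python
X_VALUES = {                                   # Table 2
    "Mathematics": {"mean": 272.0, "p90": 317.5},
    "Science": {"mean": 150.0, "p90": 191.7},
}
MEASUREMENT_K = {                              # K0 values are an assumption, see text
    "Mathematics": (45.11, -0.3286, 6.0419e-4),
    "Science": (23.04, -0.2977, 9.9238e-4),
}
NAIVE = {"Mathematics": 7.02, "Science": 7.46}  # Table 4
NAIVE_PLUS_SAMPLING = {                          # Table 6
    ("Mathematics", "mean"): 35.41, ("Mathematics", "p90"): 40.85,
    ("Science", "mean"): 37.01, ("Science", "p90"): 44.03,
}


def measurement_component(subject, x):
    """Measurement error contribution K0 + K1*x + K2*x^2 to the link variance."""
    k0, k1, k2 = MEASUREMENT_K[subject]
    return k0 + k1 * x + k2 * x * x


totals = {}
for (subject, label), base in NAIVE_PLUS_SAMPLING.items():
    component = measurement_component(subject, X_VALUES[subject][label])
    totals[(subject, label)] = base + component
    share = 100.0 * component / NAIVE[subject]
    print(f"{subject:12s} {label:4s} component={component:.2f} "
          f"total={base + component:.2f} increase over naive={share:.0f}%")
```

With rounded coefficients the totals match Table 6 to within about 0.05, and the component is small relative to the sampling term at both values of x.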

### 5.3 Component of $\mathrm{Var}(\hat{y})$ Due to Model Misspecification

As discussed earlier, statistical moderation can produce markedly different links if carried out with different samples of students. To be useful, the link between NAEP and TIMSS should be the same for various subpopulations. That is, the function linking TIMSS to NAEP should be the same for boys as it is for girls, for members of various ethnic categories, and for students in public and private schools. To the extent that the link is consistent across the subpopulations, there is increased confidence in the goodness of the link.

Tables 7A and 7B provide estimates of $\hat{A}$ and $\hat{B}$ from Equation (1) for subpopulations defined by gender, selected race/ethnicity (black, Hispanic), and school type (public, private). In each case, the link was formed using data only from that subpopulation. The tables also include values of $\hat{y}$ for the values of $x$ equal to the U.S. mean and the 90th percentile, along with standard errors, computed from the subpopulation data, which include the naive, sampling, and measurement components of variance. Note that the values in the tables are somewhat biased due to the absence of conditioning variables related to these subgroups in the generation of plausible values from TIMSS at grade 8. It is known (Mislevy, Beaton, Sheehan, and Kaplan 1992) that exclusion of conditioning variables leads to underestimation of differences between subgroup and overall means. Following Mislevy (1993), the bias in the subgroup estimate is of the order of a factor, determined by the reliability of a form of the TIMSS instrument for the U.S. population (reported to be around .8 to .9), times the difference between the subgroup and overall NAEP means. Nevertheless, these functions accurately reflect the reported TIMSS distributions for these subgroups.

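Table 7A (mathematics)

| Subpopulation | $\hat{A}$ | $\hat{B}$ | $\hat{y}$ at x = U.S. mean | SE | $\hat{y}$ at x = U.S. 90th percentile | SE |
|---|---|---|---|---|---|---|
| Total | -180.13 | 2.498 | 499.4 | 6.0 | 613.1 | 6.5 |
| Female | -195.90 | 2.545 | 496.5 | 5.9 | 612.3 | 6.7 |
| Male | -168.15 | 2.466 | 502.6 | 6.7 | 614.8 | 7.4 |
| Black | -120.16 | 2.295 | 504.1 | 7.5 | 608.5 | 10.3 |
| Hispanic | -129.79 | 2.313 | 499.3 | 7.6 | 604.6 | 11.7 |
| Private | -256.63 | 2.702 | 478.3 | 12.8 | 601.3 | 14.2 |
| Public | -176.55 | 2.493 | 501.7 | 6.1 | 615.1 | 6.7 |

Table 7B (science)

| Subpopulation | $\hat{A}$ | $\hat{B}$ | $\hat{y}$ at x = U.S. mean | SE | $\hat{y}$ at x = U.S. 90th percentile | SE |
|---|---|---|---|---|---|---|
| Total | 70.62 | 3.087 | 533.7 | 6.1 | 662.4 | 6.8 |
| Female | 77.07 | 3.024 | 530.6 | 7.0 | 656.7 | 7.8 |
| Male | 69.15 | 3.119 | 537.0 | 6.6 | 667.1 | 7.6 |
| Black | 78.06 | 3.138 | 548.8 | 8.1 | 679.7 | 12.0 |
| Hispanic | 101.55 | 2.893 | 535.5 | 7.6 | 656.2 | 10.4 |
| Private | 26.91 | 3.214 | 508.9 | 13.5 | 642.9 | 14.9 |
| Public | 72.44 | 3.096 | 536.8 | 6.3 | 665.9 | 7.0 |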

Examination of Tables 7A and 7B shows some variability in the parameter estimates across subgroups, particularly for the intercepts, $\hat{A}$. Additionally, the estimates of the slope, $\hat{B}$, vary somewhat. However, the differences in $\hat{y}$ between subgroups and between a subgroup and the total population are invariably nonsignificant. This nonsignificance would appear to sanction the use of the overall linking function for the subgroups examined here. Nevertheless, the consequences of variability of the linking function across subgroups will be explored.

In essence, variability of the linking function across subpopulations is an indication of model misspecification. That is, the linking function needs to include terms related to specific subpopulations. This was the approach adopted by Williams et al. (1995) in their linking of NAEP to the North Carolina End of Grade (NC-EOG) mathematics test. In their study, they noted different relationships between the NC-EOG and NAEP by gender and race. These differences were accounted for through the use of a prediction equation that included intercepts and slopes for those groups. A similar approach was adopted by Bloxom et al. (1995) in a linkage of scaled scores on the Armed Services Vocational Aptitude Battery (ASVAB) with NAEP.

However, both the NC-EOG and the ASVAB situations involved the construction of a linking function that would then be applied to individuals who are plausible members of the same population. That is, the NC-EOG to NAEP link was derived on a sample of North Carolina students for application in North Carolina, and the ASVAB to NAEP link was based on a sample of the population to which the ASVAB is normally administered.

This is less clearly the case for the linking of NAEP to TIMSS, where the linking is performed on the combined U.S. population, but the results are to be applied to separate states. Instead, it is reasonable to view the instability of the linking function across subgroups as a potential component of variance of the linking function.

Suppose one has N subpopulations, which collectively constitute a partitioning of the population. For specificity, the 12 subpopulations formed by crossing gender by race/ethnicity (black, Hispanic, white+Asian+other) by school type (public, private) will be used. The selection of these specific subpopulations was made because they are key subgroups, and because the linking function could potentially differ across the subgroups.

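For subpopulation $s$, suppose the linking function is

$$\hat{y}_s = \hat{A}_s + \hat{B}_s x,$$

where $\hat{A}_s$ and $\hat{B}_s$ are estimated solely from the data for subpopulation $s$. From Equation (3), one has

$$\mathrm{Var}(\hat{y}_s) \approx \hat{B}_s^2\,\mathrm{Var}(x) + \mathrm{Var}(\hat{A}_s) + 2x\,\mathrm{Cov}(\hat{A}_s,\hat{B}_s) + x^2\,\mathrm{Var}(\hat{B}_s). \tag{8}$$

Notice that $A_s + B_s x$, the quantity estimated by $\hat{y}_s$, can be viewed as the conditional expectation of the linked estimate, conditional on membership in subpopulation $s$. Further, $\mathrm{Var}(\hat{y}_s)$ in Equation (8) is the conditional variance. To emphasize this conditional relation, write

$$E(\hat{y} \mid S = s) = A_s + B_s x, \qquad \mathrm{Var}(\hat{y} \mid S = s) = \mathrm{Var}(\hat{y}_s),$$

where $E$ denotes expectation and $S$ stands for subpopulation. By standard probability theory, the unconditional variance of $\hat{y}$ has the representation

$$\mathrm{Var}(\hat{y}) = E_S\left[\mathrm{Var}(\hat{y} \mid S)\right] + \mathrm{Var}_S\left[E(\hat{y} \mid S)\right], \tag{9}$$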

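where $E_S$ and $\mathrm{Var}_S$ denote the expectation and variance taken across subpopulations. The first term of Equation (9) is

$$E_S\left[\mathrm{Var}(\hat{y} \mid S)\right] \approx E_S\left[\hat{B}_s^2\right]\mathrm{Var}(x) + E_S\left[\mathrm{Var}(\hat{A}_s)\right] + 2x\,E_S\left[\mathrm{Cov}(\hat{A}_s,\hat{B}_s)\right] + x^2\,E_S\left[\mathrm{Var}(\hat{B}_s)\right], \tag{10}$$

where, for example,

$$E_S\left[\mathrm{Var}(\hat{A}_s)\right] = \sum_{s=1}^{N} rf_s\,\mathrm{Var}(\hat{A}_s) \tag{11}$$

is the weighted average of the subpopulation values of $\mathrm{Var}(\hat{A}_s)$, weighting by $rf_s$, the relative frequency of subpopulation $s$ in the whole population.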

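Approximating Equation (11) by $\mathrm{Var}(\hat{A})$, the value for the complete population, and performing similar substitutions for the remaining terms in Equation (10) means that Equation (10) can be approximated by Equation (3). Consequently, Equation (9) becomes

$$\mathrm{Var}(\hat{y}) \approx \hat{B}^2\,\mathrm{Var}(x) + \mathrm{Var}(\hat{A}) + 2x\,\mathrm{Cov}(\hat{A},\hat{B}) + x^2\,\mathrm{Var}(\hat{B}) + \mathrm{Var}_S\left[E(\hat{y} \mid S)\right]. \tag{12}$$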

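Thus, the variance of $\hat{y}$ has acquired a second component, which measures instability (or mean-squared error) due to the variability of the linking function across subpopulations. The value of this component is

$$\mathrm{Var}_S\left[E(\hat{y} \mid S)\right] = \sum_{s=1}^{N} rf_s \left[(A_s - \bar{A}) + (B_s - \bar{B})\,x\right]^2, \tag{13}$$

where $A_s$ and $B_s$ are the population values of the intercept and slope for subpopulation $s$ and $\bar{A}$ and $\bar{B}$ are their averages across the subpopulations. An estimate of this component is

$$\widehat{\mathrm{Var}}_S\left[E(\hat{y} \mid S)\right] = \sum_{s=1}^{N} rf_s \left[(\hat{A}_s - \bar{\hat{A}}) + (\hat{B}_s - \bar{\hat{B}})\,x\right]^2, \tag{14}$$

where $\bar{\hat{A}}$ and $\bar{\hat{B}}$ are the corresponding averages of the sample estimates.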
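The between-subpopulation component can be illustrated with a short Python sketch. It computes the weighted squared deviation of each subpopulation's linked value from the weighted average linked value, which is the structure described above. The example uses the female and male mathematics intercepts and slopes from Table 7A with equal relative frequencies; the equal weights, the use of only two subpopulations (the report's calculation uses 12), and the function name are assumptions made for illustration, so the result is not comparable with Table 8.

```python
def misspecification_component(x, intercepts, slopes, rel_freqs):
    """Weighted variance across subpopulations of the linked value a_s + b_s * x."""
    total = float(sum(rel_freqs))
    weights = [f / total for f in rel_freqs]
    a_bar = sum(w * a for w, a in zip(weights, intercepts))
    b_bar = sum(w * b for w, b in zip(weights, slopes))
    return sum(w * ((a - a_bar) + (b - b_bar) * x) ** 2
               for w, a, b in zip(weights, intercepts, slopes))


# Female and male rows of Table 7A (mathematics).
intercepts = [-195.90, -168.15]
slopes = [2.545, 2.466]
rel_freqs = [0.5, 0.5]  # assumed equal shares, for illustration only

component = misspecification_component(272.0, intercepts, slopes, rel_freqs)
print(round(component, 2))
```

Because intercept and slope differences partly offset each other near the U.S. mean, the component depends on x and is not simply the sum of an intercept term and a slope term.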

Note that even if $A_s = \bar{A}$ and $B_s = \bar{B}$ for all $s$, so that the variance component in Equation (13) is equal to zero, the estimate from Equation (14) will be nonzero simply because it is based on sample values. Consequently, a correction to the estimate must be applied. Normal theory with linear statistics gives the expectation of the estimate in Equation (14) as a multiple of $\sigma^2$, where the multiplier depends on $N$, the number of subpopulations (equal to 12 in this case), and on $d/D$, the ratio of $d$, the average design effect (defined below) within a subpopulation, to $D$, the design effect for the whole population, and where

$$\sigma^2 = \mathrm{Var}(\hat{A}) + 2x\,\mathrm{Cov}(\hat{A},\hat{B}) + x^2\,\mathrm{Var}(\hat{B}), \tag{15}$$

with estimate $\hat\sigma^2$ that includes both the sampling and measurement error components.

The design effect measures the impact of complex sample data collection designs, such as used by NAEP and TIMSS, on the variance of a statistic. Specifically, the design effect is the ratio of the actual variance of the statistic, taking the data collection design into account, to the equivalent variance estimate obtained by ignoring the complex nature of the data caused by the sample design and by measurement error. Typically, the design effect is larger than 1. Additionally, it is possible that the design effects for subpopulations are smaller than those for the total population, implying that the ratio, d/D, could be smaller than 1. Experience based on NAEP, TIMSS, and other complex data sets suggests that the ratio could be as small as 0.5, implying that the multiplier for the expected value of the estimate of variance due to model misspecification could be as small as 5.

Table 8 gives the values of the estimate from Equation (14) and of $\hat\sigma^2$ for the values of Var(x) and x in Table 2. In every case, the ratio of the estimate to $\hat\sigma^2$ is smaller than the factor 5, so that the estimate of the variance due to model misspecification is smaller than a reasonable estimate of its expected value. Furthermore, this implies that the variance estimate, expressed as a ratio to $\hat\sigma^2$, is much smaller than a critical value for, say, the 95% level of significance, which for 5 degrees of freedom is about 11. This indicates that the variance estimate does not exceed the value to be expected due to sample and imputation variability under the hypothesis that the true component of Equation (13) is zero. Consequently, the component due to model misspecification in the variance of the link is taken as zero.

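Table 8. Comparison of the estimated component of variance due to model misspecification with $\hat\sigma^2$ from Equation (15) for grade 8

| Subject | x = U.S. mean: estimate | x = U.S. mean: $\hat\sigma^2$ | x = U.S. mean: ratio | x = U.S. 90th percentile: estimate | x = U.S. 90th percentile: $\hat\sigma^2$ | x = U.S. 90th percentile: ratio |
|---|---|---|---|---|---|---|
| Mathematics | 101.3 | 28.8 | 3.5 | 82.1 | 35.5 | 2.3 |
| Science | 113.7 | 30.3 | 3.8 | 119.5 | 39.0 | 3.1 |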
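The comparison made in Table 8 amounts to a few divisions, sketched below in Python. The threshold of 5 is the multiplier quoted in the text for a design effect ratio of 0.5; the dictionary layout and names are illustrative.

```python
# (estimate of the misspecification component, sigma-hat squared) from Table 8
TABLE_8 = {
    ("Mathematics", "mean"): (101.3, 28.8),
    ("Mathematics", "p90"): (82.1, 35.5),
    ("Science", "mean"): (113.7, 30.3),
    ("Science", "p90"): (119.5, 39.0),
}
EXPECTED_MULTIPLIER = 5.0  # smallest plausible multiplier quoted in the text

ratios = {key: estimate / sigma2 for key, (estimate, sigma2) in TABLE_8.items()}
below_expectation = all(r < EXPECTED_MULTIPLIER for r in ratios.values())

for key, ratio in ratios.items():
    print(key, round(ratio, 1))
print("all ratios below the expected multiplier:", below_expectation)
```

Since every ratio falls below even the smallest plausible multiplier, the observed variability across subpopulations is no larger than sampling and imputation noise alone would produce.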

### 5.4 Component of $\mathrm{Var}(\hat{y})$ Due to Temporal Shift

One disadvantage of using the actual TIMSS and NAEP data to construct a link is that TIMSS and NAEP were administered in different years. Any procedure that attempts to link 1996 NAEP scores to 1995 TIMSS scores, based only on the 1995 TIMSS and the 1996 NAEP samples, will suffer from an unavoidable confounding of secular change (the within-instrument change in achievement over time) with effects due to differences between the instruments.

Estimation of the temporal effect of linking 1996 data to 1995 data is problematic, since there is no direct measure of the change in either NAEP or TIMSS measures of achievement between the 2 years. It is possible, by using related data (the NAEP long-term trend data from 1994 and 1996), to estimate the potential change in achievement as measured by NAEP between 1995 and 1996. As in every other case, it is impossible to estimate what the change in achievement would be in the TIMSS countries in 1996.

Adjustment for temporal trend would potentially adjust $\hat\mu_N$ of the linking function by a prediction of the difference, $\Delta$, between the NAEP mean in 1996 and what the mean would have been in 1995. This difference is estimated by

(16)

where the quantities entering Equation (16) are the mean and standard deviation from the 1996 NAEP long-term trend assessment and the equivalent values from the 1994 long-term trend assessment. The second term in Equation (16) adjusts for the fact that the standard deviations for the main NAEP assessments differ from those for the long-term trend assessments. The square of Equation (16), $\Delta^2$, is added to the variance of $\hat\mu_N$ in the estimate of the variance of the linking function. Since the variance of $\hat\mu_N$ is multiplied by $\hat{B}^2$ in Equation (5), the value of this component of $\mathrm{Var}(\hat{y})$ is $\hat{B}^2\Delta^2$ and is constant for all $x$. The values of this component for the two subjects are shown in Table 9.

Table 9. Value of the component of $\mathrm{Var}(\hat{y})$ due to temporal shift for grade 8

| Subject | $\hat{B}^2\Delta^2$ |
|---|---|
| Mathematics | 0.000 |
| Science | 0.942 |