Potential Bias in Analysis Results Using Variables Not Included in the Model

An essential part of the NAEP methodology for estimating population characteristics is that consistent, unbiased estimates of proficiency means and dispersions are computed for populations identified by the variables in the conditioning model. However, statistics based on variables not included in the conditioning model are subject to asymptotic (secondary) biases. The magnitude of these biases depends on several factors:

  • type of statistic;
  • amount of cognitive information (number of items responded to) available per student on a (sub)scale;
  • strength of the relationships between subscales in multivariate contexts; and
  • background variable of interest that is not in the conditioning model and the strength of the relationship between this variable and
    • other background variables that were included in the population-structure model, and
    • proficiency.

The bias typically results in an underestimate of the effect of the variables not included in the population-structure model. For details and derivations, see Beaton and Johnson (1990), Mislevy (1991), and Mislevy and Sheehan (1987). If a large amount of cognitive information is available per student, the proficiency estimates depend little on the population-structure model and biases will be small. If the variables already in the model account for most of the variability of a variable that was not included, biases are relatively small as well. However, this is not true for all types of analyses. In particular:

  • High shared variance between background variables in the model and those not in the model mitigates biases in analyses that involve only scale scores and variables not in the model, such as marginal means or regressions.

  • High shared variance exacerbates biases in regression coefficients of conditional effects for variables not in the model, when background variables in the model and those not in the model are analyzed jointly as in multiple regression.
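The first of these points, together with the role of cognitive information, can be illustrated with a small simulation. The sketch below is not the NAEP procedure: it assumes a single scale, a normal proficiency model, one background variable x in the conditioning model and one variable z left out of it, and a single observed score with normal measurement error in place of item responses. The function name and all parameter values are hypothetical. Under these assumptions, the simple-regression slope of a plausible value on z falls short of the true slope, and the shortfall shrinks as the measurement error decreases or as the correlation between x and z increases.

```python
import numpy as np

def simple_regression_slopes(c, error_var, a=0.5, b=0.5, n=200_000, seed=0):
    """Return (true_slope, plausible_value_slope) for regressing proficiency on z.

    Illustrative assumptions (not from the NAEP documentation):
      x          background variable included in the conditioning model
      z          background variable NOT in the model, corr(x, z) = c
      theta      true proficiency = a*x + b*z + unit-variance noise
      y          observed score = theta + noise with variance error_var
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n)
    z = c * x + np.sqrt(1 - c**2) * rng.standard_normal(n)
    theta = a * x + b * z + rng.standard_normal(n)
    y = theta + np.sqrt(error_var) * rng.standard_normal(n)

    # Conditioning model uses x only: theta | x ~ N(g * x, tau2).
    g = a + b * c
    tau2 = b**2 * (1 - c**2) + 1.0

    # Normal posterior of theta given y and x; rho is the weight on the data.
    rho = tau2 / (tau2 + error_var)
    post_mean = rho * y + (1 - rho) * g * x
    post_var = (1 - rho) * tau2

    # One plausible value per student: a random draw from the posterior.
    pv = post_mean + np.sqrt(post_var) * rng.standard_normal(n)

    def slope_on_z(v):
        return np.cov(v, z)[0, 1] / np.var(z, ddof=1)

    return slope_on_z(theta), slope_on_z(pv)

if __name__ == "__main__":
    for c, error_var in [(0.0, 1.0), (0.8, 1.0), (0.0, 0.1)]:
        true_slope, pv_slope = simple_regression_slopes(c, error_var)
        rel_bias = (true_slope - pv_slope) / true_slope
        print(f"corr={c:.1f} error_var={error_var:.1f} relative bias={rel_bias:.3f}")
```

In this simplified setting, the expected slope from plausible values is ac + b(ρ + (1 − ρ)c²), compared with the true value ac + b, where ρ is the posterior weight on the observed score. The second point, about conditional effects in multiple regression, is not covered by this sketch.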

The use of plausible values and the large number of background variables that are included in the population-structure models in NAEP allow a large number of secondary analyses to be carried out with little or no bias, and mitigate biases in analyses of the marginal distributions of θ for variables not in the model. An analysis of the 1988 NAEP reading data (some results of which are summarized in Mislevy 1991), whose population-structure model had fewer variables than most current NAEP models, indicates that the potential bias for variables not in the model is below 10 percent in multiple regression analyses and below 5 percent in simple regressions on such variables.

Additional research (summarized in Mislevy 1990) indicates that most of the bias reduction obtained by using a large number of variables in the models can be captured instead by using the first several principal components of the matrix of all original variables in the model. This procedure was first adopted for the 1992 national main assessments by replacing the variables that define group membership with the first K principal components, where K was selected so that 90 percent of the total variance of the full set of variables (after standardization) was captured. Mislevy (1990) shows that this puts an upper bound of 10 percent on the average bias for all analyses involving the original group membership variables.
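The selection of K can be sketched as follows. This is a minimal illustration, not the operational NAEP code: the function name, the use of an eigendecomposition of the correlation matrix, and the input layout (one row per student, one column per coded background variable, no constant columns) are all assumptions made for the example.

```python
import numpy as np

def principal_components_for_conditioning(X, share=0.90):
    """Return (scores, k, cumulative_share) for standardized columns of X.

    k is the smallest number of principal components whose cumulative
    share of the total variance reaches `share` (0.90 in the text).
    Hypothetical helper; assumes no column of X is constant.
    """
    X = np.asarray(X, dtype=float)

    # Standardize each variable to mean 0 and unit variance.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

    # Principal components of standardized data come from the correlation matrix.
    corr = np.corrcoef(Z, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)

    # eigh returns eigenvalues in ascending order; sort them descending.
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    # Cumulative proportion of total variance captured by the first components.
    cumulative_share = np.cumsum(eigvals) / eigvals.sum()

    # First index where the cumulative share reaches the target, as a count.
    k = int(np.searchsorted(cumulative_share, share)) + 1
    k = min(k, len(eigvals))

    scores = Z @ eigvecs[:, :k]
    return scores, k, cumulative_share

if __name__ == "__main__":
    # Made-up data: 10 correlated variables driven by 3 latent factors.
    rng = np.random.default_rng(1)
    factors = rng.standard_normal((500, 3))
    loadings = rng.standard_normal((3, 10))
    X = factors @ loadings + 0.5 * rng.standard_normal((500, 10))
    scores, k, cum = principal_components_for_conditioning(X)
    print(k, scores.shape, round(float(cum[k - 1]), 3))
```

The returned scores would then stand in for the original variables in the conditioning model. Working from the correlation matrix is equivalent to standardizing first, which matches the "after standardization" condition in the text.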

Last updated 06 November 2008 (GF)
