PEDAR: Research Methodology  First-Generation Students in Postsecondary Education: A Look at Their College Transcripts
 Statistical Procedures: Multivariate Commonality Analysis

There are many ways for members of the public and other researchers to make use of NCES results. The most popular way is to read the written reports. (Other ways include obtaining and analyzing public use and restricted use data files. These allow researchers to carry out and publish their own secondary analyses of NCES data.)

It is very important when reading NCES reports to remember that they are descriptive in nature. That is, they are limited to describing some aspect of the condition of education. These results are usefully viewed as suggesting various ideas to be further examined in light of other data, including state and local data, and in the context of the large research literature elaborating on the many factors predicting and contributing to educational achievement or to other outcome variables of interest.

However, some readers are tempted to make unwarranted causal inferences from simple cross tabulations. It is never the case that a simple cross tabulation of any variable with a measure of educational achievement is conclusive proof that differences in that variable are a cause of differential educational achievement or that differences in that variable explain any other outcome variable. The old adage that “correlation is not causation” is a wise precaution to keep in mind when considering the results of NCES reports. Experienced researchers are aware of the design limitations of many NCES data collections. They routinely formulate multiple hypotheses that take these limitations into account, and readers of this volume are encouraged to do likewise. As part of the Institute of Education Sciences, NCES has a responsibility to try to discourage misleading inferences from the data presented and to educate the public on the genuine difficulty of making valid causal inferences in a field as complex as education. Our reports are carefully worded to achieve this end.

This focus on description, eschewing causal analysis, extends to multivariate analyses as well as bivariate ones. Some NCES reports go beyond presenting simple cross tabulations and present results from multiple regression equations that include many different independent (“predictor”) variables. This can be useful to readers, especially those without the time or training to access the data on their own. Because many of the independent variables included in descriptive reports are related to each other and to the outcome they are predicting, a multivariate approach can help users to understand their interrelation. For example, students’ generation status and delayed enrollment are associated with each other and are each predictors of bachelor’s degree attainment. What happens to the relationship between students’ generation status and bachelor’s degree attainment when delayed enrollment differences are accounted for? This question cannot be answered using bivariate techniques alone.

One way of answering the question is to create three-variable tabulations. This method is sometimes used in NCES reports. When the number of independent variables increases to four or more, however, the number of cases in individual cells of such a table often becomes too small for differences to reach statistical significance. To make economical use of the many available independent variables in the same data display, other statistical methods must be used that can take multiple independent variables into account simultaneously.

Multiple linear regression is often used for this purpose: to adjust for the common variation among a list of independent variables.7 This approach is sometimes referred to as commonality analysis,8 because it identifies lingering relationships after adjustment for “common” variation. This method is used simply to confirm statistically significant associations observed in the bivariate analysis while taking into account the interrelationship of the independent variables.

Thus, this multiple regression approach is descriptive. Significant coefficients reported in the regression tables indicate that when the variable is deleted from (or added to) the set of independent variables, it results in a non-zero change in R-squared, which is the basis of the commonality analysis. In other words, a significant coefficient means that the independent variable has a relationship with the outcome variable that is unique, or distinct from its relationship with other independent variables in the model.
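The change in R-squared described above can be illustrated for the simplest case of two independent variables. The following is a minimal sketch, not the procedure used to produce the report's tables: the function names and the three correlation values are hypothetical, chosen only to show how the R-squared of a two-predictor regression splits into a part unique to each variable and a part the two variables share.

```python
def r_squared_two_predictors(r_y1, r_y2, r_12):
    """R-squared for an outcome regressed on X1 and X2, from correlations only.

    r_y1, r_y2: correlation of each predictor with the outcome.
    r_12: correlation between the two predictors (must not be +1 or -1).
    """
    return (r_y1 ** 2 + r_y2 ** 2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12 ** 2)


def commonality(r_y1, r_y2, r_12):
    """Split the full-model R-squared into unique and common components."""
    full = r_squared_two_predictors(r_y1, r_y2, r_12)
    # Unique part of X1: the drop in R-squared when X1 is deleted,
    # leaving a one-predictor model whose R-squared is r_y2 squared.
    unique_1 = full - r_y2 ** 2
    unique_2 = full - r_y1 ** 2
    # Whatever is not unique to either predictor is shared by both.
    common = full - unique_1 - unique_2
    return {"r_squared": full, "unique_1": unique_1,
            "unique_2": unique_2, "common": common}


# Hypothetical correlations, for illustration only (not NELS:88 estimates).
parts = commonality(r_y1=0.30, r_y2=0.40, r_12=0.50)
for name, value in parts.items():
    print(f"{name}: {value:.4f}")
```

A nonzero unique component for a variable corresponds to the non-zero change in R-squared that a significant coefficient indicates.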

Multivariate description of this sort is distinct from either a modeling approach in which an analyst attempts to identify the smallest relevant set of causal or explanatory independent variables associated with the dependent variable or variables or an approach using one of the many varieties of structural equation modeling. In contrast, a multivariate descriptive or commonality approach provides a richer understanding of the data without needing to make any kind of causal assumptions, which is why descriptive multivariate commonality analysis is often employed in NCES statistical reports.

When should commonality analysis be employed? It should be used in statistical analysis reports when independent variables are correlated both with the outcome variable and with each other. This will allow the analyst to determine how much of the effect of one independent variable is due to the influence of other independent variables, since in a multiple regression procedure these effects are adjusted for. For example, since the strength of the statistical relationship between students’ generation status and bachelor’s degree attainment may be affected by time of enrollment, computing a multiple regression equation that contains both variables allows the analyst to determine how much, if any, of the difference in bachelor’s degree attainment between first-generation students and other students is due to differences in the time of enrollment.

As discussed in the “Data Analysis System” section, all analyses included in PEDAR reports must be based on the DAS. Exclusively using the DAS in this way provides readers direct access to the findings and methods used in the report so that they may replicate or expand on the estimates presented. However, the DAS does not allow users access to the raw data, which limits the range of covariation procedures that can be used. Specifically, the DAS produces correlation matrices, which can be used as input in standard statistical packages to produce least squares regression models. This means that logit or probit procedures, which are more appropriate for dichotomous dependent variables, cannot be used.9 However, empirical studies have shown that when the mean value of a dichotomous dependent variable falls between 0.25 and 0.75 (as it does in this analysis), regression and log-linear models are likely to produce similar results.10

The independent variables analyzed in this study and subsequently included in the multivariate model were chosen based largely on earlier empirical studies (cited in the text), which showed significant associations with the key analytic variable, bachelor’s degree attainment. Before conducting the study, a detailed analysis plan was reviewed by a Technical Review Panel (TRP) of experts in the field of higher education research and additional independent variables requested by the TRP were considered for inclusion. The analysis plan listed all the independent variables to be included in the study. The TRP also reviewed the preliminary results as well as the first draft of this report. The analysis plan and subsequent report were modified based on TRP comments and criticism.

Missing Data and Adjusting for Complex Sample Design

The DAS computes the correlation matrix using pairwise deletion of missing values. In regression analysis, there are several common approaches to the problem of missing data. The two simplest approaches are pairwise deletion of missing data and listwise deletion of missing data. In pairwise deletion, each correlation is calculated using all of the cases with nonmissing values on the two relevant variables. For example, suppose you have a regression analysis that uses variables X1, X2, and X3. The regression is based on the correlation matrix between X1, X2, and X3. In pairwise deletion, the correlation between X1 and X2 is based on the nonmissing cases for X1 and X2. Only cases missing on either X1 or X2 would be excluded from the calculation of that correlation. In listwise deletion, the correlation between X1 and X2 would be based on the cases with nonmissing values for X1, X2, and X3. That is, all of the cases with missing data on any of the three variables would be excluded from the analysis.
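The difference between the two deletion rules can be sketched with a small example. The code below is an illustration only and does not reproduce DAS internals: the seven cases are hypothetical, `None` stands for a missing value, and the helper names are assumptions made for this sketch.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def pairwise_corr(rows, i, j):
    """Use every case that is nonmissing on variables i and j."""
    kept = [r for r in rows if r[i] is not None and r[j] is not None]
    return pearson([r[i] for r in kept], [r[j] for r in kept]), len(kept)


def listwise_corr(rows, i, j):
    """Use only cases that are nonmissing on all variables."""
    kept = [r for r in rows if all(v is not None for v in r)]
    return pearson([r[i] for r in kept], [r[j] for r in kept]), len(kept)


# Hypothetical cases (X1, X2, X3); None marks a missing value.
cases = [
    (1.0, 2.0, 3.0),
    (2.0, 1.0, None),
    (3.0, 4.0, 1.0),
    (4.0, 3.0, None),
    (5.0, 6.0, 4.0),
    (None, 5.0, 2.0),
    (6.0, 5.0, 5.0),
]

r_pair, n_pair = pairwise_corr(cases, 0, 1)
r_list, n_list = listwise_corr(cases, 0, 1)
print(f"pairwise: r = {r_pair:.3f} from {n_pair} cases")
print(f"listwise: r = {r_list:.3f} from {n_list} cases")
```

In this sketch the pairwise correlation between X1 and X2 keeps the two cases that are missing only on X3, while the listwise correlation drops them along with the case missing on X1.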

The correlation matrix produced by the DAS can be used by most statistical software packages as the input data for least squares regression. The DAS provides either the SPSS or SAS code necessary to run least squares regression models. The DAS also provides additional information to incorporate the complex sample design into the statistical significance tests of the parameter estimates. Most statistical software packages assume simple random sampling when computing standard errors of parameter estimates. Because of the complex sampling design used for the survey, this assumption is incorrect. A better approximation of the standard errors is obtained by multiplying each standard error by the design effect associated with the dependent variable (DEFT),11 where the DEFT is the ratio of the true standard error to the standard error computed under the assumption of simple random sampling. The DEFT is calculated by the DAS and displayed with the correlation matrix output.
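The DEFT adjustment amounts to a single multiplication, sketched below. The coefficient of 8.1 is the value this report cites from table 15, but the simple-random-sampling standard error, the DEFT, and the 1.96 critical value are hypothetical inputs chosen for illustration. They are not values produced by the DAS, and the function name is an assumption made for this sketch.

```python
def design_adjusted_test(coefficient, srs_standard_error, deft, critical_value=1.96):
    """Inflate a simple-random-sampling standard error by the DEFT.

    Returns the adjusted standard error, the resulting t statistic, and
    whether the coefficient is significant at the given critical value.
    """
    adjusted_se = srs_standard_error * deft
    t_statistic = coefficient / adjusted_se
    return adjusted_se, t_statistic, abs(t_statistic) > critical_value


adjusted_se, t_statistic, significant = design_adjusted_test(
    coefficient=8.1,         # coefficient cited from table 15
    srs_standard_error=2.0,  # hypothetical value from a statistical package
    deft=1.5,                # hypothetical design effect
)
print(f"adjusted SE = {adjusted_se:.2f}, t = {t_statistic:.2f}, significant = {significant}")
```

Because the DEFT is typically greater than 1 for a complex sample, the adjusted standard error is larger and the significance test more conservative than the unadjusted package output would suggest.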

Interpreting the Results

The least squares regression coefficients displayed in the regression tables in this report are expressed as percentages. Significant coefficients represent the observed differences that remain between the analysis group (such as students whose parents had a bachelor’s or higher degree) and the comparison group (i.e., first-generation students) after controlling for the relationships of all the selected independent variables. For example, in table 15, the least squares coefficient for students whose parents had a bachelor’s or higher degree is 8.1. This means that, compared to first-generation students, the percentage of the group whose parents had a bachelor’s or higher degree that would be expected to attain a bachelor’s degree is roughly 8 percentage points higher, after controlling for the relationships among all the other independent variables.
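The reading of a coefficient as a difference in percentages can be seen most easily when the regression has a single 0/1 independent variable and no other controls. The sketch below covers only that simplified case, with ten hypothetical students. It is not the adjusted model behind table 15, and the variable names are assumptions made for this sketch.

```python
def ols_slope(xs, ys):
    """Least squares slope of y on a single predictor x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Hypothetical data: group = 0 for the comparison group, 1 for the analysis
# group; attained = 1 if the student attained a bachelor's degree.
group = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
attained = [1, 0, 0, 0, 1, 1, 1, 0, 1, 0]

coefficient_pct = 100 * ols_slope(group, attained)

# The same quantity computed directly as a difference in group percentages.
pct_comparison = 100 * sum(a for g, a in zip(group, attained) if g == 0) / group.count(0)
pct_analysis = 100 * sum(a for g, a in zip(group, attained) if g == 1) / group.count(1)

print(f"coefficient: {coefficient_pct:.1f}")
print(f"difference in percentages: {pct_analysis - pct_comparison:.1f}")
```

With additional independent variables in the model, the coefficient is no longer the raw difference between the two groups but the difference that remains after adjustment, which is how the coefficients in the regression tables are meant to be read.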