The following describes several statistical procedures used in this report.
The descriptive comparisons in this report were tested using Student’s t statistic. Differences between estimates are tested against the probability of a Type I error,⁴ or significance level. The significance of each group difference was determined by calculating the Student’s t values for the differences between each pair of means or proportions and comparing these with published tables of significance levels for two-tailed hypothesis testing (p < .05).
Student’s t values may be computed to test the difference between estimates with the following formula:
$$t = \frac{E_1 - E_2}{\sqrt{se_1^2 + se_2^2}}$$
where E1 and E2 are the estimates to be compared and se1 and se2 are their corresponding standard errors. This formula is valid only for independent estimates. When estimates are not independent, a covariance term must be added to the formula:
$$t = \frac{E_1 - E_2}{\sqrt{se_1^2 + se_2^2 - 2\,r\,se_1\,se_2}}$$
where r is the correlation between the two estimates.⁵ This formula is used when comparing two percentages from a distribution that adds to 100. If the comparison is between the mean of a subgroup and the mean of the total group, the following formula is used:
$$t = \frac{E_{sub} - E_{tot}}{\sqrt{se_{sub}^2 + se_{tot}^2 - 2\,p\,se_{sub}^2}}$$
where p is the proportion of the total group contained in the subgroup.⁶ The estimates, standard errors, and correlations can all be obtained from the DAS.
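The three tests described above can be sketched in Python as follows. This is a minimal illustration, not code taken from the DAS. It assumes the conventional forms of the three statistics (difference of the estimates divided by the standard error of that difference), and the function names, the large-sample critical value of 1.96, and all input numbers are hypothetical.

```python
import math

def t_independent(e1, se1, e2, se2):
    """t for the difference between two independent estimates."""
    return (e1 - e2) / math.sqrt(se1**2 + se2**2)

def t_correlated(e1, se1, e2, se2, r):
    """t for two non-independent estimates; r is their correlation."""
    return (e1 - e2) / math.sqrt(se1**2 + se2**2 - 2 * r * se1 * se2)

def t_subgroup_vs_total(e_sub, se_sub, e_tot, se_tot, p):
    """t for a subgroup versus the total group that contains it;
    p is the proportion of the total group that falls in the subgroup."""
    return (e_sub - e_tot) / math.sqrt(se_sub**2 + se_tot**2 - 2 * p * se_sub**2)

# Assumption: large-sample two-tailed critical value for p < .05.
CRITICAL_T = 1.96

def is_significant(t, critical=CRITICAL_T):
    """Two-tailed test: compare |t| with the critical value."""
    return abs(t) > critical

# Hypothetical estimates (percentages) and standard errors.
t = t_independent(60.0, 2.0, 50.0, 3.0)
print(round(t, 2), is_significant(t))
```

With small samples, the critical value should be taken from a t table for the appropriate degrees of freedom rather than fixed at 1.96.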
There are some hazards in using statistical tests for each comparison. First, comparisons based on large t statistics may appear to merit special attention. This can be misleading since the magnitude of the t statistic is related not only to the observed differences in means or percentages, but also to the number of respondents in the specific categories used for comparison. Hence, a small difference compared across a large number of respondents would produce a large t statistic.
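The dependence of the t statistic on the number of respondents can be illustrated with a short sketch. It assumes two independent proportions estimated under simple random sampling with equal group sizes, which is a simplification of the complex design used in the survey. The proportions and group sizes are hypothetical.

```python
import math

def t_for_difference(p1, p2, n_per_group):
    """t for two independent proportions, each based on n_per_group cases.
    Assumption: simple random sampling, so se = sqrt(p * (1 - p) / n)."""
    se1 = math.sqrt(p1 * (1 - p1) / n_per_group)
    se2 = math.sqrt(p2 * (1 - p2) / n_per_group)
    return (p1 - p2) / math.sqrt(se1**2 + se2**2)

# The same 2-percentage-point difference at increasing group sizes.
for n in (100, 1_000, 10_000, 100_000):
    print(n, round(t_for_difference(0.52, 0.50, n), 2))
```

Because the standard errors shrink with the square root of the group size, the t statistic for a fixed difference grows in proportion to that square root.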
A second hazard in using statistical tests is the possibility of a “false positive” or Type I error. In the case of a t statistic, this false positive would result when a difference between groups measured with a particular sample showed a statistically significant difference when there is actually no difference between these groups in the full population. The significance level, or alpha, of .05 selected for findings discussed as significant in this report indicates that a difference of the magnitude reported would be produced by chance no more than one time out of 20 with samples of the size used in this study when there was no actual difference in the group means in the full population.
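The meaning of the .05 significance level can be checked with a small simulation. The sketch below repeatedly draws two samples from the same population, so that no true difference exists, and counts how often the t test nevertheless flags a significant difference. The population (standard normal), sample size, number of trials, seed, and the large-sample critical value of 1.96 are all assumptions made for illustration.

```python
import math
import random

def mean_and_se(xs):
    """Sample mean and its standard error under simple random sampling."""
    n = len(xs)
    mean = sum(xs) / n
    variance = sum((x - mean) ** 2 for x in xs) / (n - 1)
    return mean, math.sqrt(variance / n)

def false_positive_rate(trials=2000, n=200, critical=1.96, seed=12345):
    """Share of comparisons flagged as significant when both samples
    come from the same population (no true difference)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        a = [rng.gauss(0.0, 1.0) for _ in range(n)]
        b = [rng.gauss(0.0, 1.0) for _ in range(n)]
        mean_a, se_a = mean_and_se(a)
        mean_b, se_b = mean_and_se(b)
        t = (mean_a - mean_b) / math.sqrt(se_a**2 + se_b**2)
        if abs(t) > critical:
            hits += 1
    return hits / trials

rate = false_positive_rate()
print(f"share of null comparisons flagged as significant: {rate:.3f}")
```

The reported share should be close to one in 20, although it varies somewhat with the seed and the number of trials.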
There are many ways for members of the public and other researchers to make use of NCES results. The most popular way is to read the written reports. Other ways include obtaining and analyzing public use and restricted use data files, which allow researchers to carry out and publish their own secondary analyses of NCES data.
It is very important when reading NCES reports to remember that they are descriptive in nature. That is, they are limited to describing some aspect of the condition of education. These results are usefully viewed as suggesting various ideas to be examined further in light of other data, including state and local data, and in the context of the extensive research literature elaborating on the many factors predicting and contributing to educational achievement or to other outcome variables of interest.
However, some readers are tempted to make unwarranted causal inferences from simple cross tabulations. It is never the case that a simple cross tabulation of any variable with a measure of educational achievement is conclusive proof that differences in that variable are a cause of differential educational achievement or that differences in that variable explain any other outcome variable. The old adage that “correlation is not causation” is a wise precaution to keep in mind when considering the results of NCES reports. Experienced researchers are aware of the design limitations of many NCES data collections. They routinely formulate multiple hypotheses that take these limitations into account, and readers of this volume are encouraged to do likewise. NCES has a responsibility to try to discourage misleading inferences from the data presented and to educate the public on the genuine difficulty of making valid causal inferences in a field as complex as education. Our reports are carefully worded to achieve this end.
This focus on description, eschewing causal analysis, extends to multivariate analyses as well as bivariate ones. Some NCES reports go beyond presenting simple cross tabulations and present results from multiple regression equations that include many different independent (“predictor”) variables. This can be useful to readers, especially those without the time or training to access the data themselves. Because many of the independent variables included in descriptive reports are related to each other and to the outcome they are predicting, a multivariate approach can help users to understand their interrelation. For example, students’ enrollment intensity and employment while enrolled are associated with each other and are both predictors of degree attainment. What happens to the relationship between students’ enrollment intensity and degree attainment when students’ employment differences are accounted for? Such a question cannot be answered using bivariate techniques alone.
One way to answer the question is to create three-variable tabulations, a method sometimes used in NCES reports. When the number of independent variables increases to four or more, however, the number of cases in individual cells of such a table often becomes too small for differences to reach statistical significance. To make economical use of the many available independent variables in the same data display, other statistical methods must be used that can take multiple independent variables into account simultaneously.
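The shrinking of cell sizes can be seen with simple arithmetic. The sketch below assumes a hypothetical sample of 9,000 cases and three categories per variable, and it reports the average number of cases per cell if cases were spread evenly across the table. Actual cells are usually uneven, so the smallest cells would be smaller still.

```python
from math import prod

def average_cell_size(n_cases, levels_per_variable):
    """Average cases per cell in a full cross tabulation."""
    return n_cases / prod(levels_per_variable)

N_CASES = 9_000  # hypothetical sample size
for k in range(2, 6):
    levels = [3] * k  # hypothetical: three categories per variable
    print(k, "variables:", round(average_cell_size(N_CASES, levels), 1), "cases per cell")
```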
Multiple linear regression is often used for this purpose: to adjust for the common variation among a list of independent variables.⁷ This approach is sometimes referred to as “commonality analysis,”⁸ because it identifies relationships that remain after adjustment for “common” variation. This method is used simply to confirm statistically significant associations observed in the bivariate analysis, while taking into account the interrelationship of the independent variables.
Thus, this multiple regression approach is descriptive. Significant coefficients reported in the regression tables mean that the independent variables have a relationship with the outcome variable that is unique, or distinct from its relationship with other independent variables in the model.
Multivariate description of this sort is distinct from both a modeling approach in which an analyst attempts to identify the smallest relevant set of causal or explanatory independent variables associated with the dependent variable or variables and an approach using one of the many varieties of structural equation modeling. In contrast, a multivariate descriptive or commonality approach provides a richer understanding of the data without needing to make any kind of causal assumptions, which is why descriptive multivariate commonality analysis is often used in NCES statistical reports.
When should commonality analysis be employed? It should be used in statistical analysis reports when independent variables are correlated both with the outcome variable and with each other. This allows the analyst to determine how much of the apparent effect of one independent variable is due to the influence of other independent variables, because a multiple regression procedure adjusts for these effects. For example, because the strength of the statistical relationship between students’ enrollment intensity and degree attainment may be affected by employment, computing a multiple regression equation that contains both variables allows the analyst to determine how much, if any, of the difference in degree attainment between full-time and part-time students is due to their differences in employment.
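For the two-predictor case, the adjustment can be worked out directly from three correlations. The sketch below uses the standard closed-form expression for standardized least squares coefficients with two predictors. The correlation values are hypothetical and are not estimates from this report.

```python
def standardized_betas(r_y1, r_y2, r_12):
    """Standardized least squares coefficients for two predictors,
    computed from the three pairwise correlations."""
    denom = 1 - r_12**2
    beta1 = (r_y1 - r_y2 * r_12) / denom
    beta2 = (r_y2 - r_y1 * r_12) / denom
    return beta1, beta2

# Hypothetical correlations (not taken from the report):
r_attain_intensity = 0.30    # attainment with full-time enrollment
r_attain_employment = -0.25  # attainment with hours worked
r_intensity_employment = -0.40

beta_intensity, beta_employment = standardized_betas(
    r_attain_intensity, r_attain_employment, r_intensity_employment
)
print(f"unadjusted association: {r_attain_intensity:.3f}")
print(f"adjusted for employment: {beta_intensity:.3f}")
```

In this hypothetical case the association between enrollment intensity and attainment is smaller after adjustment, which indicates that part of the bivariate relationship reflects differences in employment. When the two predictors are uncorrelated, the adjusted and unadjusted values coincide.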
As discussed in the Data Analysis System section above, all analyses included in PEDAR reports must be based on the DAS, which is available to the public online (http://nces.ed.gov/das). Relying exclusively on the DAS in this way gives readers direct access to the findings and methods used in the report so that they may replicate or expand on the estimates presented. However, the DAS does not allow users access to the raw data, which limits the range of covariation procedures that can be used. Specifically, the DAS produces correlation matrices, which can be used as input in standard statistical packages to produce least squares regression models. This means that logit or probit procedures, more appropriate for dichotomous dependent variables, cannot be used.⁹ However, empirical studies have shown that when the mean value of a dichotomous dependent variable falls between 0.25 and 0.75, regression and log-linear models are likely to produce similar results.¹⁰
Regressions were conducted for three dependent variables in this report: completing any degree, completing a bachelor’s degree, and persisting overall. For completing any degree by 2001, the overall rate is 51 percent (64 percent for exclusively full-time students, 45 percent for part-time students who looked like full-time students, and 34 percent for other part-time students) (table 12). For completing a bachelor’s degree, the overall rate is 29 percent (44 percent for exclusively full-time students, 25 percent for part-time students who looked like full-time students, and 7 percent for other part-time students), and for overall persistence, the overall rate is 65 percent (72 percent for exclusively full-time students, 69 percent for part-time students who looked like full-time students, and 52 percent for other part-time students). With one exception, all values are within the acceptable limits described above.
The exception is the bachelor’s degree completion rate for other part-time students (7 percent); thus, the regression estimates on this dependent variable for this group were omitted from table 15.
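The check against the 0.25 to 0.75 range can be reproduced with the rates quoted above. The sketch below expresses the limits in percentage terms and treats them as inclusive, which is an assumption. The dictionary layout and names are illustrative only.

```python
# Rates (percent) as quoted in the text for each dependent variable.
RATES = {
    "completing any degree": {
        "overall": 51,
        "exclusively full-time": 64,
        "part-time, looked like full-time": 45,
        "other part-time": 34,
    },
    "completing a bachelor's degree": {
        "overall": 29,
        "exclusively full-time": 44,
        "part-time, looked like full-time": 25,
        "other part-time": 7,
    },
    "persisting overall": {
        "overall": 65,
        "exclusively full-time": 72,
        "part-time, looked like full-time": 69,
        "other part-time": 52,
    },
}

def within_limits(percent, low=25, high=75):
    """True if the rate lies in the range where least squares and
    log-linear results are expected to be similar (inclusive limits
    are an assumption)."""
    return low <= percent <= high

outside = [
    (outcome, group)
    for outcome, groups in RATES.items()
    for group, percent in groups.items()
    if not within_limits(percent)
]
print(outside)
```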
The independent variables analyzed in this study and subsequently included in the multivariate model were chosen based largely on earlier empirical studies (cited in the text), which showed significant associations with the key analytic variable, graduate enrollment, persistence, and attainment. Before conducting the study, a detailed analysis plan was reviewed by a Technical Review Panel (TRP) of experts in the field of higher education research, and additional independent variables requested by the TRP were considered for inclusion. The analysis plan listed all independent variables to be included in the study. The TRP also reviewed the preliminary results, as well as the first draft of this report. The analysis plan and subsequent report were modified based on TRP comments.
The DAS computes the correlation matrix using pairwise deletion of missing values. In regression analysis, there are several common approaches to the problem of missing data. The two simplest approaches are pairwise deletion of missing data and listwise deletion of missing data. In pairwise deletion, each correlation is calculated using all of the cases that have data for the two relevant variables. For example, suppose you have a regression analysis that uses variables X1, X2, and X3. The regression is based on the correlation matrix between X1, X2, and X3. In pairwise deletion, the correlation between X1 and X2 is based on the nonmissing cases for X1 and X2. Only cases missing on either X1 or X2 would be excluded from the calculation of that correlation. In listwise deletion, the correlation between X1 and X2 would be based on the cases with nonmissing values for X1, X2, and X3. That is, all of the cases with missing data on any of the three variables would be excluded from the analysis.
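The difference between the two approaches can be shown with a small example. The sketch below computes the correlation between X1 and X2 both ways on a hypothetical data set in which missing values are coded as None. It illustrates the two deletion rules only and is not the DAS implementation.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def correlation(rows, i, j, listwise=False):
    """Correlation between columns i and j, plus the number of cases used.
    Pairwise: keep rows with data on both i and j.
    Listwise: keep only rows with data on every column."""
    if listwise:
        kept = [r for r in rows if all(v is not None for v in r)]
    else:
        kept = [r for r in rows if r[i] is not None and r[j] is not None]
    return pearson([r[i] for r in kept], [r[j] for r in kept]), len(kept)

# Hypothetical cases: (X1, X2, X3); None marks a missing value.
rows = [
    (1.0, 2.0, 1.0),
    (2.0, 1.0, None),
    (3.0, 4.0, 2.0),
    (4.0, 3.0, None),
    (5.0, 6.0, 5.0),
    (None, 5.0, 4.0),
    (6.0, 5.0, 3.0),
]

r_pair, n_pair = correlation(rows, 0, 1)
r_list, n_list = correlation(rows, 0, 1, listwise=True)
print(f"pairwise: r = {r_pair:.3f} from {n_pair} cases")
print(f"listwise: r = {r_list:.3f} from {n_list} cases")
```

Pairwise deletion retains the two cases that are missing only X3, so the X1 and X2 correlation rests on more cases than it would under listwise deletion.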
The correlation matrix produced by the DAS can be used by most statistical software packages as the input data for least squares regression. The DAS provides either the SPSS or SAS code necessary to run least squares regression models. The DAS also provides additional information to incorporate the complex sample design into the statistical significance tests of the parameter estimates. Most statistical software packages assume simple random sampling when computing standard errors of parameter estimates. Because of the complex sampling design used for the survey, this assumption is incorrect. A better approximation of the standard errors can be made by multiplying each standard error by the design effect associated with the dependent variable (DEFT),¹¹ where the DEFT is the ratio of the true standard error to the standard error computed under the assumption of simple random sampling. The DEFT is calculated by the DAS and displayed with the correlation matrix output.
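The DEFT adjustment amounts to a single multiplication per coefficient. The sketch below applies it to one hypothetical coefficient. The coefficient, its simple random sampling standard error, the DEFT value, and the large-sample critical value of 1.96 are all assumptions chosen for illustration.

```python
def adjust_for_design(coefficient, srs_se, deft):
    """Inflate a simple-random-sampling standard error by the DEFT and
    return the adjusted standard error and the resulting t statistic."""
    adjusted_se = srs_se * deft
    return adjusted_se, coefficient / adjusted_se

# Hypothetical regression output, in percentage points.
coefficient = -10.0
srs_se = 4.0   # standard error reported by the software (assumes SRS)
deft = 1.5     # hypothetical design effect for the dependent variable

adjusted_se, adjusted_t = adjust_for_design(coefficient, srs_se, deft)
print(f"t assuming simple random sampling: {coefficient / srs_se:.2f}")
print(f"t after DEFT adjustment: {adjusted_t:.2f}")
```

In this hypothetical case the coefficient would appear significant at the .05 level under simple random sampling but not after the adjustment, which shows why the unadjusted standard errors can overstate significance.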
The least squares regression coefficients displayed in the regression tables B-2 through B-5 are expressed as percentage points. Significant coefficients represent the observed differences that remain between the analysis group (e.g., students whose parents had a high school education) and the comparison group (e.g., students whose parents held graduate degrees) after controlling for the relationships of all selected independent variables. For example, in table 14, the least squares coefficient for exclusively part-time students who looked like full-time students attaining a degree or certificate is –35.6. This means that compared with full-time students, the percentage of exclusively part-time students who looked like full-time students who attained a degree was roughly 36 percentage points lower, after controlling for the relationships among all other independent variables.