
Crosstabular analyses identify important relationships; however, many variables used in crosstabular analyses may be correlated. For example, family income is related to socioeconomic status, which in turn is related to parental education levels. Hence, finding differences among family income groups, differences among socioeconomic status quartile groups, and differences among groups based on parental education levels are not unique findings. To identify the underlying effects for all three correlated variables, multivariate analyses describe the extent of differences in family income groups after adjustments for the relationships with socioeconomic status and parental education.

Multiple linear regression was used to obtain means that were adjusted for covariation among a list of control variables.⁹ Adjusted means for subgroups were obtained by regressing the dependent variable on a set of descriptive variables such as gender, age, employment status, overall rigor of high school coursework, first-generation status, number of remedial courses taken in postsecondary education, and so on. Substituting ones or zeros for the subgroup characteristic(s) of interest and the mean proportions for the other variables results in an estimate of the adjusted proportion for the specified subgroup, holding the other variables in the equation constant.

For example, consider a hypothetical case in which two variables, age and gender, are used to describe an outcome, Y (such as the percentage of students who left their initial institution). The variables age and gender are recoded into a dummy variable representing age, A, and a dummy variable representing gender, G:
The following regression equation is then estimated from the correlation matrix output from the DAS:

$$\hat{Y} = a + \beta_1 A + \beta_2 G \tag{6}$$

To estimate the adjusted mean for any subgroup evaluated at the mean of all other variables, one substitutes the appropriate values for that subgroup's dummy variables (1 or 0) and the mean for the dummy variable(s) representing all other subgroups. For example, suppose Y represents leaving the initial institution and is being described by age (A) and gender (G), coded as shown above. The unadjusted mean values of these two variables are as follows:
Next, suppose the regression equation results are as follows:

To estimate the adjusted value for older students, one substitutes the appropriate parameter estimates and variable values into equation 6.
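The substitution step can be sketched in a few lines of Python. The intercept, coefficients, and gender mean below are hypothetical values, chosen only to be consistent with the adjusted mean of 0.325 discussed in this example; they are not taken from the report's tables. The function name `adjusted_mean` and the coding of the age dummy (A = 1 for older students) are likewise assumptions made for illustration.

```python
def adjusted_mean(intercept, coefficients, values):
    """Evaluate a fitted linear equation: intercept + sum of b_k * x_k."""
    return intercept + sum(coefficients[name] * values[name] for name in coefficients)


# Hypothetical parameter estimates (assumed for illustration only).
intercept = 0.15
coefficients = {"A": 0.17, "G": 0.01}

# Hypothetical unadjusted mean of the gender dummy (assumed for illustration only).
mean_G = 0.521

# Older students: set the age dummy to 1 (assumed coding) and hold the
# gender dummy at its overall mean.
older = adjusted_mean(intercept, coefficients, {"A": 1, "G": mean_G})

print(f"adjusted mean: {older:.3f}")              # adjusted mean: 0.325
print(f"adjusted percentage: {older * 100:.1f}")  # adjusted percentage: 32.5
```

Holding the gender dummy at its mean rather than at 0 or 1 is what makes the result describe older students who otherwise resemble the average student.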
This results in the following equation:

In this case, the adjusted mean for older students is 0.325 and represents the expected outcome for older students who resemble the average student across the other variables (in this example, gender). In other words, the adjusted percentage of older students who left the initial institution, after controlling for age and gender, is 32.5 percent (0.325 × 100 for conversion to a percentage).

It is relatively straightforward to produce a multivariate model using the DAS, since one of the DAS output options is a correlation matrix, computed using pairwise missing values. In regression analysis, there are several common approaches to the problem of missing data. The two simplest are pairwise deletion of missing data and listwise deletion of missing data. In pairwise deletion, each correlation is calculated using all of the cases for the two relevant variables. For example, suppose you have a regression analysis that uses variables X1, X2, and X3. The regression is based on the correlation matrix between X1, X2, and X3. In pairwise deletion, the correlation between X1 and X2 is based on the nonmissing cases for X1 and X2. Cases missing on either X1 or X2 would be excluded from the calculation of the correlation. In listwise deletion, the correlation between X1 and X2 would be based on the nonmissing values for X1, X2, and X3. That is, all of the cases with missing data on any of the three variables would be excluded from the analysis.¹⁰

The correlation matrix can be used by most statistical software packages as the input data for least squares regression. That is the approach used for this report, with an additional adjustment to incorporate the complex sample design into the statistical significance tests of the parameter estimates (described below). For tabular presentation, parameter estimates and standard errors were multiplied by 100 to match the scale used for reporting unadjusted and adjusted percentages.
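The difference between the two deletion rules, and the way a correlation matrix alone can feed a least squares fit, can be illustrated with a small Python sketch. The data are made up, the column layout (Y, X1, X2) and all function names are assumptions for illustration, and the closed-form expression used is the standard one for standardized coefficients with exactly two predictors. This is not the DAS procedure itself.

```python
from math import sqrt


def pearson(pairs):
    """Pearson correlation for a list of (x, y) pairs with no missing values."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    return sxy / sqrt(sxx * syy)


def pairwise_cases(rows, i, j):
    """Pairwise deletion: keep every row in which columns i and j are both present."""
    return [(r[i], r[j]) for r in rows if r[i] is not None and r[j] is not None]


def listwise_cases(rows, i, j):
    """Listwise deletion: keep only rows with no missing value in any column."""
    return [(r[i], r[j]) for r in rows if all(v is not None for v in r)]


def standardized_betas(r_y1, r_y2, r_12):
    """Standardized least squares coefficients for two predictors,
    computed from the three correlations alone."""
    denom = 1 - r_12 ** 2
    return (r_y1 - r_y2 * r_12) / denom, (r_y2 - r_y1 * r_12) / denom


# Hypothetical toy data: columns are Y, X1, X2; None marks a missing value.
rows = [
    (1.0, 0.0, 1.0),
    (0.0, 0.0, 0.0),
    (1.0, 1.0, None),
    (None, 1.0, 1.0),
    (1.0, 1.0, 0.0),
    (0.0, 0.0, 1.0),
    (1.0, None, 1.0),
    (0.0, 1.0, 0.0),
]
Y, X1, X2 = 0, 1, 2

for label, select in (("pairwise", pairwise_cases), ("listwise", listwise_cases)):
    r_y1 = pearson(select(rows, Y, X1))
    r_y2 = pearson(select(rows, Y, X2))
    r_12 = pearson(select(rows, X1, X2))
    beta_1, beta_2 = standardized_betas(r_y1, r_y2, r_12)
    n_used = len(select(rows, X1, X2))
    print(f"{label}: cases for r(X1, X2) = {n_used}, "
          f"beta_1 = {beta_1:.3f}, beta_2 = {beta_2:.3f}")
```

Pairwise deletion uses more cases per correlation, at the cost that different cells of the matrix rest on different subsets of respondents. Converting the standardized coefficients back to the original scale would additionally require the variables' means and standard deviations.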
Most statistical software packages assume simple random sampling when computing standard errors of parameter estimates. Because of the complex sampling design used for the NPSAS survey, this assumption is incorrect. A better approximation of the standard errors is obtained by multiplying each standard error by the design effect associated with the dependent variable (DEFT),¹¹ where the DEFT is the ratio of the true standard error to the standard error computed under the assumption of simple random sampling. The DEFT is calculated by the DAS and produced with the correlation matrix.
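As a rough illustration of this adjustment, the Python sketch below scales each standard error computed under simple random sampling by a single DEFT and recomputes the corresponding t statistics. All numbers are hypothetical, and the function name `design_adjusted` is an assumption for illustration, not part of the DAS.

```python
def design_adjusted(estimates, srs_std_errors, deft):
    """Multiply each SRS-based standard error by DEFT and recompute t statistics."""
    results = {}
    for name, estimate in estimates.items():
        std_error = srs_std_errors[name] * deft
        results[name] = {
            "estimate": estimate,
            "std_error": std_error,
            "t": estimate / std_error,
        }
    return results


# Hypothetical regression output computed under simple random sampling.
estimates = {"A": 0.17, "G": 0.01}
srs_std_errors = {"A": 0.020, "G": 0.015}

# Hypothetical DEFT for the dependent variable.
deft = 1.5

for name, row in design_adjusted(estimates, srs_std_errors, deft).items():
    print(f"{name}: se = {row['std_error']:.4f}, t = {row['t']:.2f}")
```

Because a DEFT greater than 1 inflates every standard error by the same factor, each t statistic shrinks by that factor, so an estimate that appears significant under simple random sampling may no longer be significant after the adjustment.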