Appendix A: Technical Appendix — Logistic Regression Analysis and Imputation Procedures

In chapter 8 of this report, two logistic regression analyses were conducted to explore factors associated with students' immediate enrollment in postsecondary education after high school and their attainment of an associate's or bachelor's degree within 6 years of beginning postsecondary education. Multivariate analyses, such as logistic multiple regression models, provide information on whether group differences in immediate postsecondary enrollment and degree attainment persist after controlling for student, family, and school/institutional characteristics. The analysis for the first model, immediate postsecondary enrollment, was conducted using data from the Education Longitudinal Study of 2002 (ELS:2002), including variables from the base year (2002), first follow-up (2004), and second follow-up (2006). The analysis for the second model, attainment of an associate's or bachelor's degree within 6 years of beginning postsecondary education, was conducted using data from the Beginning Postsecondary Students Longitudinal Study (BPS:04/09), including data from the base year (2004), first follow-up (2006), and second follow-up (2009). Descriptions of the ELS:2002 and BPS:04/09 surveys are provided in the Guide to Sources section of this report.

This technical appendix provides details on the logistic regression models used with the ELS:2002 and BPS:04/09 analysis datasets. In addition, this appendix provides details on the procedures used to impute missing data for key variables used in the ELS:2002 logistic regression model. The BPS:04/09 dataset variables were imputed before release, so no additional imputation procedures were performed. The appendix concludes with a glossary of definitions of the ELS:2002 and BPS:04/09 variables used in the logistic regression models.

Logistic regression procedures

The analyses conducted in chapter 8 employed logistic regression for categorical outcomes, which produces coefficients estimating the relationship between each independent variable and the probability of the dependent outcome. To aid in the interpretation of results, the effect of a change in a given independent variable, X, is transformed into an odds ratio and the percentage likelihood of the dependent outcome. The formula for calculating an odds ratio is

$$\left[\exp(\beta_j)\right]^{c} = \exp(\beta_j c)$$

where exp denotes exponentiation with base e (a constant equal to 2.71828182845904, the base of the natural logarithm), beta equals the logistic regression coefficient (represented in the equation as an exponent), and c equals the number of units of change in X (e.g., 1, 2, 30). For categorical variables, the value of c is set to 1 and the odds ratio equals the exponent of the logistic regression coefficient. Both the ELS:2002 and BPS:04/09 logistic regression analyses were conducted with the SUDAAN-callable procedure "PROC RLOGIST" using SAS, version 9.2. For the ELS:2002 analyses, the multiple imputation option was included in the SUDAAN procedure to include five imputed datasets, which are discussed in the next section ("Imputation procedures for ELS:2002 data"). The ELS:2002 analysis was weighted by the full sample weight (F2BYWT), and standard errors were calculated using balanced repeated replication (BRR) procedures with the replicate weights (F2BYP1 – F2BYP200). The BPS:04/09 analysis was weighted by the full sample weight (WTB000), and standard errors were calculated using BRR procedures with the replicate weights (WTB001 – WTB200).
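The odds ratio formula above can be illustrated with a short Python sketch. The function name and the coefficient value are hypothetical and chosen only for illustration; they are not taken from the report's models.

```python
import math


def odds_ratio(beta: float, c: float = 1.0) -> float:
    """Odds ratio for a c-unit change in X: exp(beta * c), i.e. exp(beta) ** c."""
    return math.exp(beta * c)


# Hypothetical coefficient, not an estimate from the report.
beta_example = 0.63

print(round(odds_ratio(beta_example), 2))        # one-unit change (c = 1)
print(round(odds_ratio(beta_example, c=2), 2))   # two-unit change (c = 2)
```

For a categorical variable, c is 1, so the call reduces to the exponent of the coefficient.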

In the ELS:2002 logistic regression model, the binary dependent variable was an indicator of whether an on-time 2004 high school graduate enrolled immediately in a postsecondary institution (i.e., by December 2004). Multiple categorical and continuous independent variables were entered simultaneously for the regression analysis to allow for the interpretation of relationships between each independent variable and immediate postsecondary enrollment, after controlling for other independent variables included in the model. The independent variables included in the ELS:2002 regression model include student's sex, race/ethnicity, socioeconomic status, family composition (i.e., number of parents/guardians in the household), standardized 10th-grade mathematics test score, 9th-grade GPA, previous grade retention status, sports and extracurricular activities participation status, number of absences from school, number of times skipped classes, parent engagement in discussing coursework with student, number of hours worked per week, and number of close friends who dropped out of high school. Only students who graduated from high school by August 2004 were used in the logistic model for immediate postsecondary enrollment.

In the BPS:04/09 logistic regression model, the binary dependent variable was an indicator of whether a recent high school graduate who began postsecondary enrollment in academic year 2003–04 attained an associate's or bachelor's degree by June 2009 (i.e., within 6 years of entering postsecondary education). Multiple categorical and continuous independent variables were entered simultaneously for the regression analysis to allow for the interpretation of relationships between each independent variable and associate's or bachelor's degree attainment, after controlling for other independent variables included in the model. The independent variables included in the BPS:04/09 regression model include student's sex, race/ethnicity, parents' educational attainment, income quartile in 2004, highest level of high school mathematics, indicators for college-level credits earned in high school, SAT/ACT test taking, control and level of first postsecondary institution, whether the student declared a major during the first year, whether remedial classes were taken in the first year, whether the student met with advisor in the first year, school club and sports participation status in the first year, number of hours worked per week, attendance intensity pattern through 2009 (e.g., always enrolled full time), and number of "stopouts"12 and transfers through 2009. Only students who graduated from high school in the year prior to entering postsecondary education were used in the logistic model for degree attainment.

Associations between student characteristics and the two outcome variables were examined for the full sample as well as separately for males and females; separately for Whites, Blacks, and Hispanics; and separately for males and females within each of these racial/ethnic groups. Multivariate analyses were not conducted for Asians, Native Hawaiians/Pacific Islanders, or American Indians/Alaska Natives due to small sample sizes. Also, for the Black male and female and Hispanic male and female subgroup models, some of the results that appear to be substantive in magnitude are not statistically significant due to small subgroup sample sizes.

The global fit of the full sample and subgroup logistic models was assessed using different diagnostic measures, including chi-squared statistics, pseudo-R-squared values, and measures of the increase in the percentage accuracy in classifying cases on the dichotomous outcome variable based on comparisons between an intercept-only model and a fully specified model. The Likelihood Ratio (LR) Chi-Square test results indicate that the fully specified model is a better fit than the intercept-only model for the full sample and subgroup logistic models. The diagnostic results also indicate that the percentage accuracy in predicting the outcome variable increased across all of the logistic regression models when the selected independent variables were included. For example, the overall ELS:2002 logistic regression model percentage accuracy increased from 69.65 percent for the intercept-only model to 77.12 percent for the fully specified model. For the BPS:04/09 model, the percentage accuracy increased from 53.17 percent for the intercept-only model to 75.47 percent for the fully specified model. The diagnostic results indicated adequate global fit of both regression models.
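The three kinds of diagnostics named above can be sketched in Python as follows. This is an unweighted illustration on made-up data, not a reproduction of the report's survey-weighted results. The report does not state which pseudo-R-squared measure was used, so McFadden's version is an assumption here, as are the function names and the 0.5 classification cutoff.

```python
import math


def log_likelihood(y_true, p_hat):
    """Bernoulli log-likelihood of the outcomes given predicted probabilities."""
    return sum(math.log(p) if y else math.log(1.0 - p) for y, p in zip(y_true, p_hat))


def lr_chi_square(ll_null, ll_full):
    """Likelihood ratio statistic comparing intercept-only and full models."""
    return 2.0 * (ll_full - ll_null)


def mcfadden_r2(ll_null, ll_full):
    """McFadden's pseudo-R-squared (assumed variant, see lead-in)."""
    return 1.0 - ll_full / ll_null


def percent_accuracy(y_true, p_hat, cutoff=0.5):
    """Percentage of cases classified correctly at the given cutoff."""
    hits = sum((p >= cutoff) == bool(y) for y, p in zip(y_true, p_hat))
    return 100.0 * hits / len(y_true)


# Made-up outcomes and fitted probabilities, for illustration only.
y = [1, 1, 1, 0, 0, 1, 0, 1]
p_full = [0.9, 0.8, 0.7, 0.2, 0.4, 0.6, 0.3, 0.45]
p_null = [sum(y) / len(y)] * len(y)  # intercept-only model predicts the sample mean

ll_null = log_likelihood(y, p_null)
ll_full = log_likelihood(y, p_full)
print(lr_chi_square(ll_null, ll_full), mcfadden_r2(ll_null, ll_full))
print(percent_accuracy(y, p_null), percent_accuracy(y, p_full))
```

The intercept-only accuracy equals the share of the more common outcome, which is why it serves as the baseline for the accuracy comparison.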

Odds ratios are calculated for each of the categorical independent variables used in the regression models and represent the odds of students in one category of an independent variable (referred to as the identity group) completing an event relative to the odds for a reference group. If the event is equally likely to occur for both groups, then the odds ratio value equals one. If a category has an odds ratio that is less than one, then students in the identity group have lower odds of immediate postsecondary enrollment than students in the reference group. For example, the odds ratio of 0.65 for males (table ELS-2) is the ratio of the odds of males immediately enrolling in postsecondary education after high school to the odds of females immediately enrolling, after accounting for the effect of all of the other predictor variables in the model. The odds ratio of 0.65 indicates that the odds of a male immediately enrolling in postsecondary education after high school graduation are 35 percent lower [computed as (odds ratio – 1) × 100 = (0.65 – 1) × 100 = –35] than the odds for a female (i.e., males are less likely than females to immediately enroll in postsecondary education). In this example, females are the reference group for the predictor variable. If a group category has an odds ratio greater than one, then students in the identity group are more likely to exhibit a certain outcome than students in the reference group. For example, the odds ratio of 1.63 for students who first enrolled in a 4-year postsecondary institution (table BPS-2) indicates that such a student has 63 percent higher odds of attaining a degree within 6 years than a student who first enrolled in a less-than-4-year institution. For continuous predictor variables, such as standardized test scores or number of postsecondary institution transfers, results are also interpreted in the form of odds ratios based on one unit of change in the independent variable. For example, in table ELS-2, the odds ratio of 1.88 for 9th-grade GPA indicates that a one-point increase in a student's 9th-grade GPA value (e.g., from a 2.0 to a 3.0) is associated with an 88 percent increase in the odds of the student immediately enrolling in postsecondary education. Asterisks (*) are used in the chapter tables to denote findings that are statistically significant at the .05 level.
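The conversion from an odds ratio to a percentage change in odds used in the examples above can be checked with a few lines of Python. The function name is hypothetical; the three odds ratios are the ones quoted in the text.

```python
def percent_change_in_odds(odds_ratio: float) -> float:
    """Percentage change in odds relative to the reference group: (OR - 1) * 100."""
    return (odds_ratio - 1.0) * 100.0


examples = [
    ("males vs. females, table ELS-2", 0.65),
    ("first enrolled at 4-year institution, table BPS-2", 1.63),
    ("one-point increase in 9th-grade GPA, table ELS-2", 1.88),
]

for label, value in examples:
    print(f"{label}: {percent_change_in_odds(value):+.0f} percent")
```

A negative result corresponds to lower odds than the reference group and a positive result to higher odds.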

Imputation procedures for ELS:2002 data

Prior to conducting logistic regression analyses with the ELS:2002 data, sequential regression multiple imputation (SRMI) was used to impute missing values for the subset of variables that were planned for inclusion in the analysis. This method was implemented in IVEware: Imputation and Variance Estimation Software®. Research Triangle Institute conducted the imputation procedures and prepared the technical documentation for the analysis. This section provides justification for using the SRMI imputation method and details about the steps taken to conduct imputation procedures for the purposes of this report. More information about the SRMI procedure can be found in Raghunathan et al. (2001).

The SRMI methodology provides two main advantages. The first is that it can be used to impute missing values for many types of variables—that is, categorical (binary and nominal), continuous, count, and mixed13—so that imputations are tailored to the specific type of variable that is being imputed. Categorical variables are imputed using logistic regression for binary variables and polychotomous regression for nominal variables. Continuous variables are imputed using linear regression. Count variables are imputed using Poisson regression. Mixed variables are imputed using a two-stage process: the first stage imputes a binary value, and the second stage imputes a continuous value for the first-stage imputed values that were imputed as a value of one. For each of these types of models, one can also include restrictions on observations that will receive an imputed value and bounds on the range of imputed values. The second advantage of the SRMI methodology is that it can use all of the available information in a dataset to impute each variable. That is, it takes advantage of all the variables in a dataset to produce the most informed and realistic imputed values. It can iterate through the variables in the dataset several times to reinforce the relationships among variables and improve the imputed values.

As a preliminary step, about 80 ELS:2002 variables were selected for imputation procedures. The variables selected included potential dependent and independent variables planned for the logistic regression model, as well as covariates that were not part of the model but were thought to be related to the variables with missing values. The 80 variables were assigned an appropriate "missing" code to be used in the imputation software so the software would recognize the data as missing and require imputation. Next, the variable type (i.e., categorical, continuous, count, or mixed) was identified for each variable. Lastly, bounds were set on the imputed values to identify the range of the valid responses for each variable. After these steps were completed, the data were ready for imputation. Fifty-five variables that were originally planned for use in the ELS:2002 logistic regression analysis required imputation. The percentage of missing values for these variables ranged from 0.02 to 33.64 percent (see exhibit A).

SRMI was conducted independently for each of the five imputed datasets that were created for this project. Below is a brief description of the methodology used for creating each dataset.

Let X be the matrix of variables that have no missing values. Also, let there be m variables with missing values, ordered from the variable with the lowest percentage of missing values to the variable with the highest percentage of missing values. These variables are denoted by the vectors $y_1, y_2, \ldots, y_m$. There were five iterations of imputations within each of the five imputed datasets. In the first iteration, $y_1$ was regressed on X for the observations that had a valid value for $y_1$. The information produced from this regression was used to impute the missing values of $y_1$ to create $y_1^*$ (the asterisk indicates that the $y_1$ vector included the imputed values). Next, $y_2$ was regressed on X and $y_1^*$, and the information from this regression was used to impute the missing values of $y_2$ (thus creating $y_2^*$). This process continued until $y_m$ was regressed on X, $y_1^*, \ldots, y_{m-1}^*$, and the missing values for $y_m$ were imputed, creating $y_m^*$. This completed the first iteration.

For the second through fifth iterations of imputation, the same general process was followed except that every variable (including the imputed values for the imputed variables) other than the variable being imputed was used in the regression. For the variable requiring imputation, the original variable (including the missing values) was modeled. For example, to impute $y_i$ (the original variable with missing values) in the second iteration, we regressed $y_i$ on X, $y_1^*, \ldots, y_{i-1}^*, y_{i+1}^*, \ldots, y_m^*$, using the imputed values, $y_1^*, \ldots, y_{i-1}^*, y_{i+1}^*, \ldots, y_m^*$, from the first iteration. For the third iteration, we regressed $y_i$ on X and the imputed values from the second iteration. For the fourth iteration, we regressed $y_i$ on X and the imputed values from the third iteration. Finally, for the fifth iteration, we regressed $y_i$ on X and the imputed values from the fourth iteration. After the fifth iteration was completed, the imputed values, $y_1^*, \ldots, y_m^*$, from the fifth iteration were retained for each of the five imputed datasets.
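The order in which variables are imputed, and the predictors available at each step, can be sketched in Python. This sketch only lays out the schedule described above; it does not fit any regression and is not IVEware's implementation. The function name and the complete-variable names "X1" and "X2" are hypothetical, while the three target variables and their missing percentages come from exhibit A.

```python
def srmi_schedule(missing_pct, complete_vars, n_iter=5):
    """Return (iteration, target, predictors) steps for sequential imputation."""
    # Targets are ordered from lowest to highest percentage of missing values.
    targets = sorted(missing_pct, key=missing_pct.get)
    steps = []
    for iteration in range(1, n_iter + 1):
        for i, target in enumerate(targets):
            if iteration == 1:
                # First pass: only variables already imputed in this pass.
                imputed = targets[:i]
            else:
                # Later passes: every other variable, using its latest imputed values.
                imputed = targets[:i] + targets[i + 1:]
            steps.append((iteration, target, list(complete_vars) + imputed))
    return steps


# Three variables from exhibit A with their percentages of missing values.
missing = {"BYS91": 33.64, "F1RGP9": 2.78, "BYS75": 28.42}

for step in srmi_schedule(missing, ["X1", "X2"], n_iter=2):
    print(step)
```

In the first pass, F1RGP9 is modeled from the complete variables alone; from the second pass onward, each target is modeled from the complete variables plus all other imputed variables.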

Applying this methodology to the ELS:2002 dataset, IVEware was used to produce five files that included the 55 variables selected for imputation and the other 25 variables that did not require imputation. Once the imputation procedures were completed, quality checks were performed to ensure that the imputed data had the same format as the original data. In addition, quality checks were developed specifically for both categorical and continuous variables. Distributions before and after imputation were visually reviewed to assess whether the imputed values were reasonable and to identify any significant deviations between the distributions. Furthermore, numeric checks were based on the percentages of each category for the categorical variables and on the quantiles represented by the minimum value, deciles, and maximum value for continuous variables. Large deviations in the relative proportions of imputed and unimputed values within categories or deviations for the imputed and unimputed densities for continuous variables would indicate a variable that should be investigated. The quality control checks did not detect any concerns with the imputation procedures.
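The numeric checks described above can be sketched in Python as a comparison of category percentages and of the minimum, deciles, and maximum before and after imputation. The function names and the sample data are hypothetical, and the sketch stops short of a pass/fail threshold because the report does not specify one.

```python
from collections import Counter
from statistics import quantiles


def category_percentages(values):
    """Percentage of cases in each category."""
    counts = Counter(values)
    return {category: 100.0 * n / len(values) for category, n in counts.items()}


def max_category_shift(observed, completed):
    """Largest absolute change in any category's percentage after imputation."""
    before = category_percentages(observed)
    after = category_percentages(completed)
    categories = set(before) | set(after)
    return max(abs(after.get(c, 0.0) - before.get(c, 0.0)) for c in categories)


def decile_profile(values):
    """Minimum, nine decile cut points, and maximum of a continuous variable."""
    return [min(values)] + quantiles(values, n=10) + [max(values)]


# Hypothetical categorical variable: observed cases, then observed plus imputed.
observed = ["never"] * 6 + ["some"] * 4
completed = observed + ["never", "some"]
print(max_category_shift(observed, completed))

# Hypothetical continuous variable (e.g., a GPA-like scale).
print(decile_profile([1.8, 2.0, 2.3, 2.5, 2.7, 3.0, 3.1, 3.4, 3.6, 3.9, 4.0]))
```

Running the same two profiles on the observed-only and the completed data gives the side-by-side figures that a reviewer would scan for large deviations.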

One consideration in using SRMI is that it assumes that the dataset was generated from a simple random sample design. However, most complex survey designs involve stratification, clustering, and differential weighting. To account for this consideration, the survey design information—i.e., stratum and cluster (school)—and a weight were used in the imputation models.
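Each of the five imputed datasets yields its own coefficient estimates, which the multiple imputation option in the analysis software then combines. The report does not spell out the combining formulas; the sketch below assumes the standard multiple-imputation combining rules (Rubin's rules) and uses hypothetical estimates and variances. The function name is also hypothetical.

```python
from statistics import mean, variance


def pool_estimates(estimates, variances):
    """Combine per-dataset estimates and variances across m imputed datasets."""
    m = len(estimates)
    pooled = mean(estimates)          # average point estimate
    within = mean(variances)          # average within-imputation variance
    between = variance(estimates)     # between-imputation variance (divisor m - 1)
    total = within + (1.0 + 1.0 / m) * between
    return pooled, total


# Hypothetical coefficient estimates and squared standard errors from five datasets.
coefs = [0.60, 0.62, 0.64, 0.66, 0.68]
sq_errors = [0.01, 0.01, 0.01, 0.01, 0.01]

pooled, total_var = pool_estimates(coefs, sq_errors)
print(pooled, total_var, total_var ** 0.5)
```

The between-imputation term is what makes the pooled standard error reflect the uncertainty introduced by imputing rather than observing the missing values.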

Exhibit A: ELS:2002 variables requiring imputation

| Variable name | Variable label | Type | Count | Skip | Valid | Missing | Missing (%) | Response (%) |
|---|---|---|---|---|---|---|---|---|
| F2B01 | Ever applied to postsecondary school | Categorical | 15,689 | 1,650 | 14,036 | 3 | 0.02 | 99.98 |
| F2A02 | Type of high school credential received—diploma/certificate/GED | Categorical | 15,689 | 12,878 | 2,808 | 3 | 0.11 | 99.89 |
| F2PSEND | Last period of postsecondary education (i.e., persistence) | Categorical | 15,689 | 5,155 | 10,513 | 21 | 0.2 | 99.8 |
| F2PSSTRT | When started postsecondary education | Categorical | 15,689 | 5,155 | 10,513 | 21 | 0.2 | 99.8 |
| F2PS1FTP | Enrollment intensity at first postsecondary institution | Categorical | 15,689 | 5,155 | 10,511 | 23 | 0.22 | 99.78 |
| F2B22 | Major declared/undeclared | Categorical | 15,689 | 7,114 | 8,551 | 24 | 0.28 | 99.72 |
| F2B18A | Talk with faculty about academic matters outside of class | Categorical | 15,689 | 5,155 | 10,500 | 34 | 0.32 | 99.68 |
| F2B18B | Meet with advisor about academic plans | Categorical | 15,689 | 5,155 | 10,492 | 42 | 0.4 | 99.6 |
| F1S15 | Diploma or certificate most likely to receive | Categorical | 15,689 | 1,506 | 14,119 | 64 | 0.45 | 99.55 |
| F2B18G | Participate in other extracurricular activities | Categorical | 15,689 | 5,155 | 10,480 | 54 | 0.51 | 99.49 |
| F2PS1REM | Took math/writing/reading remedial course at 1st postsec institution | Categorical | 15,689 | 1,542 | 14,072 | 75 | 0.53 | 99.47 |
| F2B18E | Participate in intramural or nonvarsity sports | Categorical | 15,689 | 5,155 | 10,471 | 63 | 0.6 | 99.4 |
| F2B18F | Participate in varsity or intercollegiate sports | Categorical | 15,689 | 5,155 | 10,470 | 64 | 0.61 | 99.39 |
| F1S14 | Grade level (at first follow-up) | Categorical | 15,689 | 2,064 | 13,541 | 84 | 0.62 | 99.38 |
| F1S21C | Took or plans to take SAT or ACT | Categorical | 15,689 | 2,064 | 13,447 | 178 | 1.31 | 98.69 |
| F1S65A | How many friends dropped out of high school | Count | 15,689 | 826 | 14,634 | 229 | 1.54 | 98.46 |
| BYS37 | Importance of good grades to student | Categorical | 15,689 | 884 | 14,545 | 260 | 1.76 | 98.24 |
| BYXTRACU | Number of school-sponsored activities participated in 01–02 | Count | 15,689 | 884 | 14,526 | 279 | 1.88 | 98.12 |
| F1WRKHRS | F1 hours worked per week during 03–04 school year | Mixed | 15,689 | 826 | 14,566 | 297 | 2 | 98 |
| F1S65D | How many friends plan to attend 4-year college/university | Count | 15,689 | 826 | 14,557 | 306 | 2.06 | 97.94 |
| F1S65B | How many friends plan to have full-time job after high school | Count | 15,689 | 826 | 14,548 | 315 | 2.12 | 97.88 |
| F2B29A | No longer enrolled due to completion of degree/certificate | Categorical | 15,689 | 13,730 | 1,917 | 42 | 2.14 | 97.86 |
| F2C31P | Hours worked weekly during 2005–2006 school year—categorical | Continuous | 15,689 | 8,930 | 6,602 | 157 | 2.32 | 97.68 |
| F1S65C | How many friends plan to attend 2-year community college or technical school | Count | 15,689 | 826 | 14,501 | 362 | 2.44 | 97.56 |
| F1RGP9 | GPA for all 9th-grade courses | Continuous | 15,689 | 1,294 | 13,995 | 400 | 2.78 | 97.22 |
| F2C26P | Hours worked weekly during 2004–2005 school year—categorical | Continuous | 15,689 | 8,938 | 6,515 | 236 | 3.5 | 96.5 |
| BYS28 | How much likes school | Categorical | 15,689 | 884 | 14,277 | 528 | 3.57 | 96.43 |
| BYS57 | Plans to continue education after high school | Categorical | 15,689 | 1,843 | 13,226 | 620 | 4.48 | 95.52 |
| BYS24B | How many times cut/skip classes | Count | 15,689 | 884 | 14,039 | 766 | 5.17 | 94.83 |
| BYNSPORT | BY number of interscholastic sports participated in at V or JV level | Count | 15,689 | 884 | 13,945 | 860 | 5.81 | 94.19 |
| BYS33H | Ever in dropout prevention program | Categorical | 15,689 | 884 | 13,935 | 870 | 5.88 | 94.12 |
| BYS33L | Ever in program to help prepare for college | Categorical | 15,689 | 884 | 13,911 | 894 | 6.04 | 93.96 |
| BYS33I | Ever in special education program | Categorical | 15,689 | 884 | 13,907 | 898 | 6.07 | 93.93 |
| BYS33K | Ever in career academy | Categorical | 15,689 | 884 | 13,866 | 939 | 6.34 | 93.66 |
| BYS26 | High school program-student self-report | Categorical | 15,689 | 884 | 13,857 | 948 | 6.4 | 93.6 |
| BYS33G | Ever in English as a Second Language program | Categorical | 15,689 | 884 | 13,844 | 961 | 6.49 | 93.51 |
| BYS33D | Ever in a remedial English class | Categorical | 15,689 | 884 | 13,720 | 1,085 | 7.33 | 92.67 |
| BYS33E | Ever in a remedial math class | Categorical | 15,689 | 884 | 13,685 | 1,120 | 7.57 | 92.44 |
| BYP46 | 10th-grader ever held back a grade | Categorical | 15,689 | 2,491 | 12,178 | 1,020 | 7.73 | 92.27 |
| BYS58 | Type of school plans to attend | Categorical | 15,689 | 2,212 | 12,345 | 1,132 | 8.4 | 91.6 |
| F1SARACE | Individual race variables | Categorical | 15,689 | 0 | 14,304 | 1,385 | 8.83 | 91.17 |
| F2HSATTM | High school attainment indicator (academic risk) | Categorical | 15,689 | 0 | 14,270 | 1,419 | 9.04 | 90.96 |
| BYS59A | Has gone to counselor for college entrance information | Categorical | 15,689 | 2,212 | 12,220 | 1,257 | 9.33 | 90.67 |
| BYS59B | Has gone to teacher for college entrance information | Categorical | 15,689 | 2,212 | 12,220 | 1,257 | 9.33 | 90.67 |
| BYS59C | Has gone to coach for college entrance information | Categorical | 15,689 | 2,212 | 12,220 | 1,257 | 9.33 | 90.67 |
| BYP09 | Number of siblings who dropped out of high school | Count | 15,689 | 3,233 | 11,208 | 1,248 | 10.02 | 89.98 |
| BYS56 | How far in school student thinks will get | Categorical | 15,689 | 884 | 13,096 | 1,709 | 11.54 | 88.46 |
| F1RGPA | Transcript reported cumulative GPA | Continuous | 15,689 | 1,294 | 12,550 | 1,845 | 12.82 | 87.18 |
| BYS86A | How often discussed school courses with parents | Categorical | 15,689 | 884 | 12,248 | 2,557 | 17.27 | 82.73 |
| BYS86B | How often discussed school activities with parents | Categorical | 15,689 | 884 | 12,224 | 2,581 | 17.43 | 82.57 |
| BYS86G | How often discussed going to college with parents | Categorical | 15,689 | 884 | 12,097 | 2,708 | 18.29 | 81.71 |
| BYS75 | How many hours usually works a week | Continuous | 15,689 | 6,151 | 6,827 | 2,711 | 28.42 | 71.58 |
| BYS90F | Important to friends to finish high school | Categorical | 15,689 | 884 | 10,334 | 4,471 | 30.2 | 69.8 |
| BYS90H | Important to friends to continue education past high school | Categorical | 15,689 | 884 | 10,272 | 4,533 | 30.62 | 69.38 |
| BYS91 | Number of close friends who dropped out | Categorical | 15,689 | 884 | 9,824 | 4,981 | 33.64 | 66.36 |

Glossary of variables used in regression analyses

ELS:2002 variables

When started postsecondary education (F2PSSTRT). First period of attendance at the student's first attended postsecondary institution. For the logistic regression analysis, students were grouped into two categories: "immediate postsecondary enrollment" if they enrolled in their first "real" postsecondary institution by December 2004 and "no postsecondary enrollment" if they either enrolled in their first "real" postsecondary institution after December 2004 or they had no postsecondary enrollment through 2006.

First follow-up sex composite (F1SEX). For base-year students, this variable was constructed from the base-year student questionnaire or, where missing, from (in order of preference) the school roster or logical imputation based on first name.

First follow-up student's race/ethnicity composite (restricted) (F1RACE_R). This race/ethnicity variable includes seven categories: (1) American Indian or Alaska Native; (2) Asian or Pacific Islander, including Native Hawaiian; (3) Black or African American; (4) Hispanic, no race specified; (5) Hispanic, race specified; (6) more than one race; and (7) White. Categories 1, 2, 3, 6, and 7 exclude individuals of Hispanic or Latino origin. For presentation in this report, categories 4 and 5 are combined into "Hispanic or Latino." The ELS:2002 race variables reflect new federal standards that require collecting race separately from ethnicity and allow students to mark more than one choice for race. For base-year students, information on race/ethnicity was obtained from the base-year student questionnaire when available or (in order of preference) from the sampling roster, the parent questionnaire (if the parent respondent was a biological parent), or logical imputation based on other questionnaire items (e.g., surname, native language). For the logistic regression analysis, results for "American Indian or Alaska Native," "Native Hawaiian/other Pacific Islander," and "Other" were collapsed into a single "Other race" category due to small sample sizes.

First follow-up socioeconomic status composite (F1SES2). F1SES2 is a composite variable constructed from parent questionnaire data, when available, and from imputation or student substitutions, when not. SES is based on five equally weighted, standardized components: father's/guardian's education (F1FATHED), mother's/guardian's education (F1MOTHED), family income (BYINCOME), father's/guardian's occupational prestige score (from F1OCCUFATH), and mother's/guardian's occupational prestige score (from F1OCCUMOTH). Father's and mother's education were based on parent reports when available; otherwise, on student reports. If still missing, they were imputed. Income was based on parent questionnaire information or imputed otherwise. The parent questionnaire was the preferred source of data for mother's and father's occupation. In the absence of parent questionnaire occupation data, student-supplied parent occupation information from the base year (for base-year respondents) was coded by project staff, if possible. Missing occupations were imputed.

First follow-up family composition (F1FCOMP). This variable indicates the student's family composition and was constructed using the reports of parents in 2002. It was coded into four categories: mother and father, mother or father and guardian, single parent (mother or father), and other. For the logistic regression analysis, students were grouped into two categories: "two-parent/guardian household" and "single-parent/guardian household."

Base-year mathematics standardized score (BYTXMSTD). The standardized T score provides a norm-referenced measurement of achievement: that is, an estimate of achievement relative to the population (spring 2002 10th-graders) as a whole. It provides information on status compared to peers (as distinguished from an IRT-estimated number-right score, which represents status with respect to achievement on a particular criterion set of test items). The transformation to a familiar metric with a mean of 50 and standard deviation of 10 facilitates comparisons in standard deviation units.

GPA for all 9th-grade courses (F1RGP9). Students' 9th-grade GPA was taken from high school transcript data and represents the GPA for all 9th-grade courses, based on a four-point scale (A = 4.0; F = 0.0).

10th-grader ever held back a grade (BYP46). This variable, taken directly from the parent questionnaire, indicates parents' response to the question, "Was your tenth-grader ever held back a grade in school?"

Base-year number of interscholastic sports participated in at varsity or junior varsity level (BYNSPORT). This variable is constructed based on a set of eight interscholastic sports and indicates the number of these sports that the student participated in during the 2001–02 school year, regardless of the level of participation (junior varsity or varsity). The eight sports used as inputs for this variable are baseball, softball, basketball, football, soccer, "other interscholastic team sport," "individual interscholastic team sport," and cheerleading/drill team. For the logistic regression analysis, students were grouped into two categories: "participated in sports" and "did not participate in sports."

Number of school-sponsored activities participated in during 2001–02 (BYXTRACU). This variable is constructed based on a set of nine school-sponsored activities and indicates the number of these activities that the student participated in during the 2001–02 school year. The nine school-sponsored activities used as inputs for this variable are school band/chorus, a school play or musical, student government, academic honor society, school yearbook or newspaper, school service clubs, school academic clubs, school hobby clubs, and school vocational clubs. For the logistic regression analysis, students were grouped into three categories: "no extracurricular activities," "one extracurricular activity," and "two or more extracurricular activities."

How many times absent from school (BYS24C). This variable, taken directly from the student questionnaire, indicates how many times the student was absent from school in the first semester or term of the school year: "never," "1–2 times," "3–6 times," "7–9 times," or "10 or more times." For the logistic regression analyses, the responses were collapsed into three categories: "absent 0–2 times," "absent 3–6 times," and "absent 7 or more times."

How many times cut/skip classes (BYS24B). This variable, taken directly from the student questionnaire, indicates how many times the student cut or skipped class in the first semester or term of the school year: "never," "1–2 times," "3–6 times," "7–9 times," or "10 or more times." For the logistic regression analysis, the responses were collapsed into two categories: "never skipped class" and "skipped class at least once."

How often discussed school courses with parents (BYS86A). This variable indicates students' response to the survey question, "In the first semester or term of this school year, how often have you discussed the following with either or both of your parents or guardians? a. Selecting courses or programs at school." Response options were "never," "sometimes," and "often." For the logistic regression analysis, all three response options were included.

How many hours usually works a week (BYS75). This student questionnaire variable is top-coded at 41 hours or more. All students who had ever worked for pay were instructed to report the number of hours they usually work/worked each week. Variable is based on BYS72 ("Have you ever worked for pay/are you currently employed?") and BYS75 ("How many hours do/did you work each week on your current or most recent job?"). For the logistic regression analysis, the data were collapsed into "no hours," "1 to 20 hours per week," and "more than 20 hours per week."

Number of close friends who dropped out (BYS91). This variable indicates students' response to the survey question, "Altogether, how many of your close friends have dropped out of school before graduating? (Do not include those who have transferred to another school.)." Response options include "none," "some," "most," or "all of them." For the logistic regression analysis, the categories were collapsed into "no friends dropped out of high school" and "one or more friends dropped out of high school."
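Several of the ELS:2002 entries above describe collapsing questionnaire responses into fewer analysis categories. A minimal Python sketch of three of those recodes follows; the dictionary and function names are hypothetical, and the category labels are the ones given in the glossary entries for BYS24C, BYS24B, and BYS75.

```python
# Absences (BYS24C): five response options collapsed into three categories.
ABSENCE_RECODE = {
    "never": "absent 0–2 times",
    "1–2 times": "absent 0–2 times",
    "3–6 times": "absent 3–6 times",
    "7–9 times": "absent 7 or more times",
    "10 or more times": "absent 7 or more times",
}

# Skipped classes (BYS24B): five response options collapsed into two categories.
SKIP_RECODE = {
    response: ("never skipped class" if response == "never" else "skipped class at least once")
    for response in ABSENCE_RECODE
}


def recode_hours_worked(hours: int) -> str:
    """Hours worked per week (BYS75) collapsed into three categories."""
    if hours == 0:
        return "no hours"
    if hours <= 20:
        return "1 to 20 hours per week"
    return "more than 20 hours per week"


print(ABSENCE_RECODE["7–9 times"])
print(SKIP_RECODE["1–2 times"])
print(recode_hours_worked(25))
```

Keeping each recode as an explicit lookup makes the mapping from questionnaire response to model category easy to audit against the glossary text.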

BPS:04/09 variables

Attainment or level of last institution enrolled in through 2009 (PRLVL6Y). Indicates the highest degree attained or, if no degree was attained, the level of the institution where the student was enrolled in the spring of 2009. Response options for this variable include "attained bachelor's degree," "attained associate's degree," "attained certificate," "no degree, enrolled at 4-year," "no degree, enrolled at less-than-4-year," and "no degree, not enrolled." For the logistic regression analysis, the categories "attained bachelor's degree" and "attained associate's degree" were collapsed into "attained a degree within 6 years of postsecondary enrollment" and the categories "no degree, enrolled at 4-year," "no degree, enrolled at less-than-4-year," and "no degree, not enrolled" were collapsed into "did not attain a degree within 6 years of postsecondary enrollment."

Gender (GENDER). Indicates the student's sex.

Race/ethnicity (RACE). This race/ethnicity variable includes eight categories: (1) White; (2) Black or African American; (3) Hispanic or Latino; (4) Asian; (5) American Indian or Alaska Native; (6) Native Hawaiian/other Pacific Islander; (7) Other; and (8) more than one race. For the logistic regression analysis, the results for "American Indian or Alaska Native," "Native Hawaiian/other Pacific Islander," "Other," and "more than one race" were collapsed into a single "Other race" category due to small sample sizes.

Parent's highest level of education (PAREDUC). Indicates the highest level of education of either parent of the student during the 2003–04 academic year. Response options for this variable include "don't know," "did not complete high school," "high school diploma or equivalent," "vocational or technical training," "less than 2 years of college," "associate's degree," "2 or more years of college but no degree," "bachelor's degree," "master's degree or equivalent," "first-professional degree," and "doctoral degree or equivalent." For the logistic regression, cases with values of "don't know" were dropped from the model; the "did not complete high school," "high school diploma or equivalent," and "vocational or technical training" categories were collapsed into a "HS diploma or less and vocational/technical training" category; the "less than 2 years of college," "associate's degree," and "2 or more years of college but no degree" categories were collapsed into a "some college, less than bachelor's degree" category; and the "bachelor's degree," "master's degree or equivalent," "first-professional degree," and "doctoral degree or equivalent" categories were collapsed into a "bachelor's or higher degree" category.
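
The PAREDUC recoding described above can be illustrated with a short Python sketch. This is only an illustration under stated assumptions, not the code used for the report: the function and dictionary names are hypothetical, and the response labels are taken from the text above rather than from the numeric codes in the BPS:04/09 data file.

```python
# Hypothetical illustration of the PAREDUC collapse described in the text.
# Keys are the response labels as worded in this glossary; the actual data
# file stores numeric codes, which are not reproduced here.
HS_OR_LESS = "HS diploma or less and vocational/technical training"
SOME_COLLEGE = "some college, less than bachelor's degree"
BACHELORS_PLUS = "bachelor's or higher degree"

PARENT_EDUCATION_GROUPS = {
    "did not complete high school": HS_OR_LESS,
    "high school diploma or equivalent": HS_OR_LESS,
    "vocational or technical training": HS_OR_LESS,
    "less than 2 years of college": SOME_COLLEGE,
    "associate's degree": SOME_COLLEGE,
    "2 or more years of college but no degree": SOME_COLLEGE,
    "bachelor's degree": BACHELORS_PLUS,
    "master's degree or equivalent": BACHELORS_PLUS,
    "first-professional degree": BACHELORS_PLUS,
    "doctoral degree or equivalent": BACHELORS_PLUS,
}


def collapse_parent_education(response):
    """Return the collapsed category, or None for "don't know".

    A None result marks a case that would be dropped from the model.
    """
    if response == "don't know":
        return None
    return PARENT_EDUCATION_GROUPS[response]


def recode_cases(responses):
    """Collapse a list of responses and drop the "don't know" cases."""
    collapsed = (collapse_parent_education(r) for r in responses)
    return [c for c in collapsed if c is not None]
```

For example, recode_cases(["associate's degree", "don't know", "doctoral degree or equivalent"]) keeps two cases, one in the "some college" group and one in the "bachelor's or higher degree" group.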

Income quartile in 2003–04 (INCGRP). Indicates the income group of the student, based on total income in 2002 for independent students or parents of dependent students. Income groups were determined separately for dependent and independent students based on percentile rankings and then combined into one variable.

Highest level of high school mathematics (HCMATH). Indicates the highest level of mathematics that the student completed or planned to take, according to self-reporting on the standardized test questionnaire and student interview. Response options for this variable include "none of these," "algebra II," "trigonometry/algebra II," "pre-calculus," and "calculus." For the logistic regression analysis, the "algebra II" and "trigonometry/algebra II" categories were collapsed into an "algebra II/trigonometry" category, and the "pre-calculus" and "calculus" categories were collapsed into a "pre-calculus/calculus" category.

Earned any college level credits in high school (CRDHS04). Indicates whether the student earned any college credits while he/she was in high school.

SAT or ACT exams taken (TETOOK). Indicates whether the student took the SAT or ACT college entrance exam. A student is considered to have taken an exam if the agency or institution reports a test score or the student reports in the student interview having taken the test. Response options for this variable include "did not take SAT or ACT," "took only the SAT," "took only the ACT," and "took both the SAT and ACT." For the logistic regression analysis, the categories were collapsed into "did not take SAT or ACT" and "took an SAT or ACT."

First institution control 2003–04 (FCONTROL). Indicates the control of the first institution (public, private nonprofit, or private for-profit) that the student attended during the 2003–04 academic year.

First institution level 2003–04 (FLEVEL). Indicates the level of the first institution that the student attended during the 2003–04 academic year. Response options for this variable include "4-year," "2-year," and "less-than-2-year." For the logistic regression analysis, the "2-year" and "less-than-2-year" categories were collapsed into a "less-than-4-year" category.

Major during first year 2003–04 (MAJORS). Student's major or field of study during the 2003–04 academic year. For the logistic regression analysis, responses were collapsed into two categories: "no major declared" and "major declared."

Remedial course 2004: Any taken (REMETOOK). Indicates whether the student took any remedial or developmental courses during the 2003–04 academic year.

Frequency 2004: Meet academic advisor (FREQ04C). Indicates whether or how often the student met with an advisor concerning academic plans during the 2003–04 academic year. For the logistic regression analysis, the "sometimes" and "often" categories were collapsed into a single "yes" category that indicated participation.

Frequency 2004: School clubs (FREQ04E). Indicates whether or how often the student participated in school clubs during the 2003–04 academic year. For the logistic regression analysis, the "sometimes" and "often" categories were collapsed into a single "yes" category that indicated participation.

Frequency 2004: School sports (FREQ04F). Indicates whether or how often the student participated in varsity, intramural, or club sports during the 2003–04 academic year. For the logistic regression analysis, the "sometimes" and "often" categories were collapsed into a single "yes" category that indicated participation.

Job 2004: hours worked per week (including work study) (JOBHOUR2). Indicates the average hours the student worked per week. For the logistic regression analysis, this continuous variable was categorized into three groups: "not working," "working less than 20 hours a week," and "working 20 or more hours a week."
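
As a rough illustration of how a continuous hours variable such as JOBHOUR2 can be grouped into the three categories named above, consider the Python sketch below. The function name is hypothetical, and the sketch assumes that a value of zero hours identifies students who were not working; the source does not state how that group was identified in the data.

```python
def categorize_job_hours(hours_per_week):
    """Place average weekly hours into one of three groups.

    Assumption (not stated in the source): zero hours means "not working".
    """
    if hours_per_week < 0:
        raise ValueError("hours_per_week must be non-negative")
    if hours_per_week == 0:
        return "not working"
    if hours_per_week < 20:
        return "working less than 20 hours a week"
    return "working 20 or more hours a week"
```

Under this sketch a student averaging exactly 20 hours falls in the "working 20 or more hours a week" group, which matches the category wording above.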

Attendance intensity pattern through 2009 (ENINPT6Y). Pattern of enrollment intensity for all months enrolled through June 2009. Response options for this variable include "always full-time," "always part-time," and "mixed." For the logistic regression analysis, the "always part-time" and "mixed" categories were collapsed into a "not full-time" category.

Stopouts number anywhere through 2009 (STNUM6Y). Number of stopouts at all institutions attended, as of June 2009. A stopout is defined as a temporary withdrawal of 5 or more consecutive months from enrollment at a postsecondary institution.
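
The stopout definition above can be made concrete with a small Python sketch. It is an illustration only and not the procedure used to derive STNUM6Y. It assumes a hypothetical month-by-month enrollment record (True for enrolled, False for not enrolled) and reads "temporary" to mean that the student later re-enrolled, so a final departure with no return is not counted.

```python
def count_stopouts(enrolled_by_month, min_gap=5):
    """Count breaks of at least `min_gap` consecutive non-enrolled months.

    Only breaks that fall between two enrolled months are counted, which
    is this sketch's reading of a "temporary" withdrawal.
    """
    stopouts = 0
    gap = 0                  # length of the current run of non-enrolled months
    seen_enrollment = False  # months before first enrollment are ignored
    for enrolled in enrolled_by_month:
        if enrolled:
            if seen_enrollment and gap >= min_gap:
                stopouts += 1
            seen_enrollment = True
            gap = 0
        elif seen_enrollment:
            gap += 1
    return stopouts
```

For instance, four enrolled months, five months away, and three more enrolled months yield one stopout, while a four-month break yields none.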

Number of transfers as of June 2009 (TFNUM6Y). Number of transfers between institutions between entry to postsecondary education and June 2009.

12 A "stopout" is defined as a temporary withdrawal of 5 or more consecutive months from enrollment at a postsecondary institution.
13 A mixed variable is a continuous variable with a significant number of observations having a zero value. If one looked at a density plot of a mixed variable, there would be a spike at zero with the rest of the distribution taking on any number of shapes (e.g., approximately normal, skewed).