The variance estimation procedure had to take into account the complex sample design, including stratification and clustering. One common procedure for estimating variances of survey statistics is the Taylor series linearization procedure. This procedure takes the first-order Taylor series approximation of the nonlinear statistic and then substitutes the linear representation into the appropriate variance formula based on the sample design. For stratified multistage surveys, the Taylor series procedure requires analysis strata and analysis primary sampling units (PSUs). Therefore, analysis strata and analysis PSUs were created. The impact of the departure of the ELS:2002 complex sample design from a simple random sample design on the precision of sample estimates can be measured by the design effect.
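The linearization step described above can be sketched for a weighted mean, which is a ratio of two weighted totals. This is a minimal illustration, not the ELS:2002 production procedure: it assumes the common with-replacement approximation at the PSU level, and the function name and the record layout (stratum, PSU, weight, value) are hypothetical. It also shows why each analysis stratum needs at least two analysis PSUs: the between-PSU variance within a stratum is undefined otherwise.

```python
from collections import defaultdict


def linearized_variance_of_mean(records):
    """Estimate a weighted mean and its Taylor-linearized variance.

    records: iterable of (stratum, psu, weight, y) tuples.
    Assumes PSUs are treated as sampled with replacement within strata.
    """
    records = list(records)
    total_w = sum(w for _, _, w, _ in records)
    mean = sum(w * y for _, _, w, y in records) / total_w

    # First-order linearization of the ratio sum(w*y) / sum(w):
    # each unit contributes z = w * (y - mean) / total_w.
    # Accumulate z to PSU totals within each stratum.
    psu_totals = defaultdict(lambda: defaultdict(float))
    for stratum, psu, w, y in records:
        psu_totals[stratum][psu] += w * (y - mean) / total_w

    # Sum the between-PSU variance of the linearized totals over strata.
    variance = 0.0
    for stratum, psus in psu_totals.items():
        n_h = len(psus)
        if n_h < 2:
            raise ValueError(f"stratum {stratum!r} needs at least two PSUs")
        z_bar = sum(psus.values()) / n_h
        variance += n_h / (n_h - 1) * sum((z - z_bar) ** 2 for z in psus.values())
    return mean, variance
```

The standard error is the square root of the returned variance. Strata with a single PSU would have to be collapsed with a neighboring stratum before calling the function.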
Design effects. The ELS:2002 sample departs from the assumption of simple random sampling in three major respects: student samples were stratified by student characteristics, students were selected with unequal probabilities of selection, and the sample of students was clustered by school. A simple random sample is, by contrast, unclustered and not stratified. Additionally, in a simple random sample, all members of the population have the same probability of selection. Generally, clustering and unequal probabilities of selection increase the variance of sample estimates relative to a simple random sample, and stratification decreases the variance of estimates.
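As a rough illustration of the definition, the design effect is the ratio of the variance under the actual design to the variance a simple random sample of the same size would give. The helper names below are hypothetical, and applying an average design effect to a single estimate is only an approximation.

```python
import math


def design_effect(var_complex, var_srs):
    """Ratio of the design-based variance to the simple-random-sample variance."""
    return var_complex / var_srs


def effective_sample_size(n, deff):
    """Size of a simple random sample that would give the same precision."""
    return n / deff


def srs_standard_error_of_proportion(p, n):
    """Standard error of a proportion under simple random sampling."""
    return math.sqrt(p * (1 - p) / n)


def design_adjusted_standard_error(se_srs, deff):
    """Inflate a simple-random-sample standard error by the root design effect."""
    return se_srs * math.sqrt(deff)
```

For example, with a design effect of 2.35, a sample of 15,360 students carries about as much information as a simple random sample of roughly 6,500 students (`effective_sample_size(15360, 2.35)`).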
In the ELS:2002 base-year study, standard errors and design effects were computed at the first stage (school level) and at the second stage (student level). The school administrator questionnaire was the basis for the school-level calculations; however, two items from the library questionnaire were also included. For student-level calculations, items from both the student and parent questionnaires were used. Therefore, three sets of standard errors and design effects were computed (school, student, and parent), which is similar to what was done for NELS:88. Each of the three sets includes standard errors and design effects for 30 means and proportions overall and for subgroups.
The student-level base-year design effects indicate that the ELS:2002 base-year sample was more efficient than the NELS:88 sample and the HS&B sample. For means and proportions based on student questionnaire data for all students, the average design effect in ELS:2002 was 2.35; the comparable figures were 3.86 for NELS:88 sophomores and 2.88 for the HS&B sophomore cohort. For all subgroups, the ELS:2002 design effects are smaller, on average, than those for the HS&B sophomore cohort. The smaller design effects in ELS:2002 compared to those for NELS:88 sophomores are probably due to disproportional strata representation introduced by subsampling in the NELS:88 first follow-up. The smaller design effects in ELS:2002 compared to those for the HS&B sophomore cohort may reflect the somewhat smaller cluster size used in the later survey.
The ELS:2002 parent-level design effects are similar to the student-level design effects. For estimates applying to all students, the average design effect was 2.24 for the parent data and 2.35 for the student data. For almost all subgroups, the average design effect was lower for the parent data than for the student data.
The school-level design effects reflect only the impact of stratification and unequal probabilities of selection because the sample of schools was not clustered. Therefore, it could be expected that the design effects for estimates based on school data would be small compared to those for estimates based on student and parent data. However, this is not the case, as the school average design effect is 2.76. The reason is that the sample was designed to produce student-level estimates with low design effects. In addition to stratifying schools, a composite measure of size was used for school sample selection based on the number of students enrolled by race. This is different from the methodology used for NELS:88. The NELS:88 average school design effect in the base-year study was considerably lower: 1.82.
The first follow-up design effects are lower for all respondents and for most of the subgroups than the base-year design effects. For the full sample, the design effect for males is the same as in the base year, the design effects for American Indian or Alaska Native and for multiracial respondents are greater than in the base year, and the design effects for the other 14 subgroups are lower than in the base year. For the panel sample, the design effects for American Indian or Alaska Native and for multiracial respondents are greater than in the base year, and the design effects for the other 15 subgroups are lower than in the base year.
The second follow-up study design effects are lower for all respondents and for all of the common subgroups used in design effects calculations than the base-year and first follow-up design effects. The items used to compute the mean design effects were different in the third follow-up than in prior rounds because the design effects were not expected to change much across the four rounds of the study.
Coverage Error. In ELS:2002 base-year contextual samples, the coverage rate is the proportion of the responding student sample with a report from a given contextual source (e.g., the parent survey, the teacher survey, or the school administrator survey). For the teacher survey, the student coverage rate can be calculated as either the percentage of participating students with two teacher reports or the percentage with at least one teacher report. The teacher and parent surveys in ELS:2002 are purely contextual. The school-level surveys (school administrator, library media center, facilities checklist) can be used contextually (with the student as the unit of analysis) or in standalone fashion (with the school as the unit of analysis). Finally, test completions (reading assessments, mathematics assessments) are also calculated on a base of the student questionnaire completers, rather than on the entire sample, and thus express a coverage rate. “Coverage” can also refer to the issue of missed target population units in the sampling frame (undercoverage) or duplicated or erroneously enumerated units (overcoverage).
Completed school administrator questionnaires provide 99.0 percent (weighted) coverage of all responding students. Completed library media center questionnaires provide 96.4 percent (weighted) coverage of all responding students. Of the 15,360 responding students, parent data (either by mailed questionnaire or by telephone interview) were received from 13,490 of their parents. This represents a weighted coverage rate of 87.4 percent.
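The coverage rates above are weighted, so they differ slightly from the raw counts. A minimal sketch of the calculation, assuming each responding student has an analysis weight and a flag for whether the contextual report exists (the function name and inputs are hypothetical):

```python
def coverage_rate(weights, has_report):
    """Weighted share of responding students with a contextual report."""
    covered = sum(w for w, flag in zip(weights, has_report) if flag)
    return covered / sum(weights)


# Unweighted parent coverage from the counts reported above.
unweighted_parent_coverage = 13490 / 15360
```

The unweighted parent coverage is about 87.8 percent, slightly above the weighted rate of 87.4 percent, which suggests that students without parent data carry somewhat larger weights on average.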
Nonresponse Error. Both unit nonresponse (nonparticipation in the survey by a sample member) and item nonresponse (missing value for a given questionnaire/test item) have been evaluated in ELS:2002.
Unit nonresponse. ELS:2002 has two levels of unit response (see table ELS-1): school response, defined as the school participating in the study by having a survey day on which the students took the test and completed the questionnaires; and student response, defined as a student completing at least a specified portion of the student questionnaire. The final overall school weighted response rate was 67.8 percent, and the final pool 1 weighted response rate was 71.1 percent. The final student weighted response rate was 87.3 percent. Because the school response rate was less than 70 percent in some domains and overall, analyses were conducted to determine if school estimates were significantly biased due to nonresponse.
Nonresponding schools (or their districts) were asked to complete a school characteristics questionnaire. The nonresponding school questionnaire contained a subset of questions from the school administrator questionnaire that was completed by the principals of participating schools. Of the 469 nonresponding eligible sample schools, a total of 437, or 93.2 percent, completed the special questionnaire.
The school and student nonresponse bias analyses, in conjunction with the weighting adjustments, were not successful in eliminating all bias. However, they reduced bias and eliminated significant bias for the variables known for most respondents and nonrespondents, which were considered to be some of the more important classification and analysis variables. The relative bias decreased considerably after weight adjustments, especially when it was large before nonresponse adjustment, and the relative bias usually remained small after weight adjustments when it was small before nonresponse adjustment.
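One common way to express the quantities discussed above is to compare the respondent-based estimate with the estimate from the full eligible sample, using a variable known for both respondents and nonrespondents. The sketch below assumes that definition; the function names are hypothetical, and the exact estimators used in the ELS:2002 analyses may differ.

```python
def weighted_mean(values, weights):
    """Weighted mean of values."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def nonresponse_bias(values, weights, responded):
    """Return (bias, relative bias) of the respondent mean.

    bias = respondent mean - full-sample mean
    relative bias = bias / full-sample mean
    """
    full_mean = weighted_mean(values, weights)
    resp_values = [v for v, r in zip(values, responded) if r]
    resp_weights = [w for w, r in zip(weights, responded) if r]
    bias = weighted_mean(resp_values, resp_weights) - full_mean
    return bias, bias / full_mean
```

To gauge the effect of a nonresponse adjustment, the same comparison can be repeated with the adjusted weights substituted for the respondents' original weights.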
Student-level nonresponse. For students, although the overall weighted response rate was approximately 87 percent, the response rate was below 85 percent for certain domains, so a student-level nonresponse bias analysis conditional on the school responding was also conducted. Some information on the characteristics of nonresponding students was available from student enrollment lists. On these lists, data were obtained on IEP status, race/ethnicity, and sex. These data were not provided by all schools (in particular, information on IEP status was often missing, and IEP information was typically relevant only for public schools). Consequently, only the school-supplied race/ethnicity and sex data, as well as the school-level data used in the school nonresponse bias analysis, were utilized in conducting the student-level nonresponse bias analysis.
For the student-level nonresponse bias analysis, the estimated bias decreased for every variable after weight adjustments were made. As a result, the number of significantly biased variables decreased from 42 before adjustment to zero after adjustment.
Nonresponse bias analyses were conducted for the postsecondary transcript study, where response was defined as having a postsecondary transcript. Details for the nonresponse adjustment significance testing can be found in the ELS PETS Data File Documentation (Ingels et al. 2015).
Item nonresponse. There were no parent or teacher questionnaire items with a response rate that fell below 85 percent. However, there were 78 such items in the student questionnaire, including composites. Item nonresponse was an issue for the student questionnaire because, in timed sessions, not all students reached the final items. The highest nonresponse was seen in the final item, which was answered by only 64.6 percent of respondents.
At the school level, 41 administrator items had a response rate that fell below 85 percent (ranging from a high of 84.7 percent to a low of 74.6 percent). No library media center questionnaire items fell below the 85 percent threshold, nor did any facility checklist items. While the school-level items can often be used as contextual data with the student as the basic unit of analysis, these items are also, with the school weight, generalizable at the school level. Therefore, for the school administrator questionnaire, nonresponse rates and nonresponse bias estimates have been produced at the school level. While item nonresponse in the student questionnaire reflects item position in the questionnaire and the inability of some students to reach the final items in a timed session, nonresponse in the school questionnaire must be explained by two other factors: first, the nature of particular items; second, the fact that some administrators completed an abbreviated version of the questionnaire (the high nonresponse items did not appear in the abbreviated instrument).
Measurement Error. In the field test, NCES evaluated measurement error in (1) student questionnaire data compared to parent questionnaire data; and (2) student cognitive test data. See Education Longitudinal Study: 2002 Field Test Report (Burns et al. 2003).
Parent-student convergence. Some questions were asked of both parents and students. This served two purposes: first, to assess the reliability of the information collected; second, to determine who was the better source for a given data element. These parallel items included number of siblings, use of a language other than English, and parent/child interactions. Additional items on parents’ occupation and education, asked in both the parent and student interviews, were also evaluated for their reliability.
Parent-student convergence was low to medium, depending on the item. For example, the convergence on number of siblings is low. Although both parents and students were asked how many siblings the 10th-grader had, the questions were asked quite differently. It is not clear whether the high rate of disagreement is due to parents incorrectly including the 10th-grader in their count of siblings, the inaccurate reporting of “blended” families, or the differences in how the questions were asked in the two interviews. The parent-student convergence on parents’ occupation and education was about 50 percent, very similar to that of the NELS:88 base-year interview.
Reliability of parent interview responses. In the field test, the temporal stability of a subset of items from the parent interview was evaluated through a reinterview administered to a randomly selected subsample of 147 respondents. The reinterview was designed to target items that were newly designed for the ELS:2002 interview or revised since their use in a prior NELS interview. Percent agreement and appropriate correlational analyses were used to estimate the response stability between the two interview administrations. The overall reliability of parent interview responses varied from very high to very low, depending on the item. For example, the overall reliability for items pertaining to family composition and race and ethnicity is high; the overall reliability for items pertaining to religious background, parents’ education, and educational expectations for the 10th-grader is only marginally acceptable.
Cognitive test data. The test questions were selected from previous assessments: NELS:88, NAEP, and PISA. Items were field tested 1 year prior to the 10th- and 12th-grade surveys, and some items were modified based on field-test results. Final forms were assembled based on psychometric characteristics and coverage of framework categories. The ELS:2002 assessments were designed to maximize the accuracy of measurement that could be achieved in a limited amount of testing time, while minimizing floor and ceiling effects, by matching sets of test questions to initial estimates of students’ achievement. In the base year, this was accomplished by means of a two-stage test. In 10th grade, all students received a short multiple-choice routing test, scored immediately by survey administrators who then assigned each student to a low-, middle-, or high-difficulty second-stage form, depending on the student’s number of correct answers in the routing test. In the 12th-grade administration, students were assigned to an appropriate test form based on their performance in 10th grade. Cut points for the 12th-grade low, middle, and high forms were calculated by pooling information from the field tests for 10th and 12th grades in 2001, the 12th-grade field test in 2003, and the 10th-grade national sample. Item and ability parameters were estimated on a common scale. Growth trajectories for longitudinal participants in the 2001 and 2003 field tests were calculated, and the resulting regression parameters were applied to the 10th-grade national sample.
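The two-stage assignment in the base year amounts to a simple threshold rule on the routing-test score. The sketch below illustrates the logic only; the function name is hypothetical and the cut points passed in are placeholders, not the values used in ELS:2002.

```python
def assign_second_stage_form(num_correct, low_cut, high_cut):
    """Route a student to a second-stage form from the routing-test score.

    Scores below low_cut go to the low form, scores from low_cut up to
    (but not including) high_cut go to the middle form, and the rest go
    to the high form. Cut points are hypothetical inputs.
    """
    if low_cut > high_cut:
        raise ValueError("low_cut must not exceed high_cut")
    if num_correct < low_cut:
        return "low"
    if num_correct < high_cut:
        return "middle"
    return "high"
```

Matching form difficulty to a first estimate of achievement in this way is what limits floor and ceiling effects: few students receive items that are far too hard or far too easy for them.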
The scores are based on IRT, which uses patterns of correct, incorrect, and omitted answers to obtain ability estimates that are comparable across different test forms. In estimating a student’s ability, IRT also accounts for each test question’s difficulty, discriminating ability, and a guessing factor.
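The three item characteristics named above (difficulty, discrimination, guessing) correspond to the parameters of a three-parameter logistic (3PL) item response function. The sketch below assumes that form; the parameter names `a`, `b`, and `c` follow the usual convention, and the scaling constant that some implementations apply to `a` is omitted.

```python
import math


def p_correct(theta, a, b, c):
    """Probability of a correct answer under a 3PL item response model.

    theta: student ability
    a: item discrimination
    b: item difficulty
    c: lower asymptote (guessing factor)
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))
```

When ability equals the item difficulty, the probability is halfway between the guessing level and 1; for very low ability it approaches the guessing level `c`.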
As part of an important historical series of studies that repeats a core of key items each decade, ELS:2002 offers the opportunity for the analysis of trends in areas of fundamental importance, such as patterns of coursetaking, rates of participation in extracurricular activities, academic performance, and changes in goals and aspirations.
Comparability with NLS:72, HS&B, and NELS:88. The ELS:2002 base-year and first follow-up surveys contained many data elements that were comparable to items from prior studies. Some items are only approximate matches, and for these, analysts should judge whether they are sufficiently comparable for the analysis at hand. In other cases, question stems and response options correspond exactly across questionnaires. These repeated items supply a basis for comparison with earlier sophomore cohorts (such as 1980 sophomores in HS&B and 1990 sophomores in NELS:88). With a freshened senior sample, the ELS:2002 first follow-up supports comparisons to 1972 (NLS:72), 1980 (HS&B), and 1992 (NELS:88). The first follow-up academic transcript component offers a further opportunity for cross-cohort comparisons with the high school transcript studies of HS&B, NELS:88, and NAEP.
Although the four studies have been designed to produce comparable results, they also have differences that may affect the comparability as well as the precision of estimates. Analysts should be aware of and take into account these several factors. In particular, there are differences in sample eligibility and sampling rates, in response rates, and in key classification variables, such as race and Hispanic ethnicity. Other differences (and possible threats to comparability) are imputation of missing data, differences in test content and reliability, differences in questionnaire content, potential mode effects in data collection, and possible questionnaire context and order effects.
Eligibility. Very similar definitions were used across the studies in deciding issues of school eligibility. Differences in student sampling eligibility, however, are more problematic. Although the target population is highly similar across the studies (all students who can validly be assessed or, at a minimum, meaningfully respond to the questionnaire), exclusion rules and their implementation have varied somewhat, and exclusion rates are known to differ, where they are known at all. For instance, a larger proportion of the student population was included in ELS:2002 (99 percent) than in NELS:88 (95 percent), which may affect cross-cohort estimates of change.
Sample design. Differences in sampling rates, sample sizes, and design effects across the studies also affect precision of estimation and comparability. Asian students, for example, were oversampled in NELS:88 and ELS:2002, but not in NLS:72 or HS&B, where their numbers were quite small. The base-year (1980) participating sample in HS&B numbered 30,030 sophomores. In contrast, 15,360 sophomores participated in the base year of ELS:2002. Cluster sizes within school were much larger for HS&B (on average, 30 sophomores per school) than for ELS:2002 (just over 20 sophomores per school); larger cluster sizes are better for school effects research, but carry a penalty in greater sample inefficiency. Mean design effect (a measure of sample efficiency) is also quite variable across the studies: for example, for the 10th grade, it was 2.9 for HS&B and 3.9 for NELS:88 (reflecting high subsampling after the 8th-grade base year), with the most favorable design effect, 2.4, for the ELS:2002 base year. Other possible sources of difference between the cohorts that may impair change measurement are different levels of sample attrition over time and changes in the population of nonrespondents.
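The efficiency penalty from larger clusters can be illustrated with the widely used approximation deff = 1 + (m - 1) * rho, where m is the average cluster size and rho is the intraclass correlation. The value of rho below is purely hypothetical, and the approximation covers clustering only, not stratification or unequal weighting, so it will not reproduce the published design effects.

```python
def clustering_design_effect(cluster_size, rho):
    """Approximate design effect from clustering: 1 + (m - 1) * rho."""
    return 1.0 + (cluster_size - 1) * rho


# Hypothetical intraclass correlation, for illustration only.
RHO = 0.05
deff_30_per_school = clustering_design_effect(30, RHO)
deff_20_per_school = clustering_design_effect(20, RHO)
```

Under this assumed rho, moving from about 30 to about 20 students per school lowers the clustering component of the design effect from about 2.45 to about 1.95.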
Imputation of missing data. One difference between the SES variable in ELS:2002 and in prior studies arises from the use of imputation in ELS:2002. Because all the constituents of SES are subject to imputation, it has been possible to create an SES composite with no missing data for ELS:2002. For the HS&B sophomores, SES was missing for around 9 percent of the participants, and for NELS:88 (in 1990) for just under 10 percent.
Score equating. ELS:2002 scores are reported on scales that permit comparisons with reading and mathematics data for NELS:88 10th-graders. Equating the ELS:2002 scale scores to the NELS:88 scale scores was completed through common-item, or anchor, equating. The ELS:2002 and NELS:88 tests shared 30 reading and 49 math items. These common items provided the link that made it possible to obtain ELS:2002 student ability estimates on the NELS:88 ability scale. Parameters for the common items were fixed at their NELS:88 values, resulting in parameter estimates for the noncommon items that were consistent with the NELS scale.
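The idea behind fixing the common-item parameters can be sketched as follows: if anchor items keep their earlier parameter values, an ability estimated from responses to those items is expressed on the earlier scale. The example assumes a three-parameter logistic model and uses a simple grid-search maximum likelihood estimate; the item parameters are invented, and operational scoring relies on specialized software and more elaborate estimation.

```python
import math


def p_correct(theta, a, b, c):
    """3PL probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


def log_likelihood(theta, responses, items):
    """Log-likelihood of a 0/1 response pattern given fixed item parameters."""
    total = 0.0
    for x, (a, b, c) in zip(responses, items):
        p = p_correct(theta, a, b, c)
        total += math.log(p) if x else math.log(1.0 - p)
    return total


def estimate_ability(responses, items, grid=None):
    """Grid-search maximum likelihood ability estimate on [-4, 4]."""
    if grid is None:
        grid = [i / 100 for i in range(-400, 401)]
    return max(grid, key=lambda t: log_likelihood(t, responses, items))


# Hypothetical anchor items as (a, b, c), held fixed at "old scale" values.
ANCHOR_ITEMS = [(1.0, -1.0, 0.2), (1.0, 0.0, 0.2), (1.0, 1.0, 0.2)]
```

Because the anchor parameters are not re-estimated, two students with the same response pattern on the anchor items receive the same ability estimate regardless of which study they took part in.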
Transcript studies. ELS:2002, NELS:88, HS&B, and NAEP were designed to support cross-cohort comparisons. ELS:2002, NAEP, and NELS:88, however, provide summary data in Carnegie units, whereas HS&B provides course totals. In addition, unlike previous NCES transcript studies, which collected transcripts from the last school attended by the sample member, the ELS:2002 transcript study collected transcripts from all base-year schools and the last school attended by sample members who transferred out of their base-year school.
Other factors should be considered in assessing data compatibility. There are some mode-of-administration differences across the studies (for example, ELS:2002 collected 2006 data via self-administration on the Web, as well as by CATI and CAPI; in contrast, NLS:72 and HS&B used paper-and-pencil mail surveys). Order and context effects are also possible (questions have been added, dropped, and reordered over time).
Comparability with PISA. A feature of ELS:2002 that expands its power beyond that of its predecessors is that it can be used to support international comparisons. Items from PISA were included in the ELS:2002 achievement tests. PISA, which is administered by the Organization for Economic Cooperation and Development, is an internationally standardized assessment, jointly developed by the 32 participating countries (including the United States) and administered to 15-year-olds in groups in their schools. ELS:2002 and PISA test instruments, scoring methods, and populations, however, differ in several respects that impact the equating procedures and interpretation of linked scores.
Table ELS-1. Unit-level and overall weighted response rates for selected ELS:2002 student populations, by data collection wave: 2002 through 2012
|Unit-level weighted response rate|
|Overall weighted response rate|
† Not applicable.
SOURCE: ELS methodology reports; available at https://nces.ed.gov/pubsearch/getpubcats.asp?sid=107.
1 The sample was randomly divided by stratum into two release pools and a reserve pool. The two release pools were the basic sample, with the schools in the second pool being released randomly within stratum in waves as needed to achieve the sample size goal. Also, the reserve pool was released selectively in waves by simple random sampling within stratum for strata with low yield and/or response rates, when necessary. Each time schools were released from the second release pool or the reserve sample pool, sampling rates were adjusted to account for the nonresponding schools and the new schools.